Designing crash-safe state for ephemeral agent processes
There is a particular category of bug that every team using AI coding agents in anger has run into: the agent appears to have finished a task, the orchestrator thinks the task is in progress, and the filesystem has a half-applied change with no commit. Restart the orchestrator and you can’t tell whether the task is done, half-done, or never started. The agent is gone. Its scrollback rolled off. The state is broken because nothing wrote it down in a durable place.
This post is about how brat avoids that. It’s also about the design choice that lets us avoid it: putting an append-only event log between the harness and reality.
The model: orchestration is bookkeeping
If you strip everything else away, what a multi-agent harness does is bookkeeping. It records that a task was created. It records that an agent was spawned to work on the task. It records what the agent produced. It records whether that output was accepted or rejected by the merge queue. The actual coding — the thing the agent does — is somewhere else, in some external binary, on its own clock, vulnerable to all of the failures any external process is vulnerable to.
The bookkeeper has one job: never lose what it has been told. Everything else is downstream.
We made an early decision that the bookkeeper would not be a database. Databases need a process. Processes need supervision. We were already going to have a supervisor problem (that’s the whole project), and we did not want to bootstrap from “you need this other thing running, too.” So we built on Grite instead.
Grite in one paragraph
Grite is an append-only log that lives inside your repo. Specifically, it lives in refs/grite/wal: a chain of git refs, each one pointing at an immutable blob that encodes a single event. To write an event, you create a blob and update a ref. To read history, you walk the chain backward. There is no server, no daemon, no fsync to argue with. The durability guarantees are git’s durability guarantees, which most of your tooling already understands.
brat sits on top of Grite. Every state change in the harness becomes a Grite event: tasks created, sessions started, locks acquired, files modified, exit codes recorded, merges attempted. The log is the source of truth; everything else is derived.
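To make the "create a blob, update a ref, walk backward" shape concrete, here is a minimal in-memory sketch. It is not the real on-disk format: the ChainedLog class, the JSON encoding, and the SHA-256 object ids are all assumptions made for illustration, and a plain dict stands in for git's object store.

```python
import hashlib
import json


class ChainedLog:
    """Hypothetical in-memory stand-in for a content-addressed, append-only log."""

    def __init__(self):
        self.blobs = {}    # object id -> immutable bytes
        self.head = None   # the only mutable pointer (plays the role of the ref)

    def append(self, kind, payload):
        # Each record names its parent, so history is a backward-linked chain.
        record = {"parent": self.head, "kind": kind, "payload": payload}
        data = json.dumps(record, sort_keys=True).encode("utf-8")
        oid = hashlib.sha256(data).hexdigest()
        self.blobs[oid] = data   # step 1: store the blob
        self.head = oid          # step 2: move the pointer
        return oid

    def history(self):
        # Walk from the head back to the first event, then return oldest-first.
        events, oid = [], self.head
        while oid is not None:
            record = json.loads(self.blobs[oid])
            events.append((oid, record["kind"], record["payload"]))
            oid = record["parent"]
        events.reverse()
        return events


log = ChainedLog()
log.append("session.requested", {"session": "s1"})
log.append("session.started", {"session": "s1"})
print([kind for _, kind, _ in log.history()])
```

The point of the two-step write is that a crash after step 1 but before step 2 leaves an unreferenced blob that no reader ever sees, so the event simply did not happen.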
The materialized view
You don’t actually want to scan the entire WAL every time you ask “what tasks are queued?” — that scales the wrong way. So brat keeps a materialized view in a local sled database, one per actor. Each actor reads its own view, never anyone else’s.
The trick — and this is what makes the crash story work — is that the view is purely derivable. If it gets corrupted, if it goes missing, if it’s from an older version of the schema, you can throw it away and rebuild it from the WAL. No information is in the view that isn’t also in the log. The view is a cache for performance, not a system of record.
This means the rebuild routine is a real path that runs all the time, not a recovery procedure that runs once a year. Every brat command starts by reconciling its view against the WAL head. If a previous session crashed mid-update, the reconcile catches up. If it crashed pre-update, the reconcile sees no events and moves on. Either way, the system converges.
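A rough sketch of that reconcile step, under simplifying assumptions: the WAL is modeled as a plain Python list, the view as a dict with a cursor, and the function and field names (new_view, reconcile, "cursor", "sessions") are hypothetical rather than brat's actual API. What it is meant to show is that catching up incrementally and rebuilding from scratch run the same code and land on the same view.

```python
# Hypothetical mapping from event kind to the status the view records.
STATUS_BY_KIND = {
    "session.requested": "requested",
    "session.started": "started",
    "session.completed": "completed",
}


def new_view():
    # An empty view has seen nothing: cursor 0, no sessions.
    return {"cursor": 0, "sessions": {}}


def apply_event(view, event):
    # Fold one event into the view; unknown kinds are ignored.
    status = STATUS_BY_KIND.get(event["kind"])
    if status is not None:
        view["sessions"][event["session"]] = status


def reconcile(view, wal):
    """Apply every event the view has not seen yet; return how many were applied."""
    start = view["cursor"]
    for event in wal[start:]:
        apply_event(view, event)
        view["cursor"] += 1
    return view["cursor"] - start


wal = [
    {"kind": "session.requested", "session": "s1"},
    {"kind": "session.started", "session": "s1"},
]
view = new_view()
reconcile(view, wal)                      # first run: catches up on both events
wal.append({"kind": "session.completed", "session": "s1"})
reconcile(view, wal)                      # later run: applies only the new event

rebuilt = new_view()
reconcile(rebuilt, wal)                   # throw the view away and rebuild
assert rebuilt == view                    # both paths converge
```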
What “crash” actually means here
When we say crash-safe we don’t just mean “the process died unexpectedly.” We mean any of:
- Process killed by a parent timeout
- Process killed by OS OOM
- Disk full, write failed midway
- Network died during a remote call
- User hit ctrl-C
- Laptop went to sleep mid-task
- Power loss
- Bug in the harness itself
All of these are routine. None of them should require a human to “un-stick” the state. The contract is: if an event made it into the WAL, it’s real and will be replayed; if it didn’t, it’s as if it never happened. There is no in-between.
To make this concrete: when the Witness spawns a polecat (an agent session), the events go in this order:
1. session.requested: written before the binary is invoked
2. session.started: written after the OS reports a PID
3. session.heartbeat: written periodically with the current cursor
4. session.completed (with exit code): written when the process exits
If you crash between 1 and 2, the next replay sees a requested-but-not-started session and the Witness reissues. If you crash between 2 and 4, the next replay sees a started-but-not-completed session and the Deacon waits for its lock lease to expire before declaring it dead. That includes a crash between 3 and 4: the heartbeats go stale, the lease expires, and the session is reaped.
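The replay rules above fit in a few lines. This is only a sketch: next_action and the action labels it returns ("reissue", "wait", "reap", "done", "ignore") are hypothetical names, and the lease state is passed in as a boolean rather than derived from the log.

```python
def next_action(kinds, lease_expired):
    """Decide what replay should do for one session.

    kinds: the event kinds recorded for that session, in log order.
    lease_expired: whether the session's lock lease has run out.
    """
    seen = set(kinds)
    if "session.completed" in seen:
        return "done"        # exit code is recorded, nothing to recover
    if "session.started" in seen:
        # Started but never completed: only act once the lease has expired.
        return "reap" if lease_expired else "wait"
    if "session.requested" in seen:
        return "reissue"     # the binary was never confirmed running
    return "ignore"          # no events at all: as if it never happened


print(next_action(["session.requested"], lease_expired=False))
print(next_action(["session.requested", "session.started"], lease_expired=False))
print(next_action(
    ["session.requested", "session.started", "session.heartbeat"],
    lease_expired=True,
))
```

The three calls print reissue, wait, and reap, one for each crash window described above.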
There is no race condition because there is no shared mutable state. Every transition is an event. Events are immutable. The view rebuilds.
Locks without zombies
Speaking of leases — the other place this design pays off is resource locking. brat uses TTL-based leases for all coordination: file locks, named-resource locks, port reservations. The lease holder writes a lock.acquired event with a TTL; it must write lock.renewed before the TTL expires; if it fails to, the Deacon writes a lock.expired event and the resource is free for whoever wants it next.
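A sweep over the log might look roughly like the following. The event fields ("resource", "at", "ttl"), the sweep function, and the use of plain numbers for timestamps are assumptions for illustration; only the three event names come from the description above.

```python
def sweep(events, now):
    """Return the lock.expired events a sweeper would append at time `now`."""
    deadlines = {}  # resource -> time at which its current lease runs out
    for event in events:
        resource = event["resource"]
        if event["kind"] == "lock.acquired":
            deadlines[resource] = event["at"] + event["ttl"]
        elif event["kind"] == "lock.renewed" and resource in deadlines:
            # A renewal only counts while the lease is still held.
            deadlines[resource] = event["at"] + event["ttl"]
        elif event["kind"] == "lock.expired":
            deadlines.pop(resource, None)
    return [
        {"kind": "lock.expired", "resource": resource, "at": now}
        for resource, deadline in sorted(deadlines.items())
        if deadline <= now
    ]


events = [
    {"kind": "lock.acquired", "resource": "a", "at": 0, "ttl": 10},
    {"kind": "lock.acquired", "resource": "b", "at": 0, "ttl": 10},
    {"kind": "lock.renewed", "resource": "b", "at": 8, "ttl": 10},
]
expired = sweep(events, now=12)   # "a" was never renewed, "b" is good until 18
events.extend(expired)            # the expiry itself becomes part of the log
print([e["resource"] for e in expired])
```

Because the expiry is appended as an event, running the sweep again at the same time finds nothing left to do, which is what keeps a dead holder from becoming a zombie.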
This is the only design we know of where “process died holding a lock” is not a crash. It’s an entry in the log. Sweep, expire, move on.
What we don’t try to do
The honest section.
Crash-safety in the harness does not extend to the work the agents themselves produce. If Claude Code is halfway through editing a file when it gets killed, the file is in whatever state it was in. brat will know the session is dead, will reap the lease, will mark the task as failed, and will leave the file as-is for you (or another agent) to look at. We do not roll back partial edits because partial edits are a property of the engine, not the harness.
Similarly, we do not retry tasks automatically just because they failed. Sometimes a failure means “the model was confused and would have been confused again.” We surface the failure in the dashboard and the CLI, and we let you (or the Mayor) decide whether to re-queue.
These are deliberate non-features. They keep the boundary clean: the harness owns coordination state, the agent owns work output, and the human owns judgment.
The payoff
The payoff for all of this is a property we did not have to think about much: when an engineer joins the project, they don’t ask “how do I recover from a crash?” They ask “what’s the demo command?” Recovery is not a workflow because crashes are not exceptional. They’re events.
If you want to see what the WAL looks like, run brat init in a sandbox repo and then git log refs/grite/wal --format='%H %s'. Every event is right there. You can replay them by reading the blobs, but you almost never need to — that’s what brat is for.
Try it
cd your-project
grite init
brat init
brat mayor start
brat mayor ask "list any todo comments and create tasks for them"
brat status
If you ctrl-C any of those commands, run them again. The system should pick up where it left off. If it doesn’t, that’s a bug — open an issue. Crash-safety is not a marketing claim. It’s the design point the whole project is built around.