Running 50 Claude Code agents in parallel: what breaks first

neul-labs · May 12, 2026

Tags: operations, scale

The pitch that landed us on this project was simple and a little ridiculous: what if you could point fifty Claude Code sessions at the same repo and let them work in parallel? The instinct most engineers have when they hear that is two-fold. First, “no way that works.” Second, “actually, what would break first?”

We ran the experiment. This post is what we learned, in the order we learned it, and how each failure mode shaped the harness that became brat.

The naive setup

Our first try was about as crude as it sounds: a bash script that ran claude in a loop, fifty times, on fifty different feature branches, with each invocation handed a different task description from a queue file. The script ended with a wait, and tmux was an optional extra for visibility.

The script ran. Not all fifty agents finished. We expected that. What surprised us was the failure mode.

Failure 1: the working tree

The first thing that broke wasn’t anything exotic. Two Claude sessions were sharing a single working tree, each had checked out its own branch there, and both were editing files at the same time. Git does not arbitrate between writers in one working tree: whichever agent flushed last is what ended up on disk. We lost work in the first ten minutes.

The fix at the bash layer is to give each agent its own worktree. That’s correct, and it’s basically what brat does under the hood — each session (“polecat”, in our vocabulary) gets its own actor directory, and any coordination between them goes through the event log rather than the filesystem. But it forces a design choice we hadn’t made yet: where does the shared state actually live?
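A minimal sketch of the worktree-per-agent fix, for illustration only and not brat’s implementation. The helper name worktree_commands, the agents/ directory layout, and the agent/<task> branch naming are all assumptions; the only real CLI surface used is git worktree add -b <branch> <path>. The function only builds the commands, so it is safe to run without a repository.

```python
from pathlib import Path


def worktree_commands(repo: Path, task_ids: list[str]) -> list[list[str]]:
    """Build one `git worktree add` command per task (hypothetical layout)."""
    commands = []
    for task_id in task_ids:
        branch = f"agent/{task_id}"               # assumed branch naming scheme
        path = repo.parent / "agents" / task_id   # assumed directory layout
        # `git worktree add -b <branch> <path>` creates a new branch and checks
        # it out in its own directory, so no two agents share a working tree.
        commands.append(
            ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]
        )
    return commands


if __name__ == "__main__":
    for cmd in worktree_commands(Path("/srv/repo"), ["task-01", "task-02"]):
        print(" ".join(cmd))
```

Each command could be passed to subprocess.run from the launcher before the agent starts. The point of the design is that every agent gets a distinct path, so a checkout by one can never move files under another.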

Failure 2: there is no shared state, only scrollback

A few rounds in, we realized we couldn’t answer a basic question: which of the fifty agents had actually finished? Some had exited cleanly. Some had been killed by our timeout. Some had hung. Some had reported “done” but committed nothing. Some had committed something but not pushed.

The state of the system was in fifty separate scrollbacks. Three of them had already rolled off the buffer. Two of the tmux panes had been accidentally closed when someone resized a window. There was no source of truth.

This is the moment we wrote the first version of what would become the WAL (write-ahead log). We needed a single append-only log where every state transition (“task started”, “session spawned”, “exit code 0”, “files modified”, “merge attempted”) was a durable event, stamped with a timestamp and an actor ID. The harness then replays the log to know what is real.

In brat, that’s the Grite substrate. Events live in refs/grite/wal, which is just a chain of git refs. There is no database to install and no service to keep running. Each event is immutable, and the materialized view that the CLI reads from is always derivable by replaying events forward.
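The replay idea can be sketched independently of git. The example below is an assumption-laden toy, not the on-disk format used in refs/grite/wal: the Event fields and the event kind strings are hypothetical, and the log is an in-memory list rather than a chain of refs. It shows only the core property, that the current view is a pure function of the events replayed in order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an event cannot be mutated once recorded
class Event:
    ts: float    # timestamp of the transition
    actor: str   # which agent emitted it
    task: str    # which task it concerns
    kind: str    # hypothetical names, e.g. "task_started", "exit_0"


def replay(log: list[Event]) -> dict[str, str]:
    """Fold the log forward into a task -> latest-event-kind view."""
    view: dict[str, str] = {}
    for event in log:  # append order is the source of truth
        view[event.task] = event.kind
    return view


log = [
    Event(1.0, "polecat-1", "t1", "task_started"),
    Event(2.0, "polecat-2", "t2", "task_started"),
    Event(3.0, "polecat-1", "t1", "exit_0"),
]
print(replay(log))  # {'t1': 'exit_0', 't2': 'task_started'}
```

Because the view is rebuilt from the log, losing a scrollback or a tmux pane loses nothing: rerunning replay over the same events gives the same answer.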

Failure 3: the merge fight

Once we had agents finishing and committing, the next thing that broke was integration. Twelve agents had landed their work on branches; six of those branches conflicted with each other; CI didn’t know which order to run them in; and one bad agent had landed changes that broke the build for everyone behind it.

We needed a merge queue. Not in the GitHub-product sense — in the “this is the next thing to land, and we are running CI against it now” sense. The Refinery role in brat owns this. Each task moves through queued → running → merged states; the Refinery applies a configurable policy (rebase, squash, or merge) and runs your existing CI per task before flipping the state. The honest version: it doesn’t fix conflicts, it just orders the work so conflicts surface predictably and humans resolve them.
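As a rough illustration of what “ordered rather than a race” means, here is a toy queue that lands one task at a time. It is a sketch under stated assumptions, not the Refinery’s code: the function name drain_queue is hypothetical, ci_passes stands in for “apply the merge policy, then run CI”, and treating a CI failure as a blocked state is an assumption made for the example.

```python
from collections import deque
from typing import Callable


def drain_queue(tasks: list[str], ci_passes: Callable[[str], bool]) -> dict[str, str]:
    """Land tasks strictly in order; only CI-green tasks reach 'merged'."""
    state = {task: "queued" for task in tasks}
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        state[task] = "running"
        # A failing task is parked instead of merged (assumed behavior), so it
        # cannot break the build for the tasks queued behind it.
        state[task] = "merged" if ci_passes(task) else "blocked"
    return state


print(drain_queue(["a", "b", "c"], lambda task: task != "b"))
# {'a': 'merged', 'b': 'blocked', 'c': 'merged'}
```

Nothing here resolves a conflict. The queue only guarantees that exactly one candidate is being tested at a time, which is what makes a failure attributable to a single task.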

Failure 4: zombie locks

Around agent 20 we hit our first deadlock. A polecat had grabbed an exclusive lock on a particular file (we use lock leases for resource coordination — files, ports, named external services), the process had then been killed by a parent timeout, and the lock was still held.

The fix is something every distributed-systems person reading this is already typing: TTL-based leases. Locks expire if their holder doesn’t renew them. The Deacon — a background janitor role in the harness — sweeps for stale leases and releases them. After this, “what happens if an agent crashes mid-task” went from a debug session to a non-event.
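A minimal sketch of TTL leases with a sweeper, assuming a single in-process table. The class name LeaseTable and its method names are hypothetical and do not reflect brat’s actual API; the clock is injectable so that expiry can be demonstrated without sleeping.

```python
import time
from typing import Callable


class LeaseTable:
    """In-memory lock leases that expire unless the holder renews them."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock
        # resource -> (holder, expires_at)
        self.leases: dict[str, tuple[str, float]] = {}

    def acquire(self, resource: str, holder: str) -> bool:
        now = self.clock()
        current = self.leases.get(resource)
        if current is not None and current[0] != holder and current[1] > now:
            return False  # a live lease is held by someone else
        self.leases[resource] = (holder, now + self.ttl)
        return True

    def renew(self, resource: str, holder: str) -> bool:
        now = self.clock()
        current = self.leases.get(resource)
        if current is None or current[0] != holder or current[1] <= now:
            return False  # nothing to renew, or the lease already lapsed
        self.leases[resource] = (holder, now + self.ttl)
        return True

    def sweep(self) -> list[str]:
        """Janitor pass: drop every expired lease and report what was freed."""
        now = self.clock()
        stale = [res for res, (_, expires) in self.leases.items() if expires <= now]
        for res in stale:
            del self.leases[res]
        return stale
```

A crashed holder simply stops calling renew. After one TTL its lease is expired, the next sweep removes it, and another agent can acquire the resource without anyone debugging a deadlock.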

Failure 5: the API rate limit cliff

Even with the harness sorted out, we hit one wall we couldn’t engineer around: the rate limit on the upstream agent. Claude Code can run forty concurrent sessions on our account before things slow down; somewhere around fifty, requests start getting throttled. This was important to learn early because it shaped what brat is and isn’t responsible for.

Engine reliability — API limits, vendor outages, auth flow — is explicitly outside the harness. The README says this. The home page says this. We say it again here because it’s the first thing teams want to ask us to fix.

The right design response is bounded timeouts on every engine call (so a slow upstream doesn’t pin a polecat forever) and backoff with jitter at the witness layer. brat ships both, but the actual rate-limit budget is a property of your account, not of the orchestrator.
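The two pieces fit together roughly as follows. This is a sketch, not brat’s code: the function name call_with_backoff, its parameters, and the default values are assumptions, and the “full jitter” variant of exponential backoff is one common choice rather than something the harness is known to use. The engine call is passed in as a callable that receives its per-attempt timeout and raises TimeoutError when it is exceeded.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    call: Callable[[float], T],        # receives the per-attempt timeout (seconds)
    timeout_s: float = 30.0,           # assumed default, bounds every attempt
    max_attempts: int = 5,
    base_s: float = 1.0,
    cap_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Retry a bounded call, sleeping a jittered, exponentially growing delay."""
    for attempt in range(max_attempts):
        try:
            return call(timeout_s)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Full jitter: uniform in [0, min(cap, base * 2**attempt)), so fifty
            # throttled agents do not all retry at the same instant.
            sleep(rng() * min(cap_s, base_s * 2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

In practice the callable might wrap subprocess.run(..., timeout=timeout_s) and translate subprocess.TimeoutExpired into TimeoutError. None of this raises the rate-limit budget; it only keeps a slow or throttled upstream from pinning an agent indefinitely.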

Failure 6: visibility

With fifty agents running, you can’t read fifty scrollbacks. We added a brat status command that shows the convoy and task tree, with state badges for each one. That’s the CLI surface. There’s also a web dashboard (bratd + a small frontend) that gives you live cards for queued / running / blocked / merged tasks and lets you tail any individual session.

The dashboard was originally an afterthought. Once we had fifty agents running it became the most-watched window in the office. The reason is mundane: you need to see what’s happening, in real time, in one place. The CLI tells you the truth; the dashboard makes it scannable.

What we kept after the experiment

The experiment didn’t actually run fifty Claude Code sessions in steady state — we capped at thirty-two and that already saturated our upstream allowance. But the failures we hit getting there are exactly what brat is now built to absorb:

- Actor isolation, so the filesystem is never the contention surface
- Append-only event log, so state survives crashes and the team can always answer “what happened”
- Lock leases with TTL, so dead agents don’t hold resources
- Merge queue, so integration is ordered rather than a race
- Bounded timeouts, so slow upstreams don’t poison the harness
- Honest scope, so we don’t pretend to fix what the engines themselves are responsible for

If you take one thing from this post, take this: when you scale agent counts, you don’t discover new model failures. You discover the old distributed-systems failures, in the order they always show up. The harness is what stops them from being your problem.

Try it yourself

The demo script in the brat repo runs the same shape of experiment at much smaller scale:

git clone https://github.com/neul-labs/brat
cd brat
cargo build --release
./scripts/mayor-demo.sh --with-ui

It boots a sample Python project with intentional bugs, starts the Mayor, has it create a convoy of fix tasks, and runs the agents on them. The UI is at localhost:5173 if you started it with --with-ui. It’s the same architecture that survived our fifty-agent experiment, scaled down to something you can run on a laptop.