Your whole multi-agent run, in a bucket you own

Community Article Published June 10, 2026

I run several coding agents most days, but rarely as a team on one task. It's parallel work: Claude Code in this terminal, Codex in that one, different repos, different jobs. What that actually produces is sprawl. Each agent leaves its transcript in its own format in its own directory, and whatever it builds lands wherever it lands. On the occasions two runs do touch the same work (a second agent picking up a step the first already started, or me handing yesterday's session to today's), nothing arbitrates that at all. And when a run goes sideways, I can't reconstruct which agent knew what, which artifact came from where, or why the handoff failed.

So I built tracecraft (pip install tracecraft-ai, MIT): one bucket you own where all of it lands. Session mirrors of every agent's transcript, artifacts for anything they build — code, reports, tarballs, plain backups — and, for the moments runs do overlap, shared memory, per-agent mailboxes, atomic task claims, and handoffs. Coordination events are human-readable JSON and transcripts are JSONL, so the whole thing stays greppable. Point it at a local MinIO while you develop and at a Hugging Face bucket when you want it on the Hub (anything S3-compatible works). No server, no database, no daemon: every command is a stateless read or write against the bucket, so there's nothing to deploy and nothing to trust except your own storage.

And it's a CLI on purpose, because the user isn't only you — it's your agents. Tell your coding agent to install tracecraft and it can do all of this itself: claim a step before starting it, message the other agents, mirror its own transcript, and back up whatever it just built. No SDK integration, no framework adoption — an agent that can run shell commands already knows how to use it.

There's real momentum right now behind the idea that agents should store their traces on the Hub — so you can keep a history of them, analyze them, and post-train better models and harnesses on them. I think that's exactly right. This post is about what happens when you take it one step further: a bucket can hold more than one agent's transcript. It can hold the whole run.

N agents, one bucket

Here's a real run I drove through the CLI — staged on purpose, three agents on one project compressed into sixteen seconds, so every primitive shows up at least once. Your agents won't usually collaborate this tightly; the point is that when any two of them do cross paths, these are the mechanics that catch it. A project called replaydemo with agents designer, developer, and reviewer, against a MinIO (S3-compatible) bucket. Eighteen events, as a narrative:

The designer claims the design step. Then the developer tries to claim the same step — and is atomically rejected: already claimed by designer. Exactly one agent wins. Under the hood that claim is an S3 If-None-Match conditional PUT, so the race is decided by the object store, not by a coordinator I'd have to run and trust.
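The claim-then-reject sequence above can be sketched in a few lines. This is a minimal illustration, not tracecraft's code: InMemoryBucket is a hypothetical stand-in whose lock plays the role the object store plays for a PUT carrying `If-None-Match: *`, and the fields written into claim.json are assumptions.

```python
import json
import threading


class InMemoryBucket:
    """Hypothetical stand-in for an object store that supports put-if-absent.

    On S3-compatible storage the same effect comes from a PUT carrying the
    `If-None-Match: *` header; here a lock plays the object store's role.
    """

    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key, body):
        """Write body at key only if nothing exists there. True means we won."""
        with self._lock:
            if key in self._objects:
                return False  # a real store would answer 412 Precondition Failed
            self._objects[key] = body
            return True

    def get(self, key):
        return self._objects[key]


def claim_step(bucket, project, step, agent):
    """Try to claim a step. Returns (won, owner_recorded_in_the_store)."""
    key = f"{project}/steps/{step}/claim.json"
    # The claim body below is an assumed shape, not tracecraft's real schema.
    body = json.dumps({"step": step, "agent": agent}).encode()
    won = bucket.put_if_absent(key, body)
    owner = json.loads(bucket.get(key))["agent"]
    return won, owner


bucket = InMemoryBucket()
first = claim_step(bucket, "replaydemo", "design", "designer")
second = claim_step(bucket, "replaydemo", "design", "developer")
print(first)   # (True, 'designer')
print(second)  # (False, 'designer')
```

The loser learns who holds the claim by reading the stored object back, so the store stays the single source of truth about the owner.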

The designer finishes and hands off to the developer. The developer does the work: uploads an artifact (tracecraft artifact upload takes any file — the same mechanism backs up whatever the run produced), broadcasts a message to the others, and hands off to the reviewer. The reviewer looks at it, finds a problem, and marks the review blocked:

tracecraft complete review --blocked --note "missing auth check, needs rework"

That writes a real status.json with "status": "blocked" and a handoff record with "state": "blocked" and "next_agent": null — so the run ends on a blocked handoff rather than pretending everything was fine.

Every one of those eighteen events is plain JSON in the bucket. The claim, the rejection, both handoffs, the broadcast, the artifact, the block. Here's the whole run laid out on a timeline — rendered straight from the bucket's JSON by a little replay viewer I'm building on top of it (the viewer isn't released yet; the data it reads is just the files above):

The full replaydemo run: 3 agents, 18 events — claims, a memory write, messages, a broadcast, an artifact, and a blocked handoff in red

And the trace isn't all there is: the same bucket also carries the run's working state. There's a shared key-value memory where the agents keep the current goal and decisions (tracecraft memory set, the design.contract write in the run above), and the per-agent mailboxes they talk through (tracecraft send, tracecraft inbox). The coordination and its record are literally the same bytes. On the bucket, the whole run looks like this:

replaydemo/
  agents/designer.json                      ← registration + heartbeat
  memory/design.contract.json               ← shared state: the goal, the API contract
  messages/developer/<ts>_designer_<id>.json   ← direct mail
  messages/_broadcast/<ts>_developer_<id>.json ← broadcast to all
  steps/design/claim.json                   ← the atomic claim
  steps/design/status.json                  ← pending / in_progress / complete / blocked
  steps/review/handoff.json                 ← incl. the blocked one
  artifacts/build/patch-auth.diff           ← anything the agents produce
  sessions/claude-code/<session-id>/meta.json
  sessions/claude-code/<session-id>/part-00000.jsonl   ← the mirrored transcript

No magic anywhere: ls, grep, and jq are a complete observability stack for it.
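The same kind of query is easy to script. A minimal sketch, assuming a local copy of the project prefix laid out as in the tree above and a status.json that carries a "status" field as described earlier; the function names and the sample contents are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def blocked_steps(project_root):
    """Return the names of steps whose status.json says "blocked", sorted."""
    found = []
    for status_file in Path(project_root).glob("steps/*/status.json"):
        record = json.loads(status_file.read_text())
        if record.get("status") == "blocked":
            found.append(status_file.parent.name)
    return sorted(found)


def build_sample(root):
    """Create a tiny local copy of the layout (the contents are invented)."""
    for step, status in [("design", "complete"), ("review", "blocked")]:
        step_dir = Path(root) / "steps" / step
        step_dir.mkdir(parents=True)
        (step_dir / "status.json").write_text(json.dumps({"status": status}))


with tempfile.TemporaryDirectory() as tmp:
    build_sample(tmp)
    result = blocked_steps(tmp)

print(result)  # ['review']
```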

Mirroring sessions to Hugging Face

The single-agent transcript matters too, and since the run's coordination already lives in a bucket, the transcripts belong next to it. You point tracecraft at a coding-agent session and it mirrors it in:

pip install 'tracecraft-ai[huggingface]'
tracecraft init --backend hf --bucket arrmlet/tracecraft-test --project hftest --agent tester
tracecraft session mirror --harness claude-code

The HF backend uses HfFileSystem under the hood, so --bucket is just an HF bucket handle (user/name). And one detail I care about, because "in private" tends to get said almost in passing: as of 0.2.2, init creates the bucket for you if it doesn't exist — private by default. Private shouldn't be a setting you have to remember when the thing you're uploading is your agents' reasoning, so it's the default, with an explicit --public opt-out. init also prints the bucket's actual visibility read back from the Hub, and if the bucket already exists as public, it warns you loudly instead of quietly mirroring your transcripts into the open.

When I ran the mirror against a real Claude Code session, this is the line it printed:

uploaded part-00000-77ea1f78  source=12,138B  upload=12,138B  redactions=none

And these are the keys that showed up on the Hub:

hftest/agents/tester.json
hftest/sessions/claude-code/5efa6a34-.../meta.json
hftest/sessions/claude-code/5efa6a34-.../part-00000-77ea1f78.jsonl

Reading it back with tracecraft session list shows 12,138 bytes in 1 part: same bytes in as out, which is the whole point. Your transcript, in your bucket, unchanged. Here it is in the Hub's bucket browser, with meta.json plus the mirrored part at 12.1 kB:

The mirrored Claude Code session in the HF bucket browser: meta.json and part-00000-77ea1f78.jsonl

That bucket is intentionally public so you can poke at the real keys yourself: huggingface.co/buckets/arrmlet/tracecraft-test — agent registrations, mirrored sessions, and coordination runs, browsable in the Hub's bucket viewer (and if you run init against it, you'll see the loud PUBLIC warning doing its job).

Two things worth saying plainly. First, the source transcript is never modified; the mirror only reads it and writes new part keys. Second, secret-shape redaction is on by default — it scans for the usual shapes (AWS, Anthropic, OpenAI, HF, GitHub, Slack, and generic bearer tokens) and records the count in meta.json. In this run there was nothing to redact, hence redactions=none. There are four harnesses you can mirror today: claude-code, codex, hermes, openclaw.
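Shape-based redaction of this kind can be sketched with a handful of regular expressions. The patterns below are illustrative guesses at common token shapes, not tracecraft's actual list, and the [REDACTED:...] placeholder format is an assumption too.

```python
import re

# Illustrative shapes only; the real pattern list may differ.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "hf_token": re.compile(r"\bhf_[A-Za-z0-9]{30,}\b"),
    "bearer_token": re.compile(r"\bBearer\s+[A-Za-z0-9._\-]{20,}", re.IGNORECASE),
}


def redact(text):
    """Return (redacted_text, counts_by_shape). The input string is not modified."""
    counts = {}
    for name, pattern in SECRET_PATTERNS.items():
        text, hits = pattern.subn(f"[REDACTED:{name}]", text)
        if hits:
            counts[name] = hits
    return text, counts


# Fake values assembled from pieces so no real-looking secret sits in the source.
fake_aws = "AKIA" + "ABCDEFGHIJKLMNOP"
fake_github = "ghp_" + "a" * 36
line = f"export AWS_KEY={fake_aws} and token {fake_github}"

clean, counts = redact(line)
print(clean)
print(counts)
print(redact("nothing secret here")[1])  # {} (the redactions=none case)
```

Recording only the counts, as the article describes for meta.json, keeps the audit trail without storing the secrets themselves.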

Does the atomic claim actually hold up? I tried to break it.

Here's the mechanism live — two agents, one shared bucket, racing for the same step:

Two agents race to claim the same step; the second is atomically rejected — no lock service, no server, the bucket coordinates

The coordination story rests on one mechanism — the atomic claim — so asserting "exactly one agent wins" wasn't enough for me. I had 2 to 50 agents race for the same claim at the same instant against MinIO — deliberately the worst case, a stampede on one hot key — 1,200 races in total. The exactly-one-winner invariant held on all 1,200, zero duplicate wins, verified by re-reading the stored object rather than trusting each agent's own belief. Latency degrades gracefully under pile-up (median ~6–11ms up to 8 simultaneous agents, ~41ms at 50), and a staggered control — same agent counts, fired 25ms apart — stays flat, which pins the cost on simultaneity, not agent count. Agents claiming different steps don't contend at all.
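The shape of such a stampede test can be sketched as follows. This is not the benchmark harness from the repo: it races threads against a hypothetical in-memory store whose lock stands in for the conditional PUT, uses a barrier to release every agent at the same instant, and checks the invariant by re-reading the stored object rather than trusting each agent's belief. The agent counts and trial count are arbitrary.

```python
import threading


class InMemoryBucket:
    """Hypothetical store with put-if-absent; the lock stands in for the object store."""

    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key, body):
        with self._lock:
            if key in self._objects:
                return False
            self._objects[key] = body
            return True

    def get(self, key):
        return self._objects[key]


def run_race(n_agents, key="steps/build/claim.json"):
    """Fire n_agents claims at one key at once. Returns (believed_winners, stored_owner)."""
    bucket = InMemoryBucket()
    start = threading.Barrier(n_agents)  # holds every agent until all are ready
    believed_winners = []
    record_lock = threading.Lock()

    def agent(name):
        start.wait()
        if bucket.put_if_absent(key, name):
            with record_lock:
                believed_winners.append(name)

    threads = [threading.Thread(target=agent, args=(f"agent-{i}",)) for i in range(n_agents)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return believed_winners, bucket.get(key)


violations = 0
for n_agents in (2, 8, 50):
    for _ in range(20):
        winners, stored_owner = run_race(n_agents)
        # Exactly one agent may believe it won, and the store must agree with it.
        if len(winners) != 1 or winners[0] != stored_owner:
            violations += 1

print(violations)  # 0
```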

One honest note and one confession. The note: this is single-node MinIO over loopback, so the milliseconds are a floor, not cloud-S3 wall-clock; the invariant and the shape are the claim. The confession: the benchmark flushed out a real bug. Same-second message bursts collided on one key and were silently dropped, with 32 of 960 delivered. Fixed in the released 0.2.2 (unique keys, 960/960). I'd rather find that myself than have you find it.
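How that class of bug arises is easy to show in miniature. A minimal sketch, assuming a message key built from recipient, sender, and a whole-second timestamp; the real pre-0.2.2 key format may have differed, and the unique suffix here (a UUID) is one possible fix, not necessarily the one shipped.

```python
import uuid


def colliding_key(recipient, sender, ts_seconds):
    """Assumed old-style key: two messages in the same second share it."""
    return f"messages/{recipient}/{ts_seconds}_{sender}.json"


def unique_key(recipient, sender, ts_seconds):
    """Same prefix plus a random id, so a same-second burst cannot collide."""
    return f"messages/{recipient}/{ts_seconds}_{sender}_{uuid.uuid4().hex}.json"


def deliver(key_fn, n_messages, ts_seconds=1700000000):
    """Write a burst of messages within one second. Returns how many survive."""
    mailbox = {}  # stands in for the bucket: a later write to a key replaces the earlier one
    for i in range(n_messages):
        key = key_fn("developer", "designer", ts_seconds)
        mailbox[key] = f"message {i}"
    return len(mailbox)


print(deliver(colliding_key, 30))  # 1
print(deliver(unique_key, 30))     # 30
```

The overwrite is silent because an unconditional PUT to an existing key succeeds, which is why the loss only showed up when delivered messages were counted.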

Raw per-trial logs, the harness, a regenerable report, and the latency charts are in the repo — and the entire S3 backend is about 120 lines, the claim a single conditional PUT, so the cheapest way to check me is to run your own stampede against a local MinIO. If you get a different shape, I want to know.

Why store the whole run?

Even if your agents never exchange a single message, having every session from every harness land in one bucket is what makes them analyzable — and backed up — at all. But the coordination events are the part nothing else captures: a single-agent trace tells you how one agent reasoned, while a whole run — transcripts, coordination events, the artifacts it produced — is a record of how agents actually interact, including where it goes wrong, like that blocked review handoff. The failures are arguably the most valuable rows in the set. If you want to post-train a model — or build a better harness — "here is a real run where a handoff got blocked, and here is everything that led to it" is exactly the kind of signal you'd want, and it's signal no single transcript contains.

Honest limits

I'd rather you trust this than be impressed by it, so:

  • Atomic claims need an S3-compatible backend. HF buckets don't expose a conditional-write primitive today, so on the HF backend claims fall back to check-then-write with a real race window — I measured it, and under deliberate contention it's real, not theoretical. Everything else — mirroring, memory, messages, artifacts — works the same on HF, because those only ever write unique keys. And if HF ships conditional writes, claims become atomic there too without tracecraft changing at all; I'd genuinely love that.
  • No claim TTL or heartbeat yet. If an agent crashes mid-step, it keeps its claim until something clears it — complete --force is the manual override, but there's no automatic expiry today.
  • There's no public multi-agent traces dataset up yet. Mine are staying private for now — which is, after all, the default this whole post argues for. Your agents' traces are some of the most valuable data you own; "store them in a bucket" and "publish them" are different decisions, and the second one deserves the same care "in private" implies.
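The race window in the first limit above can be made visible deterministically. A minimal sketch, assuming a backend with only plain existence checks and unconditional writes: PlainBucket and check_then_write_claim are hypothetical names, and a barrier is used to force both agents past the check before either writes, which is the interleaving that real contention produces by chance.

```python
import threading


class PlainBucket:
    """Hypothetical store with no conditional-write primitive."""

    def __init__(self):
        self._objects = {}

    def exists(self, key):
        return key in self._objects

    def put(self, key, body):
        self._objects[key] = body  # unconditional: the last writer wins

    def get(self, key):
        return self._objects[key]


def check_then_write_claim(bucket, key, agent, pause):
    """Claim by checking first, then writing. Not atomic."""
    if bucket.exists(key):
        return False
    pause()  # the race window: another agent can pass the same check here
    bucket.put(key, agent)
    return True


bucket = PlainBucket()
both_checked = threading.Barrier(2)  # releases only after both agents passed the check
results = {}
claim_key = "steps/design/claim.json"


def agent(name):
    results[name] = check_then_write_claim(bucket, claim_key, name, both_checked.wait)


threads = [threading.Thread(target=agent, args=(name,)) for name in ("designer", "developer")]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(sorted(results.items()))  # [('designer', True), ('developer', True)]
```

Both agents believe they hold the claim, while the store records only whichever wrote last. A put-if-absent primitive closes the window because the check and the write become one operation decided by the store.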

None of the coordination primitives here are novel on their own — mailboxes and task claims exist in plenty of systems. What's actually different is the combination: server-less, backend-agnostic, cross-harness, and multi-agent in one bucket you own. That's the claim, no more than that.

Try it

pip install 'tracecraft-ai[huggingface]'
tracecraft init --backend hf --bucket <user>/<name> --project demo --agent tester
tracecraft session mirror --harness claude-code

Or skip the manual setup entirely and paste this into your coding agent:

Install tracecraft-ai with the huggingface extra. Run tracecraft init with backend hf, bucket <user>/<name>, project demo, and your own name as the agent. Mirror your session into it, and upload anything you build as artifacts.

Or run it against any S3-compatible bucket with --backend s3 (a local MinIO is enough). Source, docs, and the benchmark harness are at github.com/Arrmlet/tracecraft — MIT, ~530 lines of SDK, issues and PRs welcome (a claim TTL is the next thing I want to get right). Agent traces belong in buckets you own — and I think the runs between agents belong there too. So mirror your own multi-agent runs, keep them private by default, and if you ever do choose to share one, I'd genuinely love to see what coordination looks like in the wild.
