A debugging session in March made me realise something obvious: I had been building agents the wrong way for two years. The bug was simple — a routing decision flipped between runs on the same input — but the fix took four days, because I had no way to replay anything.
The agent ran a classifier, called a tool, then summarised. Each step was a model call. Each model call was, by construction, a black box that nobody — not me, not the vendor — could reproduce bit-for-bit. So when the bug appeared in production, I could not put it back in a cage and study it.
## The deterministic envelope
The fix was to separate the workflow into two layers. The outer layer — routing, tool selection, retry policy, memory writes — is plain code. It runs the same way every time given the same inputs. The inner layer — the model call — is non-deterministic by nature, but I record its inputs and outputs in a content-addressed cache.
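To make "content-addressed" concrete, here is a minimal sketch of the cache: the key is a hash over everything that determines a model call, so identical calls hit the same entry. The names `cache_key` and `ModelCache` are illustrative, not any particular library's API.

```python
import hashlib
import json

def cache_key(model: str, messages: list[dict], **params) -> str:
    # Hash everything that determines the call: model name, parameters,
    # and the full prompt. Assumes the parameters are JSON-serializable.
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class ModelCache:
    # In-memory for the sketch; in practice this would be backed by disk
    # or object storage keyed on the same hashes.
    def __init__(self):
        self._store: dict[str, str] = {}

    def get(self, key: str) -> str | None:
        return self._store.get(key)

    def put(self, key: str, output: str) -> None:
        self._store[key] = output
```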
Now when a bug appears, I replay the outer envelope from the trace. Tool calls, branching, state mutations: all reproducible. The model outputs are pulled from the cache rather than re-sampled. The bug shows up identically every time.
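Continuing that sketch, the record/replay split can be a thin wrapper around the model client. In record mode it calls the live model and stores the output; in replay mode it returns the recording and never touches the network. Here `call_model` is passed in as a stand-in for whatever client the runtime actually wraps.

```python
from typing import Callable

def model_step(
    cache: ModelCache,
    mode: str,  # "record" or "replay"
    call_model: Callable[..., str],
    model: str,
    messages: list[dict],
    **params,
) -> str:
    key = cache_key(model, messages, **params)
    if mode == "replay":
        output = cache.get(key)
        if output is None:
            # A missing recording means the trace is incomplete; fail loudly
            # rather than silently re-sampling and breaking reproducibility.
            raise KeyError(f"no recording for step {key[:12]}")
        return output
    # Record mode: the one non-deterministic line in the whole envelope.
    output = call_model(model=model, messages=messages, **params)
    cache.put(key, output)
    return output
```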
## What this buys you
- Bisecting regressions across releases. When eval scores drop, I can replay yesterday’s traces against today’s code and isolate the change.
- Eval-as-fixture. Every production trace becomes a test case. Failures replay locally without burning tokens (see the sketch after this list).
- Audit. The trace is the artifact a regulator can inspect — every input, every output, every decision, in order.
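To make the eval-as-fixture point concrete, here is a sketch of a replay test over saved traces. The JSON layout, the `traces/prod` path, and the `myagent.routing` import are assumptions; `route_intent` is the same plain-code router that appears in the worked example below.

```python
import glob
import json
import pytest

from myagent.routing import route_intent  # hypothetical import

@pytest.mark.parametrize("path", sorted(glob.glob("traces/prod/*.json")))
def test_routing_replays_identically(path: str):
    with open(path) as f:
        trace = json.load(f)
    # The classifier output is pulled from the recording, never re-sampled,
    # so this runs offline. Only the deterministic layer re-executes.
    decision = route_intent(trace["steps"]["classify"]["output"])
    assert decision == trace["steps"]["route"]["output"]
```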
## A worked example
Here is the smallest version of the contract I now require from a runtime:
```python
from agentloom import Workflow

graph = Workflow()
graph.step("classify", model="gpt-4o", temperature=0)
graph.step("route", fn=route_intent)  # deterministic
graph.step("respond", model="claude-haiku-4-5")

trace = graph.run(input, seed=42)
trace.replay()  # bit-for-bit on every deterministic edge
```

The interesting thing is what is missing. There is no global “reproducible” flag. There is no claim that the model itself is deterministic — it is not, and pretending otherwise leads to fragile code. The runtime simply records what the model produced and replays the recording.
## What I would do differently
Start with the trace format. Everything else — the runtime, the eval harness, the dashboard — is downstream of the schema you record. We rewrote ours twice before it was stable.
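For concreteness, here is one possible shape for that schema: enough to drive replay, bisection, and audit. Every field name is illustrative, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class StepRecord:
    name: str          # e.g. "classify"
    kind: str          # "model", "tool", or "code"
    inputs_hash: str   # content address of the step's inputs
    output: str        # recorded output, replayed verbatim
    duration_ms: float

@dataclass
class Trace:
    trace_id: str
    seed: int
    input: str                                       # original user input
    steps: list[StepRecord] = field(default_factory=list)  # in execution order
```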