A debugging session in March made me realise something obvious: I had been building agents the wrong way for two years. The bug was simple — a routing decision flipped between runs on the same input — but the fix took four days, because I had no way to replay anything.
The agent ran a classifier, called a tool, then summarised. Each step was a model call. Each model call was, by construction, a black box that nobody — not me, not the vendor — could reproduce bit-for-bit. So when the bug appeared in production, I could not put it back in a cage and study it.
## The deterministic envelope
The fix was to separate the workflow into two layers. The outer layer — routing, tool selection, retry policy, memory writes — is plain code. It runs the same way every time given the same inputs. The inner layer — the model call — is non-deterministic by nature, but I record its inputs and outputs in a content-addressed cache.
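To make "content-addressed" concrete, here is a minimal sketch of the cache: the key is a hash over everything that determines a model call, so identical calls hit the same entry. The names `cache_key` and `ModelCache` are illustrative, not any particular library's API.

```python
import hashlib
import json

def cache_key(model: str, messages: list[dict], **params) -> str:
    # Hash everything that determines the call: model name, parameters,
    # and the full prompt. Assumes the parameters are JSON-serializable.
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class ModelCache:
    # In-memory for the sketch; in practice this would be backed by disk
    # or object storage keyed on the same hashes.
    def __init__(self):
        self._store: dict[str, str] = {}

    def get(self, key: str) -> str | None:
        return self._store.get(key)

    def put(self, key: str, output: str) -> None:
        self._store[key] = output
```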
Now when a bug appears, I replay the outer envelope from the trace. Tool calls, branching, state mutations: all reproducible. The model outputs are pulled from the cache rather than re-sampled. The bug shows up identically every time.
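Continuing that sketch, the record/replay split can be a thin wrapper around the model client. In record mode it calls the live model and stores the output; in replay mode it returns the recording and never touches the network. Here `call_model` is passed in as a stand-in for whatever client the runtime actually wraps.

```python
from typing import Callable

def model_step(
    cache: ModelCache,
    mode: str,  # "record" or "replay"
    call_model: Callable[..., str],
    model: str,
    messages: list[dict],
    **params,
) -> str:
    key = cache_key(model, messages, **params)
    if mode == "replay":
        output = cache.get(key)
        if output is None:
            # A missing recording means the trace is incomplete; fail loudly
            # rather than silently re-sampling and breaking reproducibility.
            raise KeyError(f"no recording for step {key[:12]}")
        return output
    # Record mode: the one non-deterministic line in the whole envelope.
    output = call_model(model=model, messages=messages, **params)
    cache.put(key, output)
    return output
```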
## What this buys you
- Bisecting regressions across releases. When eval scores drop, I can replay yesterday’s traces against today’s code and isolate the change.
- Eval-as-fixture. Every production trace becomes a test case. Failures replay locally without burning tokens (see the sketch after this list).
- Audit. The trace is the artifact a regulator can inspect — every input, every output, every decision, in order.
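To make the eval-as-fixture point concrete, here is a sketch of a replay test over saved traces. The JSON layout, the `traces/prod` path, and the `myagent.routing` import are assumptions; `route_intent` is the same plain-code router that appears in the worked example below.

```python
import glob
import json
import pytest

from myagent.routing import route_intent  # hypothetical import

@pytest.mark.parametrize("path", sorted(glob.glob("traces/prod/*.json")))
def test_routing_replays_identically(path: str):
    with open(path) as f:
        trace = json.load(f)
    # The classifier output is pulled from the recording, never re-sampled,
    # so this runs offline. Only the deterministic layer re-executes.
    decision = route_intent(trace["steps"]["classify"]["output"])
    assert decision == trace["steps"]["route"]["output"]
```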
## A worked example
Here is the smallest version of the contract I now require from a runtime:
```python
from agentloom import Workflow

graph = Workflow()
graph.step("classify", model="gpt-4o", temperature=0)
graph.step("route", fn=route_intent)  # deterministic
graph.step("respond", model="claude-haiku-4-5")

trace = graph.run(input, seed=42)
trace.replay()  # bit-for-bit on every deterministic edge
```

The interesting thing is what is missing. There is no global “reproducible” flag. There is no claim that the model itself is deterministic — it is not, and pretending otherwise leads to fragile code. The runtime simply records what the model produced and replays the recording.
## What I would do differently
Start with the trace format. Everything else — the runtime, the eval harness, the dashboard — is downstream of the schema you record. We rewrote ours twice before it was stable.
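For concreteness, here is one possible shape for that schema: enough to drive replay, bisection, and audit. Every field name is illustrative, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class StepRecord:
    name: str          # e.g. "classify"
    kind: str          # "model", "tool", or "code"
    inputs_hash: str   # content address of the step's inputs
    output: str        # recorded output, replayed verbatim
    duration_ms: float

@dataclass
class Trace:
    trace_id: str
    seed: int
    input: str                                       # original user input
    steps: list[StepRecord] = field(default_factory=list)  # in execution order
```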