Trust, but Trace

Evaluation methodology

Deterministic where you can, judge where you must

Tool calls deserve deterministic comparison. Goal completion needs LLM-based assessment. Some tools draw the line; others don't.

Part 5 read a tool that tells the judge what NOT to evaluate — rubric anchoring. The next strategy doesn’t tighten the judge’s question. It asks where you should be using a judge at all.

I’m building AgentAnvil, a contract-driven testing framework for LLM-based agents, and reading other evaluators is how I work out what those contracts should constrain. Deterministic hybridisation is the simplest of the strategies in this series and the one easiest to get half-right. Some questions about an agent have one correct answer — did it call the right tool, does the JSON parse, did it pick option B — and a regex catches them. Other questions don’t — was the response actually helpful — and a regex can’t. The strategy is to use each tool where it fits, and to make the partition visible. One tool — Inspect AI — gets the architecture right.

This post reads it from the source.

The intuition

When you have an LLM in your stack, the easy reach is to use it for everything. Score this output. Compare these two. Rate this on 0-3. Every evaluation becomes a prompt. Every prompt is a few cents and a few seconds. After a thousand samples and four scorers, the eval costs more than the run.

Many of the questions you ask in an evaluation don’t actually require a language model. “Did the agent call get_weather with the right city?” is a string compare. “Does the response include the user’s name?” is a substring check. “Is the output a valid JSON object matching this schema?” is a parser. These are cheaper, faster, and deterministic — the same input always gives the same answer. Run them locally, in a loop, for free.

The reflex to LLM-everything happens because the modern tool palette pushes you there. Most LLM-eval frameworks lead with their model-graded scorers because that’s the differentiator — the new capability that makes them an LLM-eval framework rather than a regular test runner. The match/regex/pattern scorers exist but read like a step backward. The user thinks: I have a powerful judge — why use anything else?

Because the judge is the right tool for some questions and the wrong tool for others. Use each where it fits. The hardest part is making the partition explicit so a reviewer can audit it.

Inspect AI, step by step

Inspect AI defines Scorer as a Protocol, not a class (scorer/_scorer.py:33-61):

@runtime_checkable
class Scorer(Protocol):
async def __call__(
self,
state: TaskState,
target: Target,
) -> Score | None:
...

That’s the entire contract. A scorer is an async callable that takes the task state (input, model output, tool calls, metadata) and the target (the dataset’s expected answer) and returns a Score or None. Deterministic scorers and LLM-graded scorers both satisfy this protocol. The framework doesn’t distinguish them at the type level — they’re both Scorers. The partition is a user choice, not a framework constraint.

The deterministic family lives across _match.py, _pattern.py, _classification.py, _choice.py. Built-in scorers: match() (string compare with location ∈ begin/end/any/exact); includes() (substring test); pattern() (regex over the completion); exact() (token-set normalisation + equality); f1() (SQuAD-style F1, returns a continuous 0-1); answer() (LETTER/WORD/LINE regex wrapper); choice() (multiple-choice via state.choices[i].correct). None of them import the Model type. They are pure functions of completion and target — no LLM call, no cost, no variance.

The LLM-graded family lives in _model.py. Two surfaces: model_graded_qa() for open-ended grading, model_graded_fact() for comparison against a reference. model_graded_fact is literally a one-line wrapper that calls model_graded_qa with a different template (_model.py:75-83). The actual implementation in both cases lands in _model_graded_qa_single() at _model.py:154-237. The call site that matters:

result = await model.generate([scoring_prompt])
match = re.search(grade_pattern or DEFAULT_GRADE_PATTERN, result.completion)

The default grade pattern is r"(?i)GRADE\s*:\s*([CPI])(.*)$" (_model.py:293). The judge is expected to write GRADE: C or GRADE: I (or GRADE: P if partial credit is enabled). partial_credit: bool = False by default — graders are binary unless you ask for the partial.

Be honest about the heritage. A comment at _model.py:240-241 says: “these templates are based on the openai closedqa templates” with a link to github.com/openai/evals/blob/main/evals/registry/modelgraded/closedqa.yaml. The grading approach in Inspect AI is older than most LLM-eval frameworks I’ve read — the OpenAI evals templates date from 2023. It works because the people writing it were not in a hurry to be original.

Six evaluation tasks. For each, pick the scorer family. The bottom row reports the cost of the current partition over 100 samples (rough $0.005 per LLM call).
  • Did the agent call get_weather with the right args?
    match(location="exact") — string compare on state.output.completion
  • Does the response contain a phone number?
    pattern(r"\d{3}[-.]\d{3}[-.]\d{4}") — regex against completion
  • Is the answer the multiple-choice letter the dataset expected?
    choice() — checks state.choices[i].correct
  • Does the SQuAD-style answer overlap the gold answer?
    f1() — token-set F1, returns continuous 0-1
  • Is the summary faithful to the source paragraph?
    model_graded_fact() — graded comparison against a reference
  • Was the response actually helpful to the user?
    model_graded_qa() — open-ended grading, GRADE: C/I protocol
graded tasks2 of 6
LLM calls / 100 samples200
~ cost @ $0.005$1.00

Two ways to compose scorers. The first is multi_scorer(scorers, reducer) (_multi.py:13-30): runs N scorers per sample and reduces them to one Score via a reducer (max, mean, mode, custom). The second is to pass a list directly to the Task constructor — Task(scorer=[exact(), model_graded_qa()]) — and each scorer contributes its own metric column to the report. No reducer needed.

What Inspect AI does not ship is per-sample dispatch. “For this row use exact, for that row use graded” is not a first-class declaration. The user writes a custom @scorer-decorated function that branches on state.metadata or target, or splits the dataset into two tasks. The framework gives you the primitives; it doesn’t make the partition decision for you.

The most telling detail is at _model.py:341-354,372-414. The [BEGIN DATA] ... [END DATA] markers that delimit the agent’s output inside the grading prompt are sanitized in any dataset-controlled field — question, answer, criterion, sanitized metadata. Anywhere those markers appear in user-controlled text, the spaces are replaced with dashes. A comment is explicit: “Literal space (not \\s) is intentional — \\s also matches U+00A0 (NBSP), which would let a model pre-neutralize its own output and bypass the mitigation.” The maintainers thought about prompt injection at the grader level. Few other OSS evaluators ship a comparable mitigation.

Be charitable about what this is. Inspect AI’s Scorer protocol is the cleanest deterministic-vs-graded partition of the OSS evaluators in this series — one interface, both families satisfying it, the user choosing per task. The maintainers’ own framing is direct (docs/scorers.qmd:55): a graded scorer is “used when model output is too complex to be assessed using a simple match() or pattern() scorer”. Graded is the fallback, not the default.

What you lose

You have to think. Most of the time the choice between match() and model_graded_qa() is obvious — exact answers want exact match, open-ended answers want a graded scorer. The hard cases are in the middle: a structured response the model might phrase differently, a function name that might be capitalised differently, a numeric answer with a unit the model might or might not include. The reader picks. The framework doesn’t help. That’s a feature when the picker knows what they’re doing, and a cost when they don’t.

The partition is at the task level, not the sample level — a Task in Inspect AI binds one scorer (or one fixed list of scorers) to an entire dataset. If 800 of your 1000 samples can use exact-match and 200 need graded scoring, you can’t just declare that. You write a custom scorer that switches on metadata, or you split the dataset into two tasks. Both are real fixes; neither is one line. Mixing scorer kinds at sample granularity is the missing primitive — where a more ambitious framework would automate the decision.

And the graded prompt is shared. There’s no per-task customisation of the grading template without writing your own scorer. For most open-ended questions the default template is fine — generic grading does generalise. Where the criterion is unusual or the rubric is specialised, you’re back to writing a custom @scorer that calls model.generate() yourself. That’s the same exit hatch every other framework gives you, and it does the job — but it stops being a one-liner.

When this approach fits

One question decides whether you should be using this strategy.

Are any of your evaluation questions answerable by a regex? Most evals have at least one. Tool calls have a correct answer. Structured outputs have a schema. Numeric responses have an expected value. Multiple-choice tasks have a letter. If you’re paying for an LLM call to check whether the JSON parses, you’re spending money to do what a parser does for free, and you’re adding the judge’s variance to a question that doesn’t have any.

Graded scoring earns its keep where the answer is genuinely subjective: was the response helpful, was the summary faithful, was the tone appropriate. These are questions where the answer isn’t in any string compare. The judge is the right tool. Reach for it explicitly, not by default.

What deterministic hybridisation doesn’t solve is the calibration problem on the graded scorers themselves. By splitting the work, you’ve localised the variance: everything deterministic is pristine; everything graded carries the noise the series has been documenting. The graded scorer is an LLM, and the LLM is exactly where the calibration question lives. Some tools give you statistical guarantees on that variance (Part 4’s ARES); most don’t. The final walkthrough reads the gap nothing fills.

Rule of thumb: every regex you can write is one fewer LLM call you have to debug. Use the judge where the judge is required, and not anywhere else.

LLM-as-judgeEvaluation methodology