Trust, but Trace

Evaluation methodology

Binary verdicts: how Ragas keeps judges honest

Ragas decomposes complex judgements into atomic yes/no questions. The trade-off: you give up information to reduce variance.

Part 1 named four axes of variance in LLM-as-judge. The simplest thing a tool can do about it is restrict what the judge can say. That strategy is scope constraint. This post is a walkthrough of how Ragas does it, in two of its metrics.

I’m building AgentAnvil, a contract-driven testing framework for LLM-based agents. Reading other evaluators is how I figure out what AgentAnvil’s contracts should look like. Ragas’s faithfulness pipeline is one of the cleanest examples of the pattern I’ve seen — and a year later they applied the same shape to agents.

The intuition

The variance you fight in LLM-as-judge is largest in the middle of the scale. Ask a judge for a 1-5 score and the boundary between a 3 and a 4 is where the noise lives. Two judges disagree there; the same judge disagrees with itself there. The ends are easier: a 1 is usually clearly wrong and a 5 clearly right, not a coin flip.

Eliminate the middle and the noise has nowhere to go.

That’s the move. Replace a continuous rubric with a binary verdict. “Is this answer faithful to the context?” becomes “Is statement K supported — yes or no?” The judge no longer ranges over a scale; it falls on one side of a line. Variance survives at the boundary, but the boundary is now an entailment decision, not a quality grade.

Aggregate the bits and you get a continuous-looking number back: the fraction supported. It’s built from binary atoms, not from a continuous judgement. With ten statements, a verdict flip moves the score by 0.1; with twenty, by 0.05. The judge’s noise averages instead of compounding.

A simulation: one answer, scored over and over by a noisy judge. The histogram is where the score lands across runs; the dashed line is its average. More statements tighten the spread. The solid line is the gate's pass mark — slide it into the spread and the gate turns flaky, pull it clear and the gate steadies.
[Figure: Run-to-run spread of the faithfulness score. A histogram of simulated faithfulness scores across many runs of the same judge on the same answer. More statements tighten the distribution into a narrow spike. A dashed line marks the score's average; a solid line marks the gate's pass mark. Bars left of the pass mark are runs the gate fails; bars right of it, runs it passes. Horizontal axis: faithfulness score, 0 to 1.]
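The same effect can be reproduced in a few lines. This is a minimal sketch under an assumed noise model: every statement is truly supported, and the judge flips each verdict independently with probability `flip_prob`. The function names, the flip probability, and the 0.8 pass mark are all illustrative choices of mine, not anything Ragas defines.

```python
import random
import statistics


def simulate_scores(n_statements, flip_prob=0.1, runs=2000, seed=0):
    """Score one fully supported answer `runs` times with a noisy judge.

    Assumed noise model: each verdict flips from 1 to 0 independently
    with probability `flip_prob`.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        verdicts = [0 if rng.random() < flip_prob else 1 for _ in range(n_statements)]
        scores.append(sum(verdicts) / n_statements)
    return scores


def summarize(scores, pass_mark=0.8):
    """Average, run-to-run spread, and share of runs that clear the gate."""
    return {
        "mean": statistics.fmean(scores),
        "spread": statistics.pstdev(scores),
        "pass_rate": sum(s >= pass_mark for s in scores) / len(scores),
    }


for n in (5, 10, 20, 40):
    print(n, summarize(simulate_scores(n)))
```

The average stays put while the spread shrinks as the statement count grows. The pass rate is the flakiness of the gate under this noise model: move `pass_mark` toward the mean and it drops.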

Ragas’s faithfulness pipeline is built around this move.

Ragas faithfulness, step by step

Faithfulness in Ragas is two LLM calls and an arithmetic step. The first call decomposes; the second judges; the score is a fraction.

[Figure: Ragas faithfulness pipeline. A vertical data flow: answer → StatementGeneratorPrompt (LLM call) → [s₁, s₂, …, sₙ] → NLIStatementPrompt (LLM call) → [v₁, v₂, …, vₙ] ∈ {0, 1} → Σ vᵢ / n (arithmetic) → faithfulness ∈ [0, 1]. The input answer is decomposed by an LLM into atomic statements. Each statement is then judged by another LLM as supported (1) or not (0). The score is the mean of the binary verdicts, in the range zero to one.]
Ragas faithfulness: decompose, judge, aggregate.

The decomposition lives in StatementGeneratorPrompt:

class StatementGeneratorPrompt(
    PydanticPrompt[StatementGeneratorInput, StatementGeneratorOutput]
):
    instruction = (
        "Given a question and an answer, analyze the complexity of each "
        "sentence in the answer. Break down each sentence into one or more "
        "fully understandable statements. Ensure that no pronouns are used "
        "in any statement. Format the outputs in JSON."
    )

The “no pronouns” instruction is the part that makes the rest of the pipeline work. Each statement has to stand alone, because the next call judges it without the rest of the answer in scope. Ragas’s own decomposition prompt carries a few-shot example: a two-sentence Einstein answer decomposes into four standalone claims, each judgeable in isolation. Pronouns leak the dependency; good decomposition removes the leak.

The verdict call is NLIStatementPrompt:

class NLIStatementPrompt(PydanticPrompt[NLIStatementInput, NLIStatementOutput]):
    instruction = (
        "Your task is to judge the faithfulness of a series of statements "
        "based on a given context. For each statement you must return "
        "verdict as 1 if the statement can be directly inferred based on "
        "the context or 0 if the statement can not be directly inferred "
        "based on the context."
    )

One call, one verdict per statement, batched. The instruction explicitly bounds the output: 1 or 0, no middle.

Then the score:

def _compute_score(self, answers: NLIStatementOutput):
    # check the verdicts and compute the score
    faithful_statements = sum(
        1 if answer.verdict else 0 for answer in answers.statements
    )
    num_statements = len(answers.statements)
    if num_statements:
        score = faithful_statements / num_statements
    else:
        logger.warning("No statements were generated from the answer.")
        score = np.nan
    return score

Fraction supported. A continuous-looking output produced by averaging binary judgements.
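To see the arithmetic in isolation, here is a standalone sketch that works on a plain list of verdicts instead of Ragas's output objects. `fraction_supported` is a name I made up; it mirrors the truthy counting and the NaN fallback of the excerpt above, with `math.nan` standing in for `np.nan`.

```python
import math


def fraction_supported(verdicts):
    """Mean of binary verdicts; NaN when there are no statements to average."""
    if not verdicts:
        return math.nan
    return sum(1 if v else 0 for v in verdicts) / len(verdicts)


print(fraction_supported([1, 1, 0, 1]))        # 0.75
print(fraction_supported([1] * 9 + [0]))       # 0.9
print(math.isnan(fraction_supported([])))      # True
print(fraction_supported([]) >= 0.8)           # False
```

The last line matters for gates: NaN fails every comparison, so an answer that decomposes into zero statements fails a `score >= threshold` check silently rather than raising.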

Three implementation details.

First, the judge is not scoring “faithfulness”; it is performing natural-language inference, a task the NLP literature had already framed and benchmarked as a classification problem. The framing is borrowed, not invented.

Second, every statement counts equally — no importance weighting, no length penalty, no confidence. That uniformity makes the variance bound cheap. It also makes the score gameable: pad an answer with three trivially-supported claims before the load-bearing one, and faithfulness rises. The score doesn’t know which atom carries the meaning.
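The padding effect is plain arithmetic. A minimal sketch, with made-up verdict lists and a hypothetical `faithfulness` helper that just averages them:

```python
def faithfulness(verdicts):
    """Unweighted mean of binary verdicts, as in the fraction-supported score."""
    return sum(verdicts) / len(verdicts)


load_bearing_only = [0]                   # the one claim that matters is unsupported
padded = [1, 1, 1] + load_bearing_only    # three trivially supported claims in front

print(faithfulness(load_bearing_only))    # 0.0
print(faithfulness(padded))               # 0.75
```

The unsupported claim is identical in both answers; only the denominator changed.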

Third, the type contract is verdict: int — any integer, with the prompt bounding the value to 0 or 1 and _compute_score falling back to truthy semantics. The constraint sits at the prompt layer; the agent metric tightens it at the type layer.

The same pattern in AgentGoalAccuracy

Within a year, Ragas brought the same pattern to agents.

AgentGoalAccuracy works on a multi-turn conversation trace. Two LLM calls; the verdict is the metric. The first infers what the run was for and how it ended:

class WorkflowOutput(BaseModel):
    user_goal: str = Field(...)
    end_state: str = Field(...)

The second compares the two and answers a binary question:

class CompareOutcomeOutput(BaseModel):
    reason: str = Field(...)
    verdict: t.Literal["0", "1"] = Field(...)

There it is: Literal["0", "1"]. The scope is in the type, not just the prompt. Pydantic refuses anything that isn’t "0" or "1". An LLM that wants to return "1.0" or "yes" or a JSON object gets rejected at the parser. The constraint is enforced before the score is ever computed.

This is a tightening over faithfulness, where verdict: int accepts any integer. Two variants exist: WithReference compares to a supplied target; WithoutReference compares to the goal the same model just inferred. The mechanism — infer, then ask one binary question — is identical, and transfers because the underlying judgement transfers: “Did the run achieve the user’s goal?” is yes-or-no in most reasonable framings.
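The difference between the two contracts can be sketched without Pydantic. The code below is a plain-Python stand-in, not Ragas or Pydantic code: `check_literal_verdict` is a hypothetical helper that mimics the type-layer check, and `truthy_verdict` mimics the truthy fallback used for faithfulness.

```python
from typing import Literal, get_args

Verdict = Literal["0", "1"]  # same shape as the verdict field in CompareOutcomeOutput


def check_literal_verdict(raw):
    """Hypothetical stand-in for the type-layer check: only "0" or "1" pass."""
    allowed = get_args(Verdict)
    if not isinstance(raw, str) or raw not in allowed:
        raise ValueError(f"verdict must be one of {allowed}, got {raw!r}")
    return raw


def truthy_verdict(raw):
    """Faithfulness-style fallback: any non-zero integer counts as supported."""
    return 1 if raw else 0


for candidate in ("1", "0", "1.0", "yes", {"verdict": 1}):
    try:
        print(repr(candidate), "accepted as", check_literal_verdict(candidate))
    except ValueError as err:
        print(repr(candidate), "rejected:", err)

print(truthy_verdict(7), truthy_verdict(-1), truthy_verdict(0))  # 1 1 0
```

Under the truthy contract a stray 7 or -1 quietly becomes "supported". Under the literal contract the bad value never reaches the arithmetic.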

One detail for anyone wiring this into CI: the trace never reaches the judge raw. The metric calls sample.pretty_repr() to flatten the multi-turn conversation into a single string before either prompt sees it. Two pipelines using different formatters could produce different verdicts from the same trace. The formatter is part of the metric, even though it never appears in the metric’s name.
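A small sketch of why the formatter matters. Both formatters here are hypothetical and neither reproduces what `pretty_repr()` emits; the point is only that the same trace becomes two different judge inputs, and that a short hash of the formatted text makes the difference visible in logs.

```python
import hashlib

# A made-up two-turn trace, stored as (role, text) pairs.
trace = [
    ("user", "Book a table for two at 7pm."),
    ("assistant", "Booked: table for two, 7pm."),
]


def format_labeled(turns):
    """Hypothetical formatter that keeps the speaker labels."""
    return "\n".join(f"{role}: {text}" for role, text in turns)


def format_bare(turns):
    """Hypothetical formatter that drops the speaker labels."""
    return "\n".join(text for _, text in turns)


def fingerprint(text):
    """Short, stable identifier for the exact string the judge would see."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


print(fingerprint(format_labeled(trace)))
print(fingerprint(format_bare(trace)))
```

Logging a fingerprint like this next to each verdict is one way to notice when two pipelines are judging different strings. It is an assumption about how you might wire CI, not something the metric does for you.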

What you lose

Binary verdicts are not a free win.

A binary verdict can’t tell you that one answer is partially right and another is excellent — both end at “supported” or “not”. For pass/fail gates, that’s the right granularity. For developer feedback — “how do I make this better?” — binary verdicts are flat. A score of 0.6 doesn’t say which three of five statements were supported, nor whether the unsupported two were minor stretches or full hallucinations. The shape of the failure is gone.
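A minimal sketch of that information loss, using placeholder statement labels and two invented verdict vectors. The helper names are mine.

```python
def score(verdicts):
    """Fraction of statements judged supported."""
    return sum(verdicts) / len(verdicts)


def unsupported(statements, verdicts):
    """Statements whose verdict was 0."""
    return [s for s, v in zip(statements, verdicts) if not v]


statements = ["s1", "s2", "s3", "s4", "s5"]  # placeholder labels
run_a = [1, 1, 1, 0, 0]
run_b = [0, 0, 1, 1, 1]

print(score(run_a), unsupported(statements, run_a))  # 0.6 ['s4', 's5']
print(score(run_b), unsupported(statements, run_b))  # 0.6 ['s1', 's2']
```

Same 0.6, disjoint failures. The per-statement verdicts exist inside the pipeline; the scalar is what throws them away.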

A constructed example: how a wrong answer slips past the gate. Its one real claim, marked “made up” below, is not supported by the context. The rest is padding. Padding pushes the score up; raising the gate's pass mark to fight back helps very little.

Context: TechCorp published its Q3 financial report on October 14. The 12-page report lists revenue, operating margin, and headcount for the quarter, but it makes no year-over-year comparisons.

Statements:
Q3 revenue grew 30% year over year. [made up, not in the context]
The report covers the third quarter. [supported by the context]
TechCorp published the report. [supported by the context]
The report came out on October 14. [supported by the context]

Score: 3 of 4 statements supported = 0.75. Gate: 0.80. Result: fail. Padding statements needed to clear this gate: 4.

It fails for now, and it should: its one real claim is made up.
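How much padding a higher pass mark buys can be worked out exactly. With u unsupported statements, k supported ones clear the gate when k / (k + u) >= pass mark, which rearranges to k >= mark * u / (1 - mark). The sketch below computes that bound; `padding_needed` is a hypothetical helper, and the pass mark is passed as a decimal string so the fraction arithmetic stays exact.

```python
import math
from fractions import Fraction


def padding_needed(pass_mark, unsupported=1):
    """Fewest supported statements k such that k / (k + unsupported) >= pass_mark.

    `pass_mark` is a decimal string such as "0.80" to avoid float rounding.
    """
    mark = Fraction(pass_mark)
    if not 0 <= mark < 1:
        raise ValueError("pass mark must be in [0, 1) for padding to ever clear it")
    return math.ceil(mark * unsupported / (1 - mark))


for mark in ("0.80", "0.90", "0.95"):
    print(mark, padding_needed(mark))
# 0.80 4
# 0.90 9
# 0.95 19
```

Moving the gate from 0.80 to 0.90 raises the cost of hiding one made-up claim from four supported statements to nine. For a model that can emit trivially true sentences at will, that is a small price.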

Time to be charitable about the design. Ragas didn’t land on the binary form for lack of thinking about granularity. They picked it because the use case Ragas grew up around, RAG output evaluation, has the right shape for it. Faithfulness in RAG is entailment: either the claim is in the retrieved context or it isn’t. Agent goal achievement, the metric they built next, is similarly entailment-shaped. The shape of the question fits the shape of the verdict.

The question isn’t “is binary good?” but “is the judgement binary-shaped?”

When this approach fits

Three conditions tell you binary verdicts are the right approach.

The decision is pass/fail, not gradient. A CI gate is pass/fail in spirit even when the score is a 0-1 float. A user-facing metric that needs to distinguish “good”, “great”, and “excellent” is not.

The intermediate judgements decompose into independent yes/no checks. Faithfulness does this naturally because each statement is independent in context. Judgements where the parts only mean something together won’t survive the pipeline.

Middle-zone variance is more expensive than information loss. A flaky CI gate that fails 1 in 20 runs from judge wobble is a worse outcome than a gate that gives no diagnostic detail. Pick the side of the trade-off where the cost is lower.

On thresholds: pick the cutoff and the typical N together. A faithfulness threshold of 0.8 with three statements means “three of three”, because two of three is only 0.67: the gate is already all-or-nothing at the metric level. With twenty, it means “sixteen of twenty”, a closer-to-continuous gate. The threshold is a number with no operational meaning until you decide how many atoms you expect.
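A quick way to check what a threshold means operationally is to convert it into a count. This sketch assumes the gate passes when score >= threshold; `min_supported` is a hypothetical helper, and the threshold is given as a decimal string so the ceiling is computed on exact fractions.

```python
import math
from fractions import Fraction


def min_supported(threshold, n):
    """Fewest supported statements out of n that still satisfy score >= threshold."""
    return math.ceil(Fraction(threshold) * n)


for n in (3, 5, 10, 20):
    k = min_supported("0.8", n)
    print(f"n={n}: need {k} of {n}, tolerates {n - k} unsupported")
# n=3: need 3 of 3, tolerates 0 unsupported
# n=5: need 4 of 5, tolerates 1 unsupported
# n=10: need 8 of 10, tolerates 2 unsupported
# n=20: need 16 of 20, tolerates 4 unsupported
```

At small N, the same 0.8 leaves no room for a single unsupported statement. The tolerance column is the number to agree on with your team, not the float.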

When all three conditions hold, binary verdicts are the cleanest tool in the box.
