From rubric to graph: how DeepEval splits a judgment

Part 2 read a tool that bounds a judge’s variance by shrinking what it can say. Ragas swaps a 1-5 score for a yes/no verdict — the judge falls on one side of a line instead of ranging over a scale.

This post reads a different tool. I’m building AgentAnvil, a contract-driven testing framework for LLM-based agents, and reading other evaluators is how I work out what those contracts should constrain. DeepEval bounds variance without touching the judge’s output at all. It changes the question. Instead of asking one hard question once, it splits the judgment into a sequence, or a graph, of smaller questions, each easier than the whole. That strategy is structural decomposition.

DeepEval ships it in two shapes. G-Eval lays the judgment out as an ordered checklist. The DAG lays it out as a branching graph the judge is made to walk. This post reads both, from the source.

The intuition

“Is this summary good?” is one question with at least four answers folded into it. Is it faithful to the source? Does it cover the main points? Is it concise? Does it read coherently? A judge handed the whole question has to weigh all four and collapse them into a single number. Every sub-decision happens inside one forward pass — invisible, unlogged — and the weighting is whatever the model settled on that run.

Part 2’s tool shrank the judge’s output: fewer things it could say, less room to wobble. Decomposition leaves the output alone and shrinks the question. Ask the four sub-questions separately. “Is every sentence supported by the source — yes or no?” is a narrower question than “is this good?”, and a narrower question carries less variance. The judge holds less at once, and has fewer ways to disagree with itself.

Then the combination rule moves. In a monolithic judge that rule lives in the model’s weights — unwritten, and different every run. Decompose, and the rule moves into code you wrote: an average, a threshold, a branch. Code has no variance. You haven’t made the judge more reliable — each sub-call is still an LLM call — but you’ve taken the bundling out of the model. You can no longer be surprised by how the parts were weighed, because you weighed them.
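A minimal sketch of that move, not DeepEval code. The sub-question wording, the weights and the `ask` callable (a stand-in for one yes/no LLM call) are all assumptions made for illustration:

```python
from typing import Callable, Dict

# Hypothetical sub-questions for "is this summary good?"; wording is illustrative.
SUB_QUESTIONS: Dict[str, str] = {
    "faithful": "Is every sentence supported by the source? Answer yes or no.",
    "complete": "Does the summary cover the main points? Answer yes or no.",
    "concise": "Is the summary free of padding? Answer yes or no.",
    "coherent": "Does the summary read coherently? Answer yes or no.",
}

# The combination rule lives here, in code, not in the model's weights.
# These weights are invented for the example.
WEIGHTS: Dict[str, float] = {
    "faithful": 0.4,
    "complete": 0.3,
    "concise": 0.15,
    "coherent": 0.15,
}


def combine(verdicts: Dict[str, bool]) -> float:
    """Weighted sum of the passed checks; an unfaithful summary scores 0."""
    if not verdicts["faithful"]:
        return 0.0
    return sum(WEIGHTS[name] for name, passed in verdicts.items() if passed)


def judge_summary(ask: Callable[[str], bool]) -> float:
    """Ask each narrow question separately, then combine deterministically."""
    verdicts = {name: ask(question) for name, question in SUB_QUESTIONS.items()}
    return combine(verdicts)
```

Each `ask` call can still be wrong, but `combine` returns the same number for the same verdicts on every run, and changing how the parts are weighed is a code change you can review.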

The two shapes differ on one thing: one still asks the judge for a score and just briefs it better; the other never asks for a score at all.

G-Eval, step by step

G-Eval is the gentler shape. DeepEval’s GEval metric takes one of two things — a plain-language criteria string, or an explicit evaluation_steps list:

class GEval(BaseMetric):
    def __init__(
        self,
        name: str,
        evaluation_params: List[SingleTurnParams],
        criteria: Optional[str] = None,
        evaluation_steps: Optional[List[str]] = None,
        ...
    ):
        validate_criteria_and_evaluation_steps(criteria, evaluation_steps)

The validator on the first line of the body insists on exactly one of them. Give it criteria, and before the judge scores a test case, the metric turns that one sentence into steps:

def _generate_evaluation_steps(self, multimodal: bool) -> List[str]:
    if self.evaluation_steps:
        return self.evaluation_steps

    g_eval_params_str = construct_g_eval_params_string(self.evaluation_params)
    prompt = self.evaluation_template.generate_evaluation_steps(
        criteria=self.criteria,
        parameters=g_eval_params_str,
        multimodal=multimodal,
    )
    return generate_with_schema_and_extract(
        metric=self, prompt=prompt, schema_cls=gschema.Steps, ...
    )

The prompt is a single instruction: “generate 3-4 concise evaluation steps based on the criteria”. The G-Eval paper — Liu et al. (2023) — calls these steps the chain of thought. DeepEval generates the chain once, from your criterion, and reuses it on every test case. A vague instruction becomes an explicit, numbered checklist. That checklist is the decomposition.

Then the scoring call. The judge sees the numbered steps and the test case and returns a JSON object — a score and a reason. The score is an integer in a range, 0 to 10 by default; DeepEval normalises it to a 0-1 metric.

An integer score is exactly the clustered, low-resolution output Part 1 flagged: ask a model for an integer and its whole distribution collapses onto one. Where the judge model exposes log-probabilities, DeepEval does not take the integer at face value. It reads the token-level log-probabilities of the score token (every integer the model considered, and how much probability it put on each) and returns the probability-weighted average:

# inside calculate_weighted_summed_score(...)
min_logprob = math.log(0.01)  # filter out tokens with <1% linear probability
for token_logprob in score_logprobs.top_logprobs:
    logprob = token_logprob.logprob
    if logprob < min_logprob:
        continue
    if not token_logprob.token.isdecimal():
        continue
    linear_prob = math.exp(logprob)
    token_score = int(token_logprob.token)
    # ... accumulate token_score weighted by linear_prob ...
weighted_summed_score = sum_of_weighted_scores / sum_linear_probability

Tokens under 1% probability are dropped; what survives is summed by weight. The judge says “4”; the weighted score lands near 3.8, because the model also held real probability on 3 and 5. A continuous score, recovered from a clustered one, with no second call.
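To make the arithmetic concrete, here is a self-contained sketch of the same weighting. It follows the logic of the excerpt but is not DeepEval's function: `weighted_score`, its `(token, logprob)` input format and the probabilities below are invented for the example.

```python
import math
from typing import List, Tuple


def weighted_score(
    top_logprobs: List[Tuple[str, float]], min_prob: float = 0.01
) -> float:
    """Probability-weighted mean of the integer tokens a judge considered."""
    min_logprob = math.log(min_prob)
    weighted_sum = 0.0
    total_prob = 0.0
    for token, logprob in top_logprobs:
        if logprob < min_logprob:  # drop tokens under the probability floor
            continue
        if not token.isdecimal():  # drop tokens that are not integers
            continue
        prob = math.exp(logprob)
        weighted_sum += int(token) * prob
        total_prob += prob
    if total_prob == 0.0:
        raise ValueError("no usable score tokens")
    return weighted_sum / total_prob


# Invented distribution: the judge "says" 4 but also holds mass on 3 and 5.
candidates = [
    ("4", math.log(0.594)),
    ("3", math.log(0.297)),
    ("5", math.log(0.099)),
    ("2", math.log(0.005)),     # under 1%, so it is filtered out
    ("four", math.log(0.005)),  # under 1% too, and not a decimal token
]
score = weighted_score(candidates)
```

With these numbers the surviving mass is 0.99 and the weighted sum is 3.762, so `score` comes out at roughly 3.8 while the argmax is still 4.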

[Interactive figure: an LLM judge rarely lands on one clean integer; it spreads probability across score tokens. As the judge's opinion shifts, the argmax (the top integer, which a raw score reports) jumps, while DeepEval's logprob-weighted score (what GEval records) slides. GEval scores on a 0 to 10 integer scale; the figure shows five of those integers. Bars under 1% are dropped, as in the source.]

Be fair about what this is. The decomposition here is advisory. The steps brief the judge; they don’t bind it — nothing stops the model from skimming step three. What G-Eval buys for that looseness is reach. It works on any criterion you can write in a sentence, and the auto-generated checklist means you don’t have to know the sub-questions in advance. The judge still makes the whole judgment — it just makes it better briefed.

The DAG paradigm

The DAG makes the decomposition binding.

Where G-Eval briefs the judge and still asks it for a score, DAGMetric never asks for a score. You hand it a graph — a DeepAcyclicGraph — and the judge only ever answers one node at a time. There are four node types.

A TaskNode is an extraction step. It runs an LLM call to pull out intermediate material — “list the claims in the summary” — and passes the result downstream. The node does no scoring itself.

A BinaryJudgementNode asks one yes/no question. It must have exactly two children, one for each answer:

def __post_init__(self):
    if len(self.children) != 2:
        raise ValueError("BinaryJudgementNode must have exactly 2 children.")
    ...
    verdicts = [child.verdict for child in self.children]
    if verdicts.count(True) != 1 or verdicts.count(False) != 1:
        raise ValueError("BinaryJudgementNode must have one True and one False VerdictNode child.")

A NonBinaryJudgementNode asks one multiple-choice question. DeepEval builds the verdict schema at runtime, so the model cannot answer off-list:

self._verdict_schema = create_model(
    "NonBinaryJudgementVerdict",
    verdict=(Literal[tuple(self._verdict_options)], ...),
    reason=(str, ...),
)
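The excerpt relies on pydantic's `create_model`. The same closed-answer idea can be sketched with the standard library alone; `make_verdict_type`, `parse_verdict` and the tone options below are hypothetical names for illustration, not DeepEval's API:

```python
from typing import Any, List, Literal, get_args


def make_verdict_type(options: List[str]) -> Any:
    """Build a Literal type from options that are only known at runtime."""
    return Literal[tuple(options)]


def parse_verdict(raw: str, verdict_type: Any) -> str:
    """Accept the judge's answer only if it is one of the listed options."""
    allowed = get_args(verdict_type)
    if raw not in allowed:
        raise ValueError(f"verdict {raw!r} is not one of {allowed}")
    return raw


# Hypothetical options for a multiple-choice judgement node.
Tone = make_verdict_type(["formal", "neutral", "casual"])
```

`parse_verdict("neutral", Tone)` passes the answer through, while an off-list answer such as `"sarcastic"` raises `ValueError` instead of silently picking a branch.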

And a VerdictNode is a branch outcome. It carries either a score or a child — never both:

def __post_init__(self):
    if self.score is not None and self.child is not None:
        raise ValueError(
            "A VerdictNode can have either a 'score' or a 'child', but not both."
        )
    if self.score is None and self.child is None:
        raise ValueError(
            "A VerdictNode must have either a 'score' or a 'child'."
        )

A score makes the node a leaf. A child continues the traversal.

Running the metric is a graph walk. Each judgement node makes its one call, gets a verdict, and the verdict picks which child runs — every other child returns early:

# VerdictNode._execute — a branch runs only if its verdict
# matches the parent judgement's verdict
if self._parent._verdict.verdict != self.verdict:
    return

You land on a leaf, and the leaf’s score, divided by ten, is the metric.
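The walk can be sketched in a few lines. This is a simplified stand-in, not DeepEval's node classes: `Leaf`, `Judgement` and `walk` are hypothetical, only binary nodes are modelled, and the second and third questions and all leaf scores are invented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    score: int  # 0 to 10; the metric is this value divided by ten


@dataclass
class Judgement:
    question: str
    branches: Dict[bool, Union["Judgement", Leaf]]  # one child per answer


def walk(node: Union[Judgement, Leaf], judge: Callable[[str], bool]) -> float:
    """Follow one verdict per node until a leaf; return its score over ten."""
    while isinstance(node, Judgement):
        verdict = judge(node.question)  # one narrow call per node
        node = node.branches[verdict]   # the verdict picks the branch
    return node.score / 10


graph = Judgement(
    "Is every claim in the summary supported by the source?",
    {
        False: Leaf(0),
        True: Judgement(
            "Does the summary cover the main points?",
            {
                False: Leaf(4),
                True: Judgement(
                    "Is the summary concise?",
                    {False: Leaf(7), True: Leaf(10)},
                ),
            },
        ),
    },
)

# A canned judge standing in for the LLM: fixed answers per question.
answers = {
    "Is every claim in the summary supported by the source?": True,
    "Does the summary cover the main points?": True,
    "Is the summary concise?": False,
}
score = walk(graph, answers.__getitem__)
```

Here the path is yes, yes, no, so the walk ends on the leaf worth 7 and `score` is 0.7. The judge never sees a scale, only one question at a time.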

[Interactive figure: a DeepEval DAGMetric for summary quality with three binary judgements and four score leaves. Answering each question lights up the path, drops the dead branches, and the leaf reached sets the score. Judgement node 1 of 3 asks: “Is every claim in the summary supported by the source?”]

That is the difference from G-Eval. G-Eval’s steps all feed one scoring call; the judge still makes the judgment, just well briefed. The DAG never lets the judge make the judgment. It only ever asks for one binary or n-ary verdict per node, and the graph — code you wrote — decides the score. Same verdicts, same path, same number, every run. DeepEval rejects a graph with a cycle before running anything, with a flat “Cycle detected in DAG graph.”
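How DeepEval performs that check is not shown in the excerpts above, so the following is only a generic sketch of the idea: a depth-first search over a hypothetical adjacency mapping (`children`, node name to child names) that reports a cycle when an edge leads back into the path being explored.

```python
from typing import Dict, List


def has_cycle(children: Dict[str, List[str]]) -> bool:
    """Depth-first search with three colours; a back edge means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in children}

    def visit(node: str) -> bool:
        colour[node] = GREY  # node is on the current DFS path
        for child in children.get(node, []):
            state = colour.get(child, WHITE)
            if state == GREY:  # edge back into the current path
                return True
            if state == WHITE and visit(child):
                return True
        colour[node] = BLACK  # fully explored, no cycle through here
        return False

    return any(colour[node] == WHITE and visit(node) for node in list(children))


# Hypothetical graphs: node names are illustrative.
acyclic = {
    "faithful": ["leaf_0", "complete"],
    "complete": ["leaf_4", "leaf_10"],
}
cyclic = {"a": ["b"], "b": ["a"]}
```

`has_cycle(acyclic)` is `False` and `has_cycle(cyclic)` is `True`. Running a check like this before any judge call means a malformed graph fails fast and costs nothing.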

And the two shapes compose. A VerdictNode’s child can be a GEval. Where a branch needs a graded score rather than another fork, you hang a G-Eval off the leaf — the graph routes, and G-Eval scores the last mile. They are not rivals; one can contain the other.

What you lose

Decomposition is not free — it costs two things.

The first is surface area. A monolithic judge makes one call you can be wrong about. A DAG with a task node, three judgement nodes and four leaves makes up to four calls along a path, each its own chance to be wrong. Those calls also run in sequence down the path, so in a CI gate that DAG can cost up to four times the latency and API spend of one holistic judge. You haven’t removed the judge’s unreliability; you’ve spread it across more nodes. The bet is that each smaller call is reliable enough that the product still beats the monolith. That bet can lose. The worst case is an interior node: a wrong verdict there doesn’t nudge the score, it sends the traversal down an entirely wrong branch. The failure isn’t noisy. It’s structural, and the same every run.
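The bet can be put in numbers. The sketch below assumes errors at each call are independent, which real judges will not strictly satisfy, and the accuracy figures are invented rather than measured:

```python
def path_reliability(per_call_accuracy: float, calls: int) -> float:
    """Chance that every call on a path is right, assuming independent errors."""
    return per_call_accuracy ** calls


def beats_monolith(per_call_accuracy: float, calls: int, monolith_accuracy: float) -> bool:
    """True when the all-calls-right probability exceeds the single-call judge."""
    return path_reliability(per_call_accuracy, calls) > monolith_accuracy


# Invented figures: four calls at 95% each against two hypothetical monoliths.
four_calls = path_reliability(0.95, 4)
wins_against_75 = beats_monolith(0.95, 4, 0.75)
wins_against_85 = beats_monolith(0.95, 4, 0.85)
```

Four calls at 95% each leave about an 81% chance that the whole path is right. That beats a monolith that is right 75% of the time and loses to one that is right 85% of the time. The figure is a pessimistic bound, since a wrong verdict on one node does not always change the leaf reached.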

The second is that you now have to be right about the structure. The graph is your theory of the judgment — which sub-questions matter, in what order, and how they combine. If the theory is wrong, the metric is wrong deterministically, on every input. G-Eval has a softer version of the same exposure: the auto-generated steps are an LLM output too, and most people never read them.

Be charitable about this. It is the honest cost of making a judgment legible. A monolithic judge hides its reasoning, so you cannot see where it went wrong. A decomposed one shows you exactly which node to distrust. Both have bugs. Only one of them has a stack trace.

When this approach fits

Three conditions tell you decomposition will pay.

The judgment genuinely has parts. “Faithful and concise and complete” comes apart cleanly into three checks. “Is this explanation elegant” does not — forcing a graph onto a judgment with no seams just adds calls and a false sense of rigour. Decompose where there are real joints.

You know the sub-questions better than the model would guess. If your decomposition is only the model’s own implicit steps written down, G-Eval’s auto-generated checklist already does that, for free. The DAG earns its complexity when you know something the model wouldn’t infer — an order, a dependency, a short-circuit. A failed faithfulness check should stop the traversal before you ever spend a call scoring style.
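The short-circuit is ordinary control flow once the order is yours to write. In this sketch the `judge` callable stands in for one yes/no LLM call, and the style question and the 0.6 score are assumptions made for illustration:

```python
from typing import Callable, List


def gated_score(judge: Callable[[str], bool]) -> float:
    """Check faithfulness first; only spend a style call if it passes."""
    if not judge("Is every claim in the summary supported by the source?"):
        return 0.0  # short-circuit: the style question is never asked
    if not judge("Is the style appropriate for the audience?"):
        return 0.6
    return 1.0


calls: List[str] = []


def unfaithful_judge(question: str) -> bool:
    """Canned judge that fails every check and records what it was asked."""
    calls.append(question)
    return False


result = gated_score(unfaithful_judge)
```

After this run `result` is 0.0 and `calls` holds a single question: the unfaithful summary cost one call, not two.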

You need the result auditable or stable. A DAG gives a fixed path you can point at in review. When a gate’s verdict has to be explained to someone who did not write it, the traversal is the explanation. A holistic score is not.

The rule of thumb: reach for G-Eval when you want a better holistic score cheaply, and the DAG when the judgment has real structure you need enforced rather than suggested. Remember the two slide into each other — a G-Eval at a DAG leaf is a legitimate middle, not a compromise. The choice is not really which tool. It is how much of the judgment you can afford to leave inside the model.

Part 3 of seven. Structural decomposition: splitting one hard judgment into a graph of smaller ones, and handing the arithmetic to code. The next walkthrough reads a tool that leaves the judge’s reasoning alone and asks instead how far off its score could be. If you arrived here directly, Part 1 names the variance and Part 2 bounds it with binary verdicts.

Part of the same series; related to a paper I’m presenting on judge variance in OSS evaluators at DITTET 2026 (July).

Written with AI assistance — see AI policy.
