When judges show their uncertainty: DeepEval and ARES

Part 3 read a tool that splits a judgment into smaller ones — a graph of binary questions whose path picks the score. The next strategy stops splitting and asks something else. It runs the judgment once and tries to measure how sure the answer is. That strategy is probabilistic calibration.

I’m building AgentAnvil, a contract-driven testing framework for LLM-based agents, and reading other evaluators is how I work out what those contracts should constrain. Probabilistic calibration is the strategy I had the hardest time forming an opinion on, because two tools take genuinely different routes to it. DeepEval reads uncertainty inside a single judge call, off the model’s own token probabilities. ARES wraps a formal confidence interval around the score the judge produces over an evaluation set. One returns a smoother number; the other returns an interval you can defend.

This post reads both, from the source.

The intuition

Uncertainty in an LLM judge lives in two places, and the two tools target one each.

The first is inside a single call. The judge never commits to one integer the way the JSON response makes it look. It commits to a distribution over the integers it considered: most of the probability on “4”, some on “3” and “5”, a sliver on the rest. The judge’s confidence is the spread of that distribution. A judge that puts 99% of its mass on “4” is sure. A judge that splits 35/30/25 across “3/4/5” is hedging. The raw integer hides the difference: each judge returns a single number, with no trace of how sure it was.
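One way to put a number on that spread is Shannon entropy. The sketch below scores the two judges from the paragraph above. The function name, and the way the leftover 1% and 10% of mass are assigned to other integers, are illustrative assumptions; neither tool computes this.

```python
import math

def entropy_bits(dist):
    """Shannon entropy, in bits, of a {score: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# A judge that is sure: 99% of the mass on 4.
# (The remaining 1% is placed on 5 purely for illustration.)
sure = {4: 0.99, 5: 0.01}

# A judge that is hedging: 35/30/25 across 3/4/5.
# (The remaining 10% is split over 1 and 2 purely for illustration.)
hedging = {1: 0.05, 2: 0.05, 3: 0.35, 4: 0.30, 5: 0.25}

for name, dist in (("sure", sure), ("hedging", hedging)):
    print(f"{name}: entropy = {entropy_bits(dist):.2f} bits")
```

Low entropy means the mass is concentrated on one score; high entropy means it is spread across several.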

The second is across calls. Run the judge on a thousand inputs and average the scores; you have a benchmark number. Run it on a smaller sample with human gold labels; you have a measurement of its bias. Combine those and you can produce an interval, not a point — the kind of confidence interval a statistician would write down. The judge stays exactly as noisy as it was. You just bound how wrong the aggregate could be.

Part 3 already met the DeepEval function that consumes the first kind of uncertainty — calculate_weighted_summed_score. There, the point was that probability weighting recovers a continuous number from an integer scale. Here it does something else: the spread of the distribution is the judge’s confidence, even if the score field never returns it. ARES targets the second kind, from a different angle entirely, with statistical machinery from a 2023 paper on prediction-powered inference. Two routes; the same problem; very different code.

DeepEval: a distribution behind every score

GEval makes one call to score a test case and asks for the score token’s log-probabilities at the same time. The provider call is written against the OpenAI API (the Azure-, LiteLLM-, and OpenRouter-wrapped paths each duplicate it), and the parameters are explicit:

completion = client.chat.completions.create(
    model=self.name,
    messages=messages,
    temperature=self.temperature,
    logprobs=True,
    top_logprobs=top_logprobs,
    **self.generation_kwargs,
)

top_logprobs is a GEval constructor parameter; the default is 20, which is the OpenAI API’s hard limit. The response carries the top-20 alternatives the model held probability on for each generated token. The provider returns the distribution; GEval consumes it.

Then the noise floor. Part 3 already showed the loop body of calculate_weighted_summed_score — the one that filters tokens below 1% linear probability and sums the rest by weight. The 1% threshold is a hard-coded local:

# Filter out tokens with <1% linear probability, i.e., logprobs < math.log(0.01)
min_logprob = math.log(0.01)

Not configurable, not exposed, not a constant — a magic number embedded in the function body. The author of GEval picked the line where the judge’s noise stops counting.
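To make the filter concrete, here is a minimal sketch of probability weighting with a 1% noise floor. It is not DeepEval's code: the function name weighted_score and the plain token-to-logprob dict are assumptions for illustration, standing in for the top_logprobs entries of the score token.

```python
import math

MIN_LOGPROB = math.log(0.01)  # same 1% linear-probability floor as the quoted snippet

def weighted_score(top_logprobs):
    """Probability-weighted mean over integer tokens that clear the 1% floor.

    `top_logprobs` is a hypothetical {token: logprob} dict, not a provider object.
    """
    kept = {}
    for token, logprob in top_logprobs.items():
        if logprob < MIN_LOGPROB:   # below the noise floor: dropped
            continue
        if not token.isdecimal():   # non-numeric alternatives cannot be scored
            continue
        kept[int(token)] = kept.get(int(token), 0.0) + math.exp(logprob)
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no score token cleared the 1% floor")
    # Renormalise so the surviving probabilities sum to 1.
    return sum(score * p for score, p in kept.items()) / total

# Made-up alternatives for one score token.
alternatives = {
    "4": math.log(0.60),
    "3": math.log(0.20),
    "5": math.log(0.195),
    "Four": math.log(0.004),  # below the floor, and not a decimal token
    "1": math.log(0.001),     # below the floor
}
score = weighted_score(alternatives)
print(f"weighted score: {score:.3f}")
```

Only the tokens 3, 4, and 5 survive the filter here; the two low-probability alternatives never reach the weighted sum.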

[Interactive figure: the judge's distribution over integer scores, with a sharpness slider (shown at 0.60). Weighted score: 4.00 (what GEval records). Judge's confidence: medium (what GEval drops).]

The figure makes the point. As the sharpness changes, the weighted score stays at 4.00, because the distribution is symmetric and probability weighting respects symmetry. The chip beside it moves between “high” and “low” confidence as the spread changes. That chip is what GEval doesn’t return.
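The invariance is easy to check numerically. The sketch below builds a symmetric distribution around 4 whose width is set by a sharpness knob. The bell-shaped parameterisation and the restriction to the scores 3, 4, and 5 are assumptions chosen for illustration, not the formula behind the figure.

```python
import math

def symmetric_dist(sharpness, center=4, scores=(3, 4, 5)):
    """Symmetric distribution around `center`; higher sharpness means narrower."""
    weights = {s: math.exp(-sharpness * (s - center) ** 2) for s in scores}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

def mean(dist):
    return sum(s * p for s, p in dist.items())

def std(dist):
    m = mean(dist)
    return math.sqrt(sum(p * (s - m) ** 2 for s, p in dist.items()))

for sharpness in (0.2, 0.6, 3.0):
    d = symmetric_dist(sharpness)
    print(f"sharpness={sharpness}: mean={mean(d):.2f}  std={std(d):.2f}")
```

The mean is pinned at the centre for every sharpness, while the standard deviation shrinks as the distribution narrows. A caller who sees only the mean cannot tell the three cases apart.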

The collapse happens at the metric’s edge. The full distribution flows through calculate_weighted_summed_score and lands as a single normalised scalar:

self.score = (
    (float(g_score) - self.score_range[0])
    / self.score_range_span
    if not self.strict_mode
    else int(g_score)
)
self.success = self.score >= self.threshold

After that, the only fields the caller can read are score, success, reason, verbose_logs, and evaluation_cost. No per-token probabilities, no count of dropped tokens, no entropy, no second moment. The weighted mean is the verdict; pass/fail compares it to the threshold. The judge’s distribution informed the score and then disappeared.
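As a worked example of that collapse, the sketch below normalises two weighted scores and applies a threshold. The 1 to 5 range, the 0.5 threshold, and the reading of score_range_span as upper bound minus lower bound are assumptions for illustration; the snippet above does not show how the span is defined.

```python
def normalise(g_score, score_range=(1, 5)):
    """Map a weighted score onto [0, 1], mirroring the non-strict branch above."""
    low, high = score_range
    span = high - low  # assumed meaning of score_range_span
    return (float(g_score) - low) / span

threshold = 0.5  # hypothetical pass mark

for g_score in (4.0, 2.5):
    score = normalise(g_score)
    success = score >= threshold
    print(f"g_score={g_score} -> score={score} success={success}")
```

Each result is one float and one boolean. Nothing about the distribution that produced g_score is carried along.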

Be fair about what this is. GEval implements the Liu et al. (2023) recipe faithfully: probability weighting is the move that recovers a continuous score from an integer rubric, and DeepEval ships it cleanly with a documented filter and a reasonable default. The mechanism is right. The omission is at the surface the caller sees. The information exists for the duration of one function call; nothing in the API hands it back.

ARES: a confidence interval, not a score

ARES asks a different question. Not “how sure is the judge about this one input?” but “how sure can I be about the aggregate score, given that a small slice of it has gold labels?” The judge in ARES is a fine-tuned DeBERTa-v3-large classifier by default; an LLM judge can be swapped in. Either way, what the judge returns is a per-example 0/1 prediction. The interesting code is what ARES does with those predictions.

The setup is three arrays. Y_labeled — the human gold labels on a small annotated subset, typically a few hundred examples. Yhat_labeled — the judge’s predictions on those same examples. Yhat_unlabeled — the judge’s predictions on the large unlabeled evaluation set. Then one function does the statistics:

# Imports the excerpt relies on.
import numpy as np
from scipy.stats import norm

def pp_mean_iid_asymptotic(Y_labeled, Yhat_labeled, Yhat_unlabeled, alpha):
    n = Y_labeled.shape[0]
    N = Yhat_unlabeled.shape[0]
    tildethetaf = Yhat_unlabeled.mean()
    rechat = (Yhat_labeled - Y_labeled).mean()
    thetahatPP = tildethetaf - rechat
    sigmaftilde = np.std(Yhat_unlabeled)
    sigmarec = np.std(Yhat_labeled - Y_labeled)
    hw = norm.ppf(1 - alpha / 2) * np.sqrt(
        sigmaftilde ** 2 / N + sigmarec ** 2 / n
    )
    return [thetahatPP - hw, thetahatPP + hw]

That’s the whole estimator. Three pieces.

The first is the point estimate. tildethetaf is the judge’s mean on the unlabeled set — a biased number if the judge is biased. rechat is the judge’s mean error on the labeled subset, mean(Yhat - Y). Subtracting one from the other gives thetahatPP: the judge’s aggregate score with its own measured bias removed. The labeled subset is what makes the correction possible — without gold labels, no rechat, no rectification.
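A toy calculation shows the rectification at work. The arrays below are made up for illustration: a judge that says 1 slightly too often, measured on a ten-example labeled set.

```python
# Hypothetical data: gold labels and judge predictions on the same 10 examples.
Y_labeled    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
Yhat_labeled = [1, 1, 1, 1, 0, 0, 1, 1, 1, 0]  # two false positives

# Hypothetical judge predictions on a (tiny) unlabeled set: 14 ones out of 20.
Yhat_unlabeled = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0]

def mean(xs):
    return sum(xs) / len(xs)

tildethetaf = mean(Yhat_unlabeled)                                  # judge's raw mean
rechat = mean([yh - y for yh, y in zip(Yhat_labeled, Y_labeled)])   # measured bias
thetahatPP = tildethetaf - rechat                                   # rectified estimate

print(f"raw={tildethetaf:.2f} bias={rechat:.2f} rectified={thetahatPP:.2f}")
```

The judge's raw mean is 14/20, its measured over-prediction on the labeled set is 2/10, and the rectified estimate is the difference between the two.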

The second is the interval. hw is the half-width: the Z multiplier times the square root of the sum of two variance terms, sigma_f² / N from the judge across the large unlabeled set and sigma_rec² / n from the rectifier across the small labeled one. The unlabeled set is usually orders of magnitude larger, so the judge term is small and the rectifier term dominates. With the default alpha=0.05 the Z multiplier is ~1.96, which gives the familiar 95% interval.
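To see how the two terms trade off, the sketch below computes the half-width for a few sizes of n and N. The standard deviations are assumed values chosen for illustration, not measurements from ARES, and half_width is a hypothetical helper that mirrors the formula in pp_mean_iid_asymptotic using only the standard library.

```python
from math import sqrt
from statistics import NormalDist

def half_width(sigma_f, sigma_rec, n, N, alpha=0.05):
    """PPI half-width: z * sqrt(sigma_f^2 / N + sigma_rec^2 / n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return z * sqrt(sigma_f ** 2 / N + sigma_rec ** 2 / n)

sigma_f, sigma_rec = 0.49, 0.30  # assumed spreads of the judge and the rectifier

for n, N in [(300, 2_000), (300, 200_000), (1_200, 2_000)]:
    judge_term = sigma_f ** 2 / N
    rect_term = sigma_rec ** 2 / n
    hw = half_width(sigma_f, sigma_rec, n, N)
    share = rect_term / (judge_term + rect_term)
    print(f"n={n:>5} N={N:>7} half-width=±{hw:.3f} rectifier share={share:.0%}")
```

With n fixed, growing N a hundredfold cannot push the half-width below z · sigma_rec / sqrt(n). Under these assumed spreads, only more labels move that floor.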

[Interactive figure: the PPI half-width has two variance terms. One shrinks with the human-labeled set n, the other with the unlabeled set N. At n = 300 and N = 2,000 the figure shows a half-width of ±0.041 and a 95% CI of [0.569, 0.651], with the variance budget split between σ²_rec/n (rectifier) and σ²_f/N (judge floor).]

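The third is the output. What ARES reports is the pair [thetahatPP - hw, thetahatPP + hw]: a span, not a point. The README’s worked example shows [0.547, 0.664]. The downstream consumer reads that as “the true aggregate score is in this range with 95% confidence”, not “the judge gave 0.61”. Strictly, the 95% describes the procedure: intervals built this way cover the true score about 95% of the time.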
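What 95% means here can be checked by simulation. The sketch below draws synthetic binary labels and a noisy, biased judge, builds the interval with the same formula using only the standard library, and counts how often it covers the true mean. The data-generating numbers (true rate, false-positive rate, false-negative rate) and the helper names are assumptions for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def pvar(xs):
    """Population variance, matching np.std(xs) ** 2."""
    m = fmean(xs)
    return fmean([(x - m) ** 2 for x in xs])

def ppi_interval(y_lab, yhat_lab, yhat_unlab, alpha=0.05):
    """Standard-library restatement of the asymptotic PPI mean interval."""
    n, N = len(y_lab), len(yhat_unlab)
    rect = [yh - y for yh, y in zip(yhat_lab, y_lab)]
    theta = fmean(yhat_unlab) - fmean(rect)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    hw = z * sqrt(pvar(yhat_unlab) / N + pvar(rect) / n)
    return theta - hw, theta + hw

def judge(y, rng, fpr=0.20, fnr=0.05):
    """Hypothetical biased judge: flips 0 to 1 more often than 1 to 0."""
    if y == 1:
        return 0 if rng.random() < fnr else 1
    return 1 if rng.random() < fpr else 0

def coverage(trials=200, n=300, N=2_000, true_rate=0.6, seed=0):
    """Fraction of simulated intervals that contain the true rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y_lab = [int(rng.random() < true_rate) for _ in range(n)]
        yhat_lab = [judge(y, rng) for y in y_lab]
        yhat_unlab = [judge(int(rng.random() < true_rate), rng) for _ in range(N)]
        lo, hi = ppi_interval(y_lab, yhat_lab, yhat_unlab)
        hits += lo <= true_rate <= hi
    return hits / trials

cov = coverage()
print(f"empirical coverage: {cov:.2f}")
```

The judge in this simulation over-predicts on purpose. The interval should still cover the true rate in roughly 95% of trials, because the rectifier measures that bias on the labeled subset and subtracts it.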

Two quirks worth flagging. The default human set is capped at 300 (.head(300)) — a hard floor on the rectifier term. And the classical-interval comparison the code looks like it computes is dead: binomial_iid is called but never imported, so only the PPI interval survives.

Be charitable about the choice. The statistics are the right primitive for the question ARES asks. PPI is recent — Angelopoulos et al. (2023) — and most evaluation tools haven’t caught up to the idea that “we have a small gold set and a large unlabeled set” is a setting with a published estimator. ARES picked it up and wrote it down. That alone makes it worth reading.

What you lose

Both routes pay, and the cost shapes are different. DeepEval pays at the interface: the distribution exists inside one call, the function uses it, then it stops existing for the caller. A CI gate firing on self.score >= self.threshold can’t ask “was this a confident pass or a hedged one?” because the field that would answer isn’t there. ARES pays in human labels: 300 annotated examples per evaluation. For a paper or a regulator-facing benchmark that’s a one-time cost; for a CI gate that runs on every PR it’s prohibitive. Each tool’s cost shape matches its intended use; neither is an oversight.

And both miss the same thing. Probability weighting and PPI both treat the prompt as fixed. Neither bounds the prompt-sensitivity axis Part 1 flagged. Reword the rubric and the distribution moves, but the function still reports the inside-this-prompt confidence as if nothing changed. Probabilistic calibration is calibration given the prompt. Calibrating over prompts is a different problem and a different paper.

When this approach fits

One question decides which route, or whether either.

Are you producing a score, or a measurement? A score is what a CI gate reads — one number, today, on this PR. A measurement is what a paper, leaderboard, or audit report carries — a claim about the aggregate that has to survive scrutiny. DeepEval’s probability weighting is a better score primitive: cheap, single-call, no annotation cost, returns a continuous number that respects the judge’s hedging. ARES’s PPI is a measurement primitive: it returns an interval whose width you can argue about, whose assumptions you can audit, but the rectifier term needs ~300 ground-truth labels you have to budget once. If you can’t pay that, PPI isn’t on the table.

What neither buys you is bounded variance across prompt rewordings. If the rubric edit is what changed between deploys, both tools will quietly accommodate it and report new numbers that aren’t comparable to the old ones. The next walkthrough reads a tool that goes after that — not by measuring uncertainty, but by telling the judge in advance what parts of the judgment to leave out.

Rule of thumb: probability weighting is decoration on a single judgment; PPI is statistical inference on a population of judgments. Pick the one whose level of analysis matches the question you’re being asked to answer.

Part 4 of seven. Probabilistic calibration: two routes to telling you how sure the judge is — token probabilities inside one call, and a confidence interval around an aggregate. The next walkthrough reads a tool that doesn’t try to measure the judge’s uncertainty at all; it tells the judge in advance which parts of the judgment to leave out. If you arrived here directly, Part 1 names the variance, Part 2 bounds it with binary verdicts, and Part 3 splits one hard judgment into a graph of smaller ones.

Part of the same series; related to a paper I’m presenting on judge variance in OSS evaluators at DITTET 2026 (July).

Written with AI assistance — see AI policy.
