The negative rubric: telling a judge what NOT to evaluate

Part 4 read two tools that tried to measure how sure the judge was — one inside a single call, one across many. The next strategy stops asking how sure. It asks what the judge should be answering in the first place — and tells it, explicitly, what to leave out.

I’m building AgentAnvil, a contract-driven testing framework for LLM-based agents, and reading other evaluators is how I work out what those contracts should constrain. Rubric anchoring is the strategy I had the easiest time recognising and the hardest time pinning down: most LLM-judge tools state the criterion the judge should score, and stop there. One tool — TruLens — adds a second clause: what falls outside the scope of judgment. Most prompts don’t have it. When you find it, it’s load-bearing.

This post reads it from the source.

The intuition

A rubric is a contract with the judge. Score this thing on this scale. The wording is what the judge reads. Whatever the contract doesn’t say, the judge fills in from what it has seen — its training distribution, its default sense of what a good answer looks like, the implicit overlap between this metric and every other rubric it has ever scored. Those defaults are silent. They show up in the score; they don’t show up in the prompt.

The negative-rubric pattern reverses the default. Instead of writing the criterion and trusting the judge to interpret it cleanly, you write the criterion and enumerate what the judge should ignore. Plan quality is what you’re scoring; don’t reference whether the plan succeeded. When tool calling is what you’re scoring, tool selection belongs to a sibling metric — leave it alone. Execution efficiency cares about sequencing, not whether the agent reached the goal. The space the judge is allowed to weigh shrinks deliberately.

This sounds defensive, and it is. The negative-rubric clause is there because the author of the rubric has seen the judge double-count, or score the wrong thing, or contaminate one dimension with another. It’s the seam where the author’s prior work — what the model gets wrong, repeatedly — gets baked into the prompt.

Part 4 read tools that measured the judge’s uncertainty after the fact. This is a different lever entirely. The judge doesn’t get more sure; the judge gets a tighter target.

TruLens, step by step

Every built-in TruLens feedback function lands in one of two methods on LLMProvider (feedback/llm_provider.py:180,283): generate_score or generate_score_and_reasons. Both assemble a [system, user] message pair and dispatch to _create_chat_completion. The OpenAI provider’s implementation is short:

return self.endpoint.client.chat.completions.create(
    messages=input_messages, **kwargs
)

kwargs carries a quiet detail. SEED: int = 123 is hardcoded at the top of the OpenAI provider module and injected by default into every call unless the caller passes its own seed (providers/openai/provider.py:23,442-443). TruLens trades sampling diversity for run-to-run reproducibility. That is a defensible call for an evaluation harness, and a surprising one for a reader who came in expecting temperature-driven randomness. One caveat: OpenAI documents the seed parameter as best-effort, so a fixed seed makes identical outputs likely, not guaranteed.
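A minimal sketch of that default-injection behaviour, assuming nothing about the real provider beyond what the paragraph above says. The helper name with_default_seed is hypothetical and is not a TruLens function; only the value 123 comes from the source.

```python
# Hypothetical sketch, not the TruLens source: a default seed is injected
# only when the caller has not supplied one.
SEED = 123  # value quoted in the text above


def with_default_seed(kwargs: dict) -> dict:
    """Return a copy of kwargs with `seed` filled in if the caller omitted it."""
    merged = dict(kwargs)            # copy so the caller's dict is not mutated
    merged.setdefault("seed", SEED)  # a caller-supplied seed always wins
    return merged


print(with_default_seed({"temperature": 0.0}))
print(with_default_seed({"seed": 7}))
```

The point of the shape is that the override is opt-out: anyone who wants sampling diversity has to pass a seed explicitly.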

The system prompt is template-driven. Each domain has its own file: templates/rag.py for RAG metrics, templates/quality.py for general quality, templates/safety.py for safety, templates/agent.py for agent-specific metrics.

A grep across all template files for “do not / don’t / only / regardless / out-of-scope” returns thirty-six matches. Fifteen of them are in templates/agent.py, and those are the structured ones: In-scope:/Out-of-scope: blocks, sibling-metric deferrals, “ignore execution”. The rest are mostly boilerplate (“respond only as a number”) in the other files. Plan, tool, and execution metrics carry the pattern; faithfulness, harmfulness, correctness, and sentiment almost never do.
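For anyone who wants to repeat that kind of count on their own prompts, here is a rough sketch. The phrase list comes from the paragraph above; the regex, the count_markers helper, and the sample strings are my assumptions, not the command used for the numbers in this post and not real TruLens templates.

```python
import re

# Phrases from the text above. The regex is an assumption about how to
# match them, not the exact search behind the counts quoted in the post.
NEGATIVE_MARKERS = re.compile(
    r"do not|don['’]t|\bonly\b|regardless|out-of-scope",
    re.IGNORECASE,
)


def count_markers(templates: dict[str, str]) -> dict[str, int]:
    """Count negative-scope markers per template (file name -> file text)."""
    return {
        name: len(NEGATIVE_MARKERS.findall(text))
        for name, text in templates.items()
    }


# Made-up stand-in strings, not real TruLens templates.
sample = {
    "agent.py": "Out-of-scope: tool selection. Do not infer plan defects.",
    "quality.py": "Respond only as a number.",
}
print(count_markers(sample))
```

A raw count like this cannot tell a structured scope clause from output-format boilerplate, which is why the matches still have to be read by hand.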

The clearest case is the system prompt for PlanQuality (templates/agent.py:212-225). After a positive scoring criterion it inserts a block whose first line is CRITICAL: and whose body forbids the judge from inferring plan defects from agent behaviour, even when the trace shows the agent failed. The same template later instructs the judge: “You are judging the strategy, not the outcome.” The negative clause appears twice in one prompt — and the prompt is short.

The most structured example is in the ToolCalling system prompt (templates/agent.py:361-363), where TruLens introduces a named section:

Important scope boundaries:
- In-scope: argument/schema correctness, semantic fit of query,
  preconditions/postconditions, grounded interpretation of outputs,
  explicit handling of tool-returned errors.
- Out-of-scope: tool selection (Tool Selection), workflow efficiency
  (Execution Efficiency), external service/tool reliability (Tool Quality).

Notice the parenthetical references. The out-of-scope clauses name the sibling metrics the judge should defer to. ToolSelection (agent.py:305-307) and ToolQuality (agent.py:416-418) carry their own Important scope boundaries: blocks, so three of the seven agent metrics use the exact same structure: sister sections that reference each other. The judgment space across the seven agent metrics (ToolCalling, ToolSelection, ToolQuality, ExecutionEfficiency, PlanQuality, PlanAdherence, LogicalConsistency) is partitioned by hand, and each rubric tells the judge what to leave to the others.
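A hand-maintained partition like this can be sanity-checked mechanically. The sketch below is my own illustration, not anything TruLens ships: the Rubric class and dangling_deferrals helper are hypothetical, the rendered text is a simplification of the quoted block, and only the seven metric names and the ToolCalling deferrals are taken from the text above.

```python
from dataclasses import dataclass, field

# Metric names come from the text above; everything else is illustrative.
AGENT_METRICS = {
    "ToolCalling", "ToolSelection", "ToolQuality", "ExecutionEfficiency",
    "PlanQuality", "PlanAdherence", "LogicalConsistency",
}


@dataclass
class Rubric:
    name: str
    in_scope: list[str]
    # Maps an out-of-scope concern to the sibling metric that owns it.
    out_of_scope: dict[str, str] = field(default_factory=dict)

    def scope_block(self) -> str:
        """Render a simplified In-scope / Out-of-scope prompt section."""
        deferred = ", ".join(f"{c} ({m})" for c, m in self.out_of_scope.items())
        return (
            "Important scope boundaries:\n"
            f"- In-scope: {', '.join(self.in_scope)}.\n"
            f"- Out-of-scope: {deferred}."
        )


def dangling_deferrals(rubrics: list[Rubric]) -> list[tuple[str, str]]:
    """Return (rubric, target) pairs deferring to an unknown metric or to itself."""
    return [
        (r.name, target)
        for r in rubrics
        for target in r.out_of_scope.values()
        if target not in AGENT_METRICS or target == r.name
    ]


tool_calling = Rubric(
    name="ToolCalling",
    in_scope=["argument/schema correctness", "explicit handling of tool-returned errors"],
    out_of_scope={
        "tool selection": "ToolSelection",
        "workflow efficiency": "ExecutionEfficiency",
        "external service/tool reliability": "ToolQuality",
    },
)
print(tool_calling.scope_block())
print(dangling_deferrals([tool_calling]))
```

A check like this only confirms that every deferral points at a metric that exists. It says nothing about whether the boundary is drawn in the right place.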

The CoT layer is grafted on top of that, the same way for every metric. COT_REASONS_TEMPLATE lives in templates/base.py:198-206 and is a fixed scaffold — “Criteria: …, Supporting Evidence: …, Score: …”. Every *_with_cot_reasons method substitutes that block into the user prompt’s trailing label — user_prompt.replace("<METRIC> SCORE:", COT_REASONS_TEMPLATE), with the exact key varying per metric (TOOL CALLING SCORE:, LOGICAL CONSISTENCY SCORE:, etc.). The system prompt — where the negative clauses live — stays untouched.
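The substitution is easy to picture with stand-in strings. In this sketch the template text and both prompts are shortened placeholders, not the real TruLens strings, and the add_cot helper with its guard is my own addition.

```python
# Shortened stand-ins: the real COT_REASONS_TEMPLATE and prompts live in
# TruLens; only the Criteria / Supporting Evidence / Score shape is from the text.
COT_REASONS_TEMPLATE = (
    "Criteria: <criteria>\n"
    "Supporting Evidence: <evidence>\n"
    "Score: <score>"
)


def add_cot(user_prompt: str, score_label: str) -> str:
    """Swap the trailing score label in the user prompt for the CoT scaffold."""
    if score_label not in user_prompt:
        # str.replace would silently return the prompt unchanged here.
        raise ValueError(f"label {score_label!r} not found in user prompt")
    return user_prompt.replace(score_label, COT_REASONS_TEMPLATE)


system_prompt = "Out-of-scope: tool selection (Tool Selection)."  # stand-in
user_prompt = "TRACE: <trace goes here>\n\nTOOL CALLING SCORE:"   # stand-in

cot_user_prompt = add_cot(user_prompt, "TOOL CALLING SCORE:")
print(system_prompt)    # unchanged: the negative clauses are not touched
print(cot_user_prompt)
```

The guard exists because a label mismatch would otherwise drop the CoT scaffold without any error. Whether TruLens itself guards against that case is not something this post checked.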

After all that prompting, the score itself is small. The default output space is a Likert 0-3 (templates/base.py:179). The model returns structured JSON; on parse failure the code falls back to a regex that catches numeric tokens (generated.py:35,97). If the response contains multiple numbers (“5 out of 10”, “between 2 and 4”), it picks the minimum. That’s a defensive default that biases low: an ambiguous response can only pull the score down, never up. The accepted number is then normalised by (raw - min) / (max - min) to a 0-1 float, and that float is what the user sees.
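The parse-then-normalise path can be sketched in a few lines. This is an illustration of the behaviour described above, not the code in generated.py: the function names, the "score" JSON key, and the regex are assumptions, and range validation of the extracted number is left out.

```python
import json
import re

# Assumed pattern for "numeric tokens"; the real regex may differ.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_raw_score(response: str, key: str = "score") -> float:
    """Try structured JSON first, then fall back to the smallest number found."""
    try:
        return float(json.loads(response)[key])  # `key` is an assumed field name
    except (ValueError, KeyError, TypeError):
        pass  # not JSON, or not the expected shape: use the regex fallback
    numbers = [float(token) for token in NUMBER.findall(response)]
    if not numbers:
        raise ValueError("no numeric token in judge response")
    return min(numbers)  # several numbers: take the minimum, which biases low


def normalise(raw: float, lo: float = 0, hi: float = 3) -> float:
    """Map a raw score on [lo, hi] to a 0-1 float: (raw - min) / (max - min)."""
    return (raw - lo) / (hi - lo)


print(extract_raw_score('{"score": 2}'))
print(extract_raw_score("between 2 and 4"))
print([normalise(raw) for raw in range(4)])
```

On the default 0-3 scale the normalised output can take only four values (0, 1/3, 2/3, 1), so the float the user sees is coarser than it looks.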

Be honest about what this is. The negative-rubric pattern in TruLens isn’t a general design principle running through every prompt — it shows up where it was needed and not elsewhere. It’s an agent-evaluation invention, added when TruLens 2.x grew seven agent metrics that needed to disambiguate themselves. Read the older rubrics — RAG, quality, safety — and the negative clauses thin out or disappear.

What you lose

Both halves of the prompt grow. The negative clauses lengthen the system prompt, the CoT scaffold lengthens the user prompt, and every score call pays for both in tokens. For a CI gate that runs hundreds of feedback calls per evaluation, the cost is real, though the same is true of any structured rubric.

[Interactive figure: a TruLens agent rubric, built up one layer at a time. Each toggle grafts a real clause from templates/agent.py onto the prompt: CoT scaffolding (base.py:198), In-scope / Out-of-scope block (agent.py:361), CRITICAL failure warning (agent.py:212), "Strategy, not outcome" clause (agent.py:224). The base prompt is approximately 52 tokens; the growth scale runs from base (1.00×) to 5×.]

The harder cost is invisible. A negative clause that’s wrong tells the judge to ignore the very signal it should weigh. The author’s prior, encoded in the prompt, is a load-bearing assumption. If the prior is wrong, the metric is wrong on every test case the same way — quietly, and without a parser to catch it.

The pattern also requires you to have something to disambiguate. TruLens needed it because seven agent metrics overlap. A single-purpose judge — one rubric, one score — has nothing to push against. The negative clause becomes decorative (“don’t penalise verbosity”), then unused, then deleted in the next revision. And rubric anchoring narrows the judge’s question; it doesn’t narrow the judge’s output. The judge still returns a 0-3 Likert that gets normalised to a 0-1 float. Whatever variance lives in producing that integer is still there. You’ve changed what gets scored, not how stably.

When this approach fits

One question decides whether this strategy applies to your setup.

Do you have more than one metric that could plausibly weigh the same evidence? If yes, rubric anchoring is the cheapest move you can make to keep them from double-counting. Plan quality and execution efficiency both have a view of whether the agent succeeded; telling each prompt to ignore the other’s territory keeps them measuring different things. If no — if your judge has one rubric and one verdict — the negative clauses don’t add anything you can verify, and the prompt grows without payoff.

Agent evaluation is the natural home — once you’re scoring plan, execution, and tool dimensions independently, overlapping criteria are unavoidable. RAG might benefit at the margin; quality and safety usually don’t.

What rubric anchoring doesn’t do is tell you whether the rubric is the right one. The negative clauses encode the author’s prior about where the judge fails. That prior is a guess until it’s measured. The next walkthrough reads a tool that goes after the same problem from the other direction — it asks where you can skip the judge entirely and use deterministic checks instead.

Rule of thumb: rubric anchoring is what you write when you’ve watched the judge confuse itself. If you haven’t yet, you don’t need it. If you have, you’ll know what to ignore.

Part 5 of seven. Rubric anchoring: tell the judge what to score, then tell it what to ignore. The negative-rubric clauses TruLens uses for agent metrics are the seam where the rubric author’s prior work — what the judge gets wrong, repeatedly — gets baked into the prompt. The next walkthrough reads a tool that goes further and asks where you can skip the judge entirely. If you arrived here directly, Part 1 names the variance, Part 2 bounds it with binary verdicts, Part 3 splits one hard judgment into a graph of smaller ones, and Part 4 measures how sure the answer was.

Part of the same series; related to a paper I’m presenting on judge variance in OSS evaluators at DITTET 2026 (July).

Written with AI assistance — see AI policy.

Tags: LLM-as-judge, Evaluation methodology
