Reward Kit

Judge Criteria

Evaluating agent outputs with LLMs or agent CLIs via TOML configuration

Judge criteria let you use an LLM or agent CLI to evaluate work, configured via TOML files. This makes it trivial to reuse and share rubrics between tasks.

LLM judge

tests/quality.toml
[judge]
judge = "anthropic/claude-sonnet-4-6"
files = ["/app/main.py", "/app/utils.py"]

[[criterion]]
description = "Is the code correct?"
type = "binary"

[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5
weight = 2.0

[[criterion]]
description = "Rate the test coverage on a scale from 0 to 100"
type = "numeric"
min = 0
max = 100

The judge field accepts any LiteLLM model string.

Agent judge

Agent judges shell out to a CLI like Claude Code or Codex. Unlike LLM judges, they can explore the filesystem and run commands:

tests/review.toml
[judge]
judge = "claude-code"
model = "anthropic/claude-sonnet-4-6"
isolated = true

[[criterion]]
description = "Does the solution handle edge cases?"
type = "binary"

Agent judges are slower and more expensive but can interact with the workspace directly.

Configuration reference

[judge] section

  Prop              Type
  judge             string (a LiteLLM model string, or an agent CLI name such as "claude-code")
  model             string
  files             array of strings
  isolated          boolean
  atif-trajectory   string
  prompt_template   string

[[criterion]] entries

  Prop          Type
  description   string
  type          string ("binary", "likert", or "numeric")
  points        integer (likert only)
  weight        float
  min           number (numeric only)
  max           number (numeric only)
[scoring] section

Controls how the judge's criteria are aggregated into a single score. This only affects criteria within this TOML file — it does not change how programmatic and judge scores are combined across the directory.

[scoring]
aggregation = "all_pass"  # weighted_mean | all_pass | any_pass | threshold
threshold = 0.7           # only used with "threshold" aggregation
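The four modes reduce to simple arithmetic over the normalized criterion scores. A sketch, assuming scores are already in [0, 1] and, for illustration only, that a criterion "passes" when its normalized score is exactly 1.0 (the exact pass rule is not specified above):

```python
def aggregate(pairs, mode, threshold=0.7):
    """Illustrative aggregation over (score, weight) pairs."""
    scores = [s for s, _ in pairs]
    if mode == "weighted_mean":
        total = sum(w for _, w in pairs)
        return sum(s * w for s, w in pairs) / total
    if mode == "all_pass":
        # Full credit only if every criterion passes
        return 1.0 if all(s == 1.0 for s in scores) else 0.0
    if mode == "any_pass":
        # Full credit if at least one criterion passes
        return 1.0 if any(s == 1.0 for s in scores) else 0.0
    if mode == "threshold":
        # Assumed semantics: pass iff the weighted mean clears the threshold
        mean = aggregate(pairs, "weighted_mean")
        return 1.0 if mean >= threshold else 0.0
    raise ValueError(f"unknown aggregation: {mode}")
```

For example, with a binary criterion at 1.0 (weight 1.0) and a likert criterion at 0.5 (weight 2.0), the weighted mean is (1.0·1 + 0.5·2) / 3 = 2/3.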

Score normalization

  • Binary: yes/true/1 → 1.0, anything else → 0.0
  • Likert: normalized to [0, 1] as (raw - 1) / (points - 1)
  • Numeric: normalized to [0, 1] as (raw - min) / (max - min)
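The three rules above map directly to code. A sketch:

```python
def normalize_binary(raw):
    # yes/true/1 (case-insensitive) score 1.0; anything else scores 0.0
    return 1.0 if str(raw).strip().lower() in {"yes", "true", "1"} else 0.0

def normalize_likert(raw, points):
    # A 1..points rating maps linearly onto [0, 1]
    return (raw - 1) / (points - 1)

def normalize_numeric(raw, lo, hi):
    # A raw value in [min, max] maps linearly onto [0, 1]
    return (raw - lo) / (hi - lo)
```

So a 3 on a 5-point likert scale normalizes to 0.5, and 75 on a 0-100 numeric scale normalizes to 0.75.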

Trajectory evaluation

To evaluate the agent's process rather than just its output, point the judge at the trajectory file:

[judge]
judge = "anthropic/claude-sonnet-4-6"
atif-trajectory = "/logs/trajectory.json"
files = ["/app/main.py"]

[[criterion]]
description = "Did the agent take an efficient approach?"
type = "likert"
points = 5

When the trajectory is too large for the model's context window, each step's content is truncated proportionally so that every step stays represented rather than later steps being dropped.
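A rough sketch of that idea, with a character budget standing in for a real token budget (this is an illustration of proportional truncation, not Reward Kit's implementation):

```python
def truncate_trajectory(steps, budget):
    """Shrink each step in proportion to its size; never drop a step.

    steps: list of strings, one per trajectory step.
    budget: total character allowance across all steps.
    """
    total = sum(len(s) for s in steps)
    if total <= budget:
        return steps  # already fits; nothing to do
    ratio = budget / total
    # Keep at least one character per step so every step survives
    return [s[: max(1, int(len(s) * ratio))] for s in steps]
```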

Custom prompt templates

You can provide your own prompt instead of the built-in one:

[judge]
judge = "anthropic/claude-sonnet-4-6"
prompt_template = "my_prompt.md"

The template must contain a {criteria} placeholder where criterion descriptions get injected.
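For example, a minimal template (the surrounding wording is illustrative; only the {criteria} placeholder is required):

```markdown
You are reviewing a candidate solution. Judge it against each
criterion below and answer every one.

{criteria}

Be strict: only award full marks when a criterion is clearly met.
```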
