# Judge Criteria

Evaluating agent outputs with LLMs or agent CLIs via TOML configuration.
Judge criteria let you use an LLM or agent CLI to evaluate work, configured via TOML files. This makes it trivial to reuse and share rubrics between tasks.
## LLM judge
```toml
[judge]
judge = "anthropic/claude-sonnet-4-6"
files = ["/app/main.py", "/app/utils.py"]

[[criterion]]
description = "Is the code correct?"
type = "binary"

[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5
weight = 2.0

[[criterion]]
description = "Rate the test coverage on a scale from 0 to 100"
type = "numeric"
min = 0
max = 100
```

The `judge` field accepts any LiteLLM model string.
## Agent judge
Agent judges shell out to a CLI like Claude Code or Codex. Unlike LLM judges, they can explore the filesystem and run commands:
```toml
[judge]
judge = "claude-code"
model = "anthropic/claude-sonnet-4-6"
isolated = true

[[criterion]]
description = "Does the solution handle edge cases?"
type = "binary"
```

Agent judges are slower and more expensive but can interact with the workspace directly.
## Configuration reference
### `[judge]` section

| Prop | Type |
| --- | --- |
| `judge` | string (LiteLLM model string, or an agent CLI name such as `"claude-code"`) |
| `model` | string |
| `files` | array of strings |
| `isolated` | boolean |
| `atif-trajectory` | string |
| `prompt_template` | string |

### `[[criterion]]` entries

| Prop | Type |
| --- | --- |
| `description` | string |
| `type` | string (`"binary"`, `"likert"`, or `"numeric"`) |
| `points` | integer (likert) |
| `weight` | float |
| `min` | number (numeric) |
| `max` | number (numeric) |
### `[scoring]` section
Controls how the judge's criteria are aggregated into a single score. This only affects criteria within this TOML file — it does not change how programmatic and judge scores are combined across the directory.
```toml
[scoring]
aggregation = "all_pass" # weighted_mean | all_pass | any_pass | threshold
threshold = 0.7          # only used with "threshold" aggregation
```

### Score normalization
- Binary: `yes`/`true`/`1` → 1.0, anything else → 0.0
- Likert: normalized to [0, 1] as `(raw - 1) / (points - 1)`
- Numeric: normalized to [0, 1] as `(raw - min) / (max - min)`
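The normalization formulas above, and the aggregation modes from `[scoring]`, can be sketched in Python. This is an illustration only, not the harness's implementation: the function names and criterion dict shapes are invented, and the pass condition assumed for `all_pass`/`any_pass` (a normalized score of 1.0) is a guess.

```python
def normalize(criterion: dict, raw) -> float:
    """Map a raw judge answer onto [0, 1] per the criterion type."""
    if criterion["type"] == "binary":
        # yes/true/1 -> 1.0, anything else -> 0.0
        return 1.0 if str(raw).lower() in ("yes", "true", "1") else 0.0
    if criterion["type"] == "likert":
        return (raw - 1) / (criterion["points"] - 1)
    if criterion["type"] == "numeric":
        return (raw - criterion["min"]) / (criterion["max"] - criterion["min"])
    raise ValueError(f"unknown criterion type: {criterion['type']}")


def aggregate(scores, weights, aggregation="weighted_mean", threshold=0.7):
    """Combine normalized per-criterion scores into one score (sketch)."""
    weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    if aggregation == "weighted_mean":
        return weighted_mean
    if aggregation == "all_pass":
        # Assumption: "pass" means a full score of 1.0 on every criterion.
        return 1.0 if all(s >= 1.0 for s in scores) else 0.0
    if aggregation == "any_pass":
        return 1.0 if any(s >= 1.0 for s in scores) else 0.0
    if aggregation == "threshold":
        return 1.0 if weighted_mean >= threshold else 0.0
    raise ValueError(f"unknown aggregation: {aggregation}")
```

For example, a likert criterion with `points = 5` maps a raw answer of 3 to `(3 - 1) / (5 - 1) = 0.5`, and `weight` then scales its contribution under `weighted_mean`.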
## Trajectory evaluation
To evaluate the agent's process rather than just its output, point the judge at the trajectory file:
```toml
[judge]
judge = "anthropic/claude-sonnet-4-6"
atif-trajectory = "/logs/trajectory.json"
files = ["/app/main.py"]

[[criterion]]
description = "Did the agent take an efficient approach?"
type = "likert"
points = 5
```

The trajectory content is truncated proportionally to fit within the model's context window while preserving all steps.
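Proportional truncation might look like the following sketch. This is a minimal illustration of the idea, not the harness's actual code: it budgets in characters rather than tokens, and the function name is invented. Each step is shortened by the same ratio, so no step is dropped entirely.

```python
def truncate_trajectory(steps: list[str], budget: int) -> list[str]:
    """Shrink every step by the same ratio so the whole trajectory
    fits within `budget` characters while keeping all steps.

    Sketch only: a real judge would budget in model tokens.
    """
    total = sum(len(s) for s in steps)
    if total <= budget:
        return steps  # already fits, nothing to cut
    ratio = budget / total
    # Keep the same fraction of each step; min 1 char so no step vanishes.
    return [s[: max(1, int(len(s) * ratio))] for s in steps]
```

A trajectory twice the size of the budget would keep roughly the first half of every step, rather than the first half of the steps.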
## Custom prompt templates
You can provide your own prompt instead of the built-in one:
```toml
[judge]
judge = "anthropic/claude-sonnet-4-6"
prompt_template = "my_prompt.md"
```

The template must contain a `{criteria}` placeholder where criterion descriptions get injected.
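A minimal `my_prompt.md` might look like the sketch below. The wording is illustrative; the only hard requirement stated above is the `{criteria}` placeholder, which gets replaced with the criterion descriptions at run time.

```markdown
You are grading an agent's work on a coding task. Evaluate it against
each criterion below and answer every one.

{criteria}

Be strict: only give full credit when a criterion is clearly satisfied.
```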