Reward Kit

Judge Criteria

Evaluating agent outputs with LLMs or agent CLIs via TOML configuration

Judge criteria let you use an LLM or agent CLI to evaluate work, configured via TOML files. This makes it trivial to reuse and share rubrics between tasks.

LLM judge

tests/quality.toml
[judge]
judge = "anthropic/claude-sonnet-4-6"
files = ["/app/main.py", "/app/utils.py"]

[[criterion]]
description = "Is the code correct?"
type = "binary"

[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5
weight = 2.0

[[criterion]]
description = "Rate the test coverage on a scale from 0 to 100"
type = "numeric"
min = 0
max = 100

The judge field accepts any LiteLLM model string.

Agent judge

Agent judges shell out to a CLI like Claude Code or Codex. Unlike LLM judges, they can explore the filesystem and run commands:

tests/review.toml
[judge]
judge = "claude-code"
model = "anthropic/claude-sonnet-4-6"
isolated = true

[[criterion]]
description = "Does the solution handle edge cases?"
type = "binary"

Agent judges are slower and more expensive but can interact with the workspace directly.

Configuration reference

[judge] section

  Prop              Type
  judge             string (a LiteLLM model string, or an agent CLI name such as "claude-code")
  model             string
  files             array of strings
  isolated          boolean
  atif-trajectory   string
  prompt_template   string

[[criterion]] entries

  Prop          Type
  description   string
  type          string ("binary", "likert", or "numeric")
  points        integer (likert only)
  weight        float
  min           number (numeric only)
  max           number (numeric only)
[scoring] section

Controls how the judge's criteria are aggregated into a single score. This only affects criteria within this TOML file — it does not change how programmatic and judge scores are combined across the directory.

[scoring]
aggregation = "all_pass"  # weighted_mean | all_pass | any_pass | threshold
threshold = 0.7           # only used with "threshold" aggregation
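The four modes reduce to simple arithmetic over the normalized criterion scores. A sketch, assuming scores are already in [0, 1] and, for illustration only, that a criterion "passes" when its normalized score is exactly 1.0 (the exact pass rule is not specified above):

```python
def aggregate(pairs, mode, threshold=0.7):
    """Illustrative aggregation over (score, weight) pairs."""
    scores = [s for s, _ in pairs]
    if mode == "weighted_mean":
        total = sum(w for _, w in pairs)
        return sum(s * w for s, w in pairs) / total
    if mode == "all_pass":
        # Full credit only if every criterion passes
        return 1.0 if all(s == 1.0 for s in scores) else 0.0
    if mode == "any_pass":
        # Full credit if at least one criterion passes
        return 1.0 if any(s == 1.0 for s in scores) else 0.0
    if mode == "threshold":
        # Assumed semantics: pass iff the weighted mean clears the threshold
        mean = aggregate(pairs, "weighted_mean")
        return 1.0 if mean >= threshold else 0.0
    raise ValueError(f"unknown aggregation: {mode}")
```

For example, with a binary criterion at 1.0 (weight 1.0) and a likert criterion at 0.5 (weight 2.0), the weighted mean is (1.0·1 + 0.5·2) / 3 = 2/3.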

Score normalization

  • Binary: yes/true/1 → 1.0, anything else → 0.0
  • Likert: normalized to [0, 1] as (raw - 1) / (points - 1)
  • Numeric: normalized to [0, 1] as (raw - min) / (max - min)
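The three rules above map directly to code. A sketch:

```python
def normalize_binary(raw):
    # yes/true/1 (case-insensitive) score 1.0; anything else scores 0.0
    return 1.0 if str(raw).strip().lower() in {"yes", "true", "1"} else 0.0

def normalize_likert(raw, points):
    # A 1..points rating maps linearly onto [0, 1]
    return (raw - 1) / (points - 1)

def normalize_numeric(raw, lo, hi):
    # A raw value in [min, max] maps linearly onto [0, 1]
    return (raw - lo) / (hi - lo)
```

So a 3 on a 5-point likert scale normalizes to 0.5, and 75 on a 0-100 numeric scale normalizes to 0.75.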

Trajectory evaluation

To evaluate the agent's process rather than just its output, point the judge at the trajectory file:

[judge]
judge = "anthropic/claude-sonnet-4-6"
atif-trajectory = "/logs/trajectory.json"
files = ["/app/main.py"]

[[criterion]]
description = "Did the agent take an efficient approach?"
type = "likert"
points = 5

When the trajectory is too large for the model's context window, each step's content is truncated proportionally so that every step stays represented rather than later steps being dropped.
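A rough sketch of that idea, with a character budget standing in for a real token budget (this is an illustration of proportional truncation, not Reward Kit's implementation):

```python
def truncate_trajectory(steps, budget):
    """Shrink each step in proportion to its size; never drop a step.

    steps: list of strings, one per trajectory step.
    budget: total character allowance across all steps.
    """
    total = sum(len(s) for s in steps)
    if total <= budget:
        return steps  # already fits; nothing to do
    ratio = budget / total
    # Keep at least one character per step so every step survives
    return [s[: max(1, int(len(s) * ratio))] for s in steps]
```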

Custom prompt templates

You can provide your own prompt instead of the built-in one:

[judge]
judge = "anthropic/claude-sonnet-4-6"
prompt_template = "my_prompt.md"

The template must contain a {criteria} placeholder where criterion descriptions get injected.
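For example, a minimal template (the surrounding wording is illustrative; only the {criteria} placeholder is required):

```markdown
You are reviewing a candidate solution. Judge it against each
criterion below and answer every one.

{criteria}

Be strict: only award full marks when a criterion is clearly met.
```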
