
Reward Kit

Lightweight package to define and run verifiers

Agent skill available

Want your coding agent to help design verifiers with Reward Kit? Install the rewardkit skill:

npx skills add harbor-framework/harbor --skill rewardkit

Rewardkit lets you design verifiers that evaluate an agent's workspace and trajectory. It writes scores to a JSON file and has zero external dependencies. Verifiers are defined by directory structure: each folder maps to a reward, making them easy to read, share, and extend. All criteria are evaluated in parallel by default. You can define criteria in two ways:

  • Programmatic: Python functions that check files, run commands, or inspect outputs
  • Judge-based: LLM- or Agent-as-a-Judge evaluation configured via reusable TOML files

Standalone package

Rewardkit is an independent Python package. You can use it with Harbor or on its own.

Installation

uv tool install harbor-rewardkit

For criteria that operate on image or office files, install the extras:

uv tool install "harbor-rewardkit[all]"

Using with Harbor

Harbor copies your task's tests/ directory into the container at /tests/ and runs test.sh. Put your criteria files alongside test.sh:

tests/
  checks.py
  judge.toml
  test.sh

tests/test.sh
#!/bin/bash
uvx --with harbor-rewardkit@0.1 rewardkit /tests

This discovers all criteria in /tests/, runs them against the workspace at /app, and writes the result to /logs/verifier/reward.json:

/logs/verifier/reward.json
{ "reward": 0.75 }

All defaults match Harbor's conventions.

If your judge criteria need API keys, pass them through [verifier.env] in task.toml:

task.toml
[verifier]
timeout_sec = 300.0

[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"

There's a complete working example you can use as a starting point.

Built-in criteria

Many checks commonly used in popular benchmarks are already built in. Just call them from a Python file in your tests directory:

tests/check.py
import rewardkit as rk

rk.file_exists("output.txt")
rk.file_contains("output.txt", "hello")
rk.command_succeeds("python main.py")

There are 20+ built-in criteria covering frequently used checks for files, commands, JSON, CSV, HTTP, images, and agent trajectories. See Built-in Criteria for the full list.

Custom criteria

When you need logic specific to your task, define your own with the @criterion decorator. The first parameter is always workspace: Path, and the function returns a bool or float. Criteria with no parameters beyond workspace are registered automatically:

tests/check.py
import rewardkit as rk
from pathlib import Path
from rewardkit import criterion

@criterion
def has_valid_output(workspace: Path) -> bool:
    output = (workspace / "output.txt").read_text()
    return len(output.splitlines()) >= 10

Criteria that take custom arguments must be registered by calling them explicitly through rk:

tests/check.py
@criterion(description="output has at least {n} lines")
def has_n_lines(workspace: Path, n: int) -> bool:
    output = (workspace / "output.txt").read_text()
    return len(output.splitlines()) >= n

rk.has_n_lines(10, weight=2.0)
rk.has_n_lines(50, weight=1.0)
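Because criteria may return a float as well as a bool, you can also award partial credit. A minimal sketch of that kind of check body (the helper name and linear scaling below are illustrative, not part of rewardkit):

```python
from pathlib import Path

def line_fraction(workspace: Path, n: int) -> float:
    # Fraction of the required n lines present in output.txt, capped at 1.0.
    lines = (workspace / "output.txt").read_text().splitlines()
    return min(len(lines) / n, 1.0)
```

Wrapped with @criterion and registered through rk, a function like this scores 0.5 when half of the required lines are present instead of failing outright.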

Judge criteria

Some things are hard to check with code — code quality, edge case handling, readability. For these, you can have an LLM evaluate the workspace by creating a TOML file:

tests/quality.toml
[judge]
judge = "anthropic/claude-sonnet-4-6"
files = ["/app/main.py"]

[[criterion]]
description = "Is the code correct?"
type = "binary"

[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5

Both .py and .toml files can coexist in the same directory — their scores are averaged together into one reward.

See Judge Criteria for the full TOML configuration reference including agent judges and trajectory evaluation.

Multi-reward tasks

When you want separate scores for different dimensions (correctness, structure, quality), organize criteria into subdirectories. Each subdirectory becomes its own reward:

tests/
  test.sh
  correctness/
    check.py
  structure/
    files_exist.py
  quality/
    quality.toml

This produces separate scores:

/logs/verifier/reward.json
{ "correctness": 0.75, "structure": 1.0, "quality": 0.6 }

To reuse custom criteria across subdirectories, define them in a root-level file with shared=True:

tests/criteria.py
from pathlib import Path
from rewardkit import criterion

@criterion(shared=True)
def word_count_correct(workspace: Path) -> float:
    # test logic ...
    return score

tests/correctness/check.py
import rewardkit as rk

rk.word_count_correct(weight=3.0)

Weights and scoring

Every criterion accepts an optional weight (default 1.0):

rk.file_exists("output.txt", weight=3.0)
rk.file_exists("readme.md", weight=1.0)
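With these weights, if output.txt exists but readme.md is missing, the weighted average works out as follows (plain arithmetic, not rewardkit's internals):

```python
# (weight, score) pairs: output.txt exists -> 1.0, readme.md missing -> 0.0
results = [(3.0, 1.0), (1.0, 0.0)]
total_weight = sum(w for w, _ in results)
reward = sum(w * s for w, s in results) / total_weight
print(reward)  # 0.75
```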

All criteria in a reward directory are aggregated using a weighted average. Judge criteria are first aggregated per judge — you can control this with the [scoring] section in the TOML file:

tests/quality.toml
[judge]
judge = "anthropic/claude-sonnet-4-6"
files = ["/app/main.py"]
weight = 2.0  # weight of this judge in the overall reward (default 1.0)

[[criterion]]
description = "Is the code correct?"
type = "binary"
weight = 2.0

[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5

[scoring]
aggregation = "all_pass"  # weighted_mean | all_pass | any_pass | threshold

Isolation

Some criteria modify the workspace by running commands that make edits, create files, etc. To prevent one criterion from affecting another, you can run it in isolation. This uses overlayfs to mount the workspace as read-only with writes going to a temporary layer that gets discarded after the criterion completes.

For programmatic criteria, pass isolated=True:

rk.command_succeeds("python main.py", isolated=True)

For agent judges, set isolated = true in the [judge] section:

[judge]
judge = "claude-code"
isolated = true
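Conceptually, an isolated criterion behaves as if it ran against a throwaway copy of the workspace. The sketch below emulates that semantics with a plain copy; it is a simplification, since rewardkit uses overlayfs and avoids copying:

```python
import shutil
import tempfile
from pathlib import Path

def run_isolated(workspace: Path, check) -> bool:
    # Run `check` against a disposable copy so its writes never touch `workspace`.
    with tempfile.TemporaryDirectory() as tmp:
        scratch = Path(tmp) / "ws"
        shutil.copytree(workspace, scratch)
        return check(scratch)
```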

Output

Rewardkit writes two files side by side:

  • reward.json — per-reward scores, e.g. {"correctness": 0.75, "quality": 0.9}
  • reward-details.json — per-criterion results including individual scores, judge reasoning, and any errors. harbor view renders this under Verifier Logs → Rewards as a collapsible tree.

Comparing verifiers

You can pass multiple test directories to compare different verifier designs side by side:

rewardkit /tests/v1 /tests/v2

This runs each directory independently and prints a comparison table for overlapping reward names:

Comparison:
------------------------------------------
reward          v1      v2    diff
------------------------------------------
correctness  0.7500  0.9000  -0.1500
structure    1.0000  1.0000  +0.0000
------------------------------------------

The combined output is written to reward.json with namespaced keys like "v1/correctness" and "v2/correctness".
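The namespaced keys make it easy to recompute the same diffs programmatically; a small sketch using the scores from the table above:

```python
# Combined reward.json content from the comparison run above
combined = {
    "v1/correctness": 0.75, "v1/structure": 1.0,
    "v2/correctness": 0.90, "v2/structure": 1.0,
}
for name in sorted({key.split("/", 1)[1] for key in combined}):
    diff = combined[f"v1/{name}"] - combined[f"v2/{name}"]
    print(f"{name}  {diff:+.4f}")
```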

CLI

rewardkit <tests_dirs...> \
    --workspace /app \
    --output /logs/verifier/reward.json \
    --max-concurrent-programmatic 8 \
    --max-concurrent-llm 8 \
    --max-concurrent-agent 2

All flags are optional with sensible defaults. Passing multiple test directories runs each independently and prints a comparison table.

Python API

import rewardkit as rk

# Run and get scores
scores = rk.run("/tests", workspace="/app")

# Inspect discovered rewards without running them
rewards = rk.discover("/tests", workspace="/app")
