# Reward Kit

Lightweight package to define and run verifiers.
## Agent skill available

Want your coding agent to help design verifiers with Reward Kit? Install the rewardkit skill:

```bash
npx skills add harbor-framework/harbor --skill rewardkit
```

Rewardkit lets you design verifiers against an agent's workspace and trajectory. It writes scores to a JSON file and has zero external dependencies. Verifiers are defined using directory structure: each folder maps to a reward, making them easy to read, share, and extend. All criteria are evaluated in parallel by default. You can define criteria in two ways:
- Programmatic: Python functions that check files, run commands, or inspect outputs
- Judge-based: LLM- or Agent-as-a-Judge evaluation configured via reusable TOML files
## Standalone package

Rewardkit is an independent Python package. You can use it with Harbor or on its own.
## Installation

```bash
uv tool install harbor-rewardkit
```

For criteria running on image or office files, install the extras:

```bash
uv tool install harbor-rewardkit[all]
```

## Using with Harbor
Harbor copies your task's tests/ directory into the container at /tests/ and runs test.sh. Put your criteria files alongside test.sh:

```bash
#!/bin/bash
uvx --with harbor-rewardkit@0.1 rewardkit /tests
```

This discovers all criteria in /tests/, runs them against the workspace at /app, and writes the result to /logs/verifier/reward.json:

```json
{ "reward": 0.75 }
```

All defaults match Harbor's conventions.
If your judge criteria need API keys, pass them through [verifier.env] in task.toml:

```toml
[verifier]
timeout_sec = 300.0

[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
```

There's a complete working example you can use as a starting point.
## Built-in criteria

Many common checks from popular benchmarks are already built in. Just call them from a Python file in your tests directory:

```python
import rewardkit as rk

rk.file_exists("output.txt")
rk.file_contains("output.txt", "hello")
rk.command_succeeds("python main.py")
```

There are 20+ built-in criteria covering files, commands, JSON, CSV, HTTP, images, and agent trajectories. See Built-in Criteria for the full list.
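For intuition, a built-in check like `rk.file_exists` can be pictured as a plain function over the workspace. This is a hedged sketch for illustration, not Rewardkit's actual implementation:

```python
from pathlib import Path

def file_exists(workspace: Path, relpath: str) -> bool:
    # A criterion is just a check against the workspace root:
    # resolve the relative path and test for a regular file.
    return (workspace / relpath).is_file()
```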
## Custom criteria

When you need logic specific to your task, define your own with the @criterion decorator. The first parameter is always workspace: Path, and the function returns a bool or float. Zero-parameter criteria are registered automatically:

```python
import rewardkit as rk
from pathlib import Path
from rewardkit import criterion

@criterion
def has_valid_output(workspace: Path) -> bool:
    output = (workspace / "output.txt").read_text()
    return len(output.splitlines()) >= 10
```

Criteria can also take custom arguments. These need to be called explicitly through rk to register:
```python
@criterion(description="output has at least {n} lines")
def has_n_lines(workspace: Path, n: int) -> bool:
    output = (workspace / "output.txt").read_text()
    return len(output.splitlines()) >= n

rk.has_n_lines(10, weight=2.0)
rk.has_n_lines(50, weight=1.0)
```

## Judge criteria
Some things are hard to check with code: code quality, edge case handling, readability. For these, you can have an LLM evaluate the workspace by creating a TOML file:

```toml
[judge]
judge = "anthropic/claude-sonnet-4-6"
files = ["/app/main.py"]

[[criterion]]
description = "Is the code correct?"
type = "binary"

[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5
```

Both .py and .toml files can coexist in the same directory; their scores are averaged together into one reward.
See Judge Criteria for the full TOML configuration reference including agent judges and trajectory evaluation.
## Multi-reward tasks

When you want separate scores for different dimensions (correctness, structure, quality), organize criteria into subdirectories. Each subdirectory becomes its own reward:
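A layout along these lines would produce three rewards (an illustrative sketch; the file names here are hypothetical):

```text
tests/
├── correctness/
│   └── checks.py
├── structure/
│   └── checks.py
└── quality/
    └── judge.toml
```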
This produces separate scores:

```json
{ "correctness": 0.75, "structure": 1.0, "quality": 0.6 }
```

To reuse custom criteria across subdirectories, define them in a root-level file with shared=True:
```python
from pathlib import Path
from rewardkit import criterion

@criterion(shared=True)
def word_count_correct(workspace: Path) -> float:
    # test logic ...
    return score
```

Then call it from a subdirectory's criteria file:

```python
import rewardkit as rk

rk.word_count_correct(weight=3.0)
```

## Weights and scoring
Every criterion accepts an optional weight (default 1.0):

```python
rk.file_exists("output.txt", weight=3.0)
rk.file_exists("readme.md", weight=1.0)
```

All criteria in a reward directory are aggregated using a weighted average. Judge criteria are first aggregated per judge; you can control this with the [scoring] section in the TOML file:
```toml
[judge]
judge = "anthropic/claude-sonnet-4-6"
files = ["/app/main.py"]
weight = 2.0  # weight of this judge in the overall reward (default 1.0)

[[criterion]]
description = "Is the code correct?"
type = "binary"
weight = 2.0

[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5

[scoring]
aggregation = "all_pass"  # weighted_mean | all_pass | any_pass | threshold
```
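To make the aggregation options concrete, here is a minimal stdlib sketch of how the four modes could combine weighted criterion scores. This illustrates the semantics only, not Rewardkit's actual code, and the `threshold` parameter is a hypothetical default:

```python
def aggregate(scores, weights, mode="weighted_mean", threshold=0.5):
    # scores: per-criterion results in [0, 1]; weights: matching weights
    if mode == "weighted_mean":
        return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    if mode == "all_pass":
        return 1.0 if all(s >= threshold for s in scores) else 0.0
    if mode == "any_pass":
        return 1.0 if any(s >= threshold for s in scores) else 0.0
    if mode == "threshold":
        mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
        return 1.0 if mean >= threshold else 0.0
    raise ValueError(f"unknown aggregation mode: {mode}")
```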
## Isolation

Some criteria modify the workspace by running commands that make edits, create files, etc. To prevent one criterion from affecting another, you can run it in isolation. This uses overlayfs to mount the workspace as read-only, with writes going to a temporary layer that is discarded after the criterion completes.
For programmatic criteria, pass isolated=True:

```python
rk.command_succeeds("python main.py", isolated=True)
```

For agent judges, set isolated = true in the [judge] section:
```toml
[judge]
judge = "claude-code"
isolated = true
```
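The isolation behavior can be pictured as running the criterion against a disposable copy of the workspace. A stdlib sketch of that idea (overlayfs is far cheaper in practice; this only illustrates the semantics):

```python
import shutil
import tempfile
from pathlib import Path

def run_isolated(criterion, workspace: Path):
    # Copy the workspace, run the criterion against the copy,
    # and discard all writes when the criterion completes.
    with tempfile.TemporaryDirectory() as tmp:
        scratch = Path(tmp) / "workspace"
        shutil.copytree(workspace, scratch)
        return criterion(scratch)
```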
## Output

Rewardkit writes two files side by side:

- `reward.json`: per-reward scores, e.g. `{"correctness": 0.75, "quality": 0.9}`
- `reward-details.json`: per-criterion results including individual scores, judge reasoning, and any errors. `harbor view` renders this under Verifier Logs → Rewards as a collapsible tree.
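A downstream script can consume these files with nothing but the stdlib; a small sketch, assuming the default output location:

```python
import json
from pathlib import Path

def load_rewards(log_dir: Path) -> dict:
    # reward.json maps reward names to scores in [0, 1]
    return json.loads((log_dir / "reward.json").read_text())
```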
## Comparing verifiers

You can pass multiple test directories to compare different verifier designs side by side:

```bash
rewardkit /tests/v1 /tests/v2
```

This runs each directory independently and prints a comparison table for overlapping reward names:
```text
Comparison:
------------------------------------------
reward        v1      v2      diff
------------------------------------------
correctness   0.7500  0.9000  -0.1500
structure     1.0000  1.0000  +0.0000
------------------------------------------
```

The combined output is written to reward.json with namespaced keys like "v1/correctness" and "v2/correctness".
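The namespacing could be reproduced as follows (a hedged sketch of the described output shape, not Rewardkit's code):

```python
def combine(results: dict[str, dict[str, float]]) -> dict[str, float]:
    # results maps a directory label (e.g. "v1") to its reward scores;
    # keys in the combined output are namespaced as "<label>/<reward>".
    return {
        f"{label}/{name}": score
        for label, scores in results.items()
        for name, score in scores.items()
    }
```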
CLI
rewardkit <tests_dirs...> \
--workspace /app \
--output /logs/verifier/reward.json \
--max-concurrent-programmatic 8 \
--max-concurrent-llm 8 \
--max-concurrent-agent 2All flags are optional with sensible defaults. Passing multiple test directories runs each independently and prints a comparison table.
## Python API

```python
import rewardkit as rk

# Run and get scores
scores = rk.run("/tests", workspace="/app")

# Inspect discovered rewards without running them
rewards = rk.discover("/tests", workspace="/app")
```