Reward Kit
Lightweight package to define and run verifiers
Agent skill available
Want your coding agent to help design verifiers with Reward Kit? Install the rewardkit skill:
npx skills add harbor-framework/harbor --skill rewardkit
Rewardkit lets you design verifiers against an agent's workspace and trajectory. It writes scores to a JSON file and has zero external dependencies. Verifiers are defined by directory structure: each folder maps to a reward, which makes them easy to read, share, and extend. All criteria are evaluated in parallel by default. You can define criteria in two ways:
- Programmatic: Python functions that check files, run commands, or inspect outputs
- Judge-based: LLM- or Agent-as-a-Judge evaluation configured via reusable TOML files
Standalone package
Rewardkit is an independent Python package. You can use it with Harbor or on its own.
Installation
uv tool install harbor-rewardkit
For criteria or judges that read images or common document files (PDF, DOCX, PPTX, XLSX), install the extras:
uv tool install harbor-rewardkit[all]
Using with Harbor
Harbor copies your task's tests/ directory into the container at /tests/ and runs test.sh. Put your criteria files alongside test.sh:
#!/bin/bash
uvx --with harbor-rewardkit@0.1 rewardkit /tests
This discovers all criteria in /tests/, runs them against the workspace at /app, and writes the result to /logs/verifier/reward.json:
{ "reward": 0.75 }
All defaults match Harbor's conventions.
If your judge criteria need API keys, pass them through [verifier.env] in task.toml:
[verifier]
timeout_sec = 300.0
[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
There's a complete working example you can use as a starting point.
To swap providers without editing the rubric, see Provider routing.
Built-in criteria
Many checks that are common in popular benchmarks are already built in. Call them from a Python file in your tests directory:
import rewardkit as rk
rk.file_exists("output.txt")
rk.file_contains("output.txt", "hello")
rk.command_succeeds("python main.py")
There are 20+ built-in criteria for grading files, commands, JSON, CSV, HTTP, images, and agent trajectories. See Built-in Criteria for the full list.
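To give a feel for what checks like these amount to, here is a rough standard-library approximation. It is not rewardkit's implementation: the helper names (check_file_exists, check_file_contains, check_command_succeeds) are hypothetical, and resolving relative paths against the workspace and treating exit code 0 as success are assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def check_file_exists(workspace: Path, rel_path: str) -> bool:
    """True if rel_path is a regular file inside the workspace (assumed semantics)."""
    return (workspace / rel_path).is_file()


def check_file_contains(workspace: Path, rel_path: str, text: str) -> bool:
    """True if the file exists and contains the given substring."""
    target = workspace / rel_path
    return target.is_file() and text in target.read_text(encoding="utf-8")


def check_command_succeeds(workspace: Path, command: list[str]) -> bool:
    """True if the command exits with status 0 when run from the workspace."""
    completed = subprocess.run(command, cwd=workspace, capture_output=True)
    return completed.returncode == 0


# Demo against a throwaway workspace so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    (workspace / "output.txt").write_text("hello world\n", encoding="utf-8")
    results = {
        "file_exists": check_file_exists(workspace, "output.txt"),
        "file_contains": check_file_contains(workspace, "output.txt", "hello"),
        "command_succeeds": check_command_succeeds(
            workspace, [sys.executable, "-c", "print('ok')"]
        ),
    }

print(results)
```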
Custom criteria
When you need logic specific to your task, define your own with the @criterion decorator. The first parameter is always workspace: Path, and the function returns a bool or float. Criteria that take no parameters beyond workspace are registered automatically:
import rewardkit as rk
from pathlib import Path
from rewardkit import criterion
@criterion
def has_valid_output(workspace: Path) -> bool:
    output = (workspace / "output.txt").read_text()
    return len(output.splitlines()) >= 10
Criteria can also take custom arguments. These need to be called explicitly through rk to register:
@criterion(description="output has at least {n} lines")
def has_n_lines(workspace: Path, n: int) -> bool:
    output = (workspace / "output.txt").read_text()
    return len(output.splitlines()) >= n

rk.has_n_lines(10, weight=2.0)
rk.has_n_lines(50, weight=1.0)
Judge criteria
Some things are hard to check with code — code quality, edge case handling, readability. For these, you can have an LLM evaluate the workspace by creating a TOML file:
[judge]
judge = "anthropic/claude-sonnet-4-6"
files = ["/app/main.py"]
[[criterion]]
description = "Is the code correct?"
type = "binary"
[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5
Both .py and .toml files can coexist in the same directory; their scores are averaged together into one reward.
See Judge Criteria for the full TOML configuration reference including agent judges and trajectory evaluation.
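As a rough illustration of how scores from both kinds of files might combine into one reward, the sketch below computes a weighted mean over (score, weight) pairs and treats a judge as a single entry whose score is the weighted mean of its own criteria. The function name and the sample scores are hypothetical, and the exact way rewardkit combines programmatic and judge scores is an assumption based on the weighted average described under Weights and scoring.

```python
def weighted_mean(entries: list[tuple[float, float]]) -> float:
    """Weighted mean of (score, weight) pairs; 0.0 when the total weight is zero."""
    total_weight = sum(weight for _, weight in entries)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in entries) / total_weight


# Two programmatic criteria: one passing with weight 3, one failing with weight 1.
programmatic = [(1.0, 3.0), (0.0, 1.0)]

# One judge: aggregate its criteria first, then treat it as one entry with weight 2.
judge_score = weighted_mean([(1.0, 2.0), (0.5, 1.0)])
judge_weight = 2.0

reward = weighted_mean(programmatic + [(judge_score, judge_weight)])
print(round(weighted_mean(programmatic), 4))  # 0.75
print(round(reward, 4))  # 0.7778
```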
Multi-reward tasks
When you want separate scores for different dimensions (correctness, structure, quality), organize criteria into subdirectories. Each subdirectory becomes its own reward. For example, a tests directory with correctness/, structure/, and quality/ subdirectories produces separate scores:
{ "correctness": 0.75, "structure": 1.0, "quality": 0.6 }
To reuse custom criteria across subdirectories, define them in a root-level file with shared=True:
from pathlib import Path
from rewardkit import criterion
@criterion(shared=True)
def word_count_correct(workspace: Path) -> float:
    # test logic ...
    return score
Then call it from a Python file in any subdirectory:
import rewardkit as rk
rk.word_count_correct(weight=3.0)
Weights and scoring
Every criterion accepts an optional weight (default 1.0):
rk.file_exists("output.txt", weight=3.0)
rk.file_exists("readme.md", weight=1.0)
All criteria in a reward directory are aggregated using a weighted average. Judge criteria are first aggregated per judge; you can control this with the [scoring] section in the TOML file:
[judge]
judge = "anthropic/claude-sonnet-4-6"
files = ["/app/main.py"]
weight = 2.0 # weight of this judge in the overall reward (default 1.0)
[[criterion]]
description = "Is the code correct?"
type = "binary"
weight = 2.0
[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5
[scoring]
aggregation = "all_pass" # weighted_mean | all_pass | any_pass | threshold
Isolation
Some criteria modify the workspace by running commands that make edits, create files, etc. To prevent one criterion from affecting another, you can run it in isolation. This uses overlayfs to mount the workspace as read-only with writes going to a temporary layer that gets discarded after the criterion completes.
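The effect of isolation can be illustrated without overlayfs. The sketch below is a stand-in, not rewardkit's mechanism: it copies the workspace to a temporary directory, runs a check against the copy, and discards it, so writes never reach the original. The run_isolated helper and the noisy_check function are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_isolated(workspace: Path, check: Callable[[Path], bool]) -> bool:
    """Run check against a throwaway copy of workspace (copy-based stand-in for overlayfs)."""
    with tempfile.TemporaryDirectory() as tmp:
        scratch = Path(tmp) / "workspace"
        shutil.copytree(workspace, scratch)
        # The copy is deleted when the with-block exits, after check returns.
        return check(scratch)


def noisy_check(workspace: Path) -> bool:
    """A check with a side effect: it writes a file before returning its verdict."""
    (workspace / "side_effect.txt").write_text("created during the check\n", encoding="utf-8")
    return (workspace / "main.py").is_file()


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp)
    (original / "main.py").write_text("print('hi')\n", encoding="utf-8")
    passed = run_isolated(original, noisy_check)
    leaked = (original / "side_effect.txt").exists()

print(passed, leaked)  # True False
```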
For programmatic criteria, pass isolated=True:
rk.command_succeeds("python main.py", isolated=True)
For agent judges, set isolated = true in the [judge] section:
[judge]
judge = "claude-code"
isolated = true
Output
Rewardkit writes two files side by side:
- reward.json: per-reward scores, e.g. {"correctness": 0.75, "quality": 0.9}
- reward-details.json: per-criterion results including individual scores, judge reasoning, and any errors. harbor view renders this under Verifier Logs → Rewards as a collapsible tree.
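A downstream script can consume reward.json with the standard library alone. The sketch below assumes the flat name-to-score shape shown above. The load_rewards and lowest_reward helpers are hypothetical, and the demo writes a temporary file rather than reading /logs/verifier/reward.json so that it runs anywhere.

```python
import json
import tempfile
from pathlib import Path


def load_rewards(path: Path) -> dict[str, float]:
    """Read a reward.json file and return its name -> score mapping."""
    with path.open(encoding="utf-8") as fh:
        data = json.load(fh)
    return {name: float(score) for name, score in data.items()}


def lowest_reward(rewards: dict[str, float]) -> tuple[str, float]:
    """Return the (name, score) pair with the smallest score."""
    name = min(rewards, key=rewards.get)
    return name, rewards[name]


# Demo against a temporary file holding the sample content from above.
with tempfile.TemporaryDirectory() as tmp:
    reward_file = Path(tmp) / "reward.json"
    reward_file.write_text('{"correctness": 0.75, "quality": 0.9}', encoding="utf-8")
    rewards = load_rewards(reward_file)

print(lowest_reward(rewards))  # ('correctness', 0.75)
```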
Comparing verifiers
You can pass multiple test directories to compare different verifier designs side by side:
rewardkit /tests/v1 /tests/v2
This runs each directory independently and prints a comparison table for overlapping reward names:
Comparison:
------------------------------------------
reward                v1       v2     diff
------------------------------------------
correctness       0.7500   0.9000  -0.1500
structure         1.0000   1.0000  +0.0000
------------------------------------------
The combined output is written to reward.json with namespaced keys like "v1/correctness" and "v2/correctness".
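The comparison logic can be approximated in a few lines. This is a sketch, not rewardkit's code: the helper names are hypothetical, the sign of the diff (first directory minus second) is inferred from the sample table, and the extra quality reward in v2 is made-up data added to show that only overlapping names are compared.

```python
def compare_rewards(first: dict[str, float], second: dict[str, float]) -> dict[str, float]:
    """Return first minus second for every reward name present in both runs."""
    shared = sorted(first.keys() & second.keys())
    return {name: first[name] - second[name] for name in shared}


def namespaced(runs: dict[str, dict[str, float]]) -> dict[str, float]:
    """Flatten {label: {reward: score}} into {"label/reward": score}."""
    return {
        f"{label}/{name}": score
        for label, scores in runs.items()
        for name, score in scores.items()
    }


v1 = {"correctness": 0.75, "structure": 1.0}
v2 = {"correctness": 0.90, "structure": 1.0, "quality": 0.6}

# Only names present in both runs get a row.
diffs = compare_rewards(v1, v2)
for name, diff in diffs.items():
    print(f"{name:<15}{v1[name]:>9.4f}{v2[name]:>9.4f}{diff:>+9.4f}")

# Every reward from every run appears in the combined mapping.
combined = namespaced({"v1": v1, "v2": v2})
```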
CLI
rewardkit <tests_dirs...> \
  --workspace /app \
  --output /logs/verifier/reward.json \
  --max-concurrent-programmatic 8 \
  --max-concurrent-llm 8 \
  --max-concurrent-agent 2
All flags are optional with sensible defaults. Passing multiple test directories runs each independently and prints a comparison table.
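When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below only builds and prints the command using flags listed above. build_rewardkit_command is a hypothetical helper, and its default values simply mirror the paths shown in this section rather than confirmed CLI defaults.

```python
import shlex
from typing import Optional


def build_rewardkit_command(
    tests_dirs: list[str],
    workspace: str = "/app",
    output: str = "/logs/verifier/reward.json",
    max_concurrent_llm: Optional[int] = None,
) -> list[str]:
    """Assemble a rewardkit argument list from the flags documented above."""
    command = ["rewardkit", *tests_dirs, "--workspace", workspace, "--output", output]
    if max_concurrent_llm is not None:
        command += ["--max-concurrent-llm", str(max_concurrent_llm)]
    return command


command = build_rewardkit_command(["/tests/v1", "/tests/v2"], max_concurrent_llm=4)
print(shlex.join(command))
```

On a machine where rewardkit is installed, the resulting list can be handed to subprocess.run(command, check=True).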
Provider routing
Judges call LiteLLM, which reads credentials from host environment variables. A few flags help you avoid pinning a rubric to one provider:
- --je KEY=VALUE sets env vars for the run (repeatable). Same shape as Harbor's --ve.
- --judge MODEL_OR_AGENT overrides the rubric's [judge].judge field. You can also set REWARDKIT_JUDGE in the environment, so Harbor users can pass --ve REWARDKIT_JUDGE=... to do the same thing.
- --model MODEL overrides the rubric's [judge].model field when the judge is an agent (e.g. claude-code, codex). Equivalent to setting REWARDKIT_MODEL; Harbor users can pass --ve REWARDKIT_MODEL=....
rewardkit /tests \
  --judge bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0 \
  --je AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  --je AWS_REGION_NAME=us-east-1

# Agent judge with an overridden model
rewardkit /tests \
  --judge claude-code \
  --model anthropic/claude-sonnet-4-6
The LiteLLM provider docs list the env var configurations for each provider.
Subscription auth
For Anthropic LLM judge models, a Claude subscription token is used when it is the only Anthropic credential present. Create one with claude setup-token and set CLAUDE_CODE_OAUTH_TOKEN. When both ANTHROPIC_API_KEY and the token are set, the API key has priority; set REWARDKIT_FORCE_OAUTH=1 to prefer the subscription token instead.
For the codex agent judge, set CODEX_ACCESS_TOKEN to a ChatGPT access token (created by Business/Enterprise workspace admins at chatgpt.com/admin/access-tokens). Rewardkit then logs the codex CLI in with it before grading. As with Anthropic, OPENAI_API_KEY has priority when both are set; set REWARDKIT_FORCE_OAUTH=1 to prefer the access token instead.
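The Anthropic priority rules above can be written out as a small decision function. This sketch only restates the documented behavior. pick_anthropic_credential and its return labels are hypothetical, and it takes a plain dict instead of reading os.environ so that it is easy to test.

```python
from typing import Mapping, Optional


def pick_anthropic_credential(env: Mapping[str, str]) -> Optional[str]:
    """Return which credential would be used: 'api_key', 'oauth_token', or None."""
    api_key = env.get("ANTHROPIC_API_KEY")
    token = env.get("CLAUDE_CODE_OAUTH_TOKEN")
    force_oauth = env.get("REWARDKIT_FORCE_OAUTH") == "1"

    # The token wins when it is the only credential, or when it is forced.
    if token and (force_oauth or not api_key):
        return "oauth_token"
    if api_key:
        return "api_key"
    return None


both = {"ANTHROPIC_API_KEY": "key", "CLAUDE_CODE_OAUTH_TOKEN": "token"}
print(pick_anthropic_credential(both))  # api_key
print(pick_anthropic_credential({**both, "REWARDKIT_FORCE_OAUTH": "1"}))  # oauth_token
```

The same shape would apply to the codex judge, with OPENAI_API_KEY and CODEX_ACCESS_TOKEN in place of the Anthropic variables.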
Python API
import rewardkit as rk
# Run and get scores
scores = rk.run("/tests", workspace="/app")
# Inspect discovered rewards without running them
rewards = rk.discover("/tests", workspace="/app")