Reward Kit

Agent skill available

Want your coding agent to help design verifiers with Reward Kit? Install the rewardkit skill:

npx skills add harbor-framework/harbor --skill rewardkit

Rewardkit lets you design verifiers against an agent's workspace and trajectory. It writes scores to a JSON file and has zero external dependencies. Verifiers are defined by directory structure: each folder maps to a reward, which makes them easy to read, share, and extend. All criteria are evaluated in parallel by default. You can define criteria in two ways:

Programmatic: Python functions that check files, run commands, or inspect outputs
Judge-based: LLM- or Agent-as-a-Judge evaluation configured via reusable TOML files

Standalone package

Rewardkit is an independent Python package. You can use it with Harbor or on its own.

Installation

uv tool install harbor-rewardkit

For criteria or judges that read images or common document files (PDF, DOCX, PPTX, XLSX), install the extras:

uv tool install "harbor-rewardkit[all]"

Using with Harbor

Harbor copies your task's tests/ directory into the container at /tests/ and runs test.sh. Put your criteria files alongside test.sh:

tests/
├── checks.py
├── judge.toml
└── test.sh

tests/test.sh

#!/bin/bash
uvx --with harbor-rewardkit@0.1 rewardkit /tests

This discovers all criteria in /tests/, runs them against the workspace at /app, and writes the result to /logs/verifier/reward.json:

/logs/verifier/reward.json

{ "reward": 0.75 }

All defaults match Harbor's conventions.
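
If a later step needs the score, the output file can be read back with the standard library. This is a minimal sketch that assumes only the JSON shape shown above; the helper name `load_rewards` is hypothetical and not part of Rewardkit:

```python
import json
from pathlib import Path


def load_rewards(path: str) -> dict[str, float]:
    """Parse a reward.json file into a dict of reward name -> score."""
    data = json.loads(Path(path).read_text())
    # Coerce values to float so integer scores such as 1 compare cleanly.
    return {name: float(score) for name, score in data.items()}
```

Calling `load_rewards("/logs/verifier/reward.json")` on the file above would give `{"reward": 0.75}`.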

If your judge criteria need API keys, pass them through [verifier.env] in task.toml:

task.toml

[verifier]
timeout_sec = 300.0

[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"

There's a complete working example you can use as a starting point.

To swap providers without editing the rubric, see Provider routing.

Built-in criteria

Many checks commonly used in popular benchmarks are already built in. Call them from a Python file in your tests directory:

tests/check.py

import rewardkit as rk

rk.file_exists("output.txt")
rk.file_contains("output.txt", "hello")
rk.command_succeeds("python main.py")

There are 20+ built-in criteria covering common checks for files, commands, JSON, CSV, HTTP, images, and agent trajectories. See Built-in Criteria for the full list.

Custom criteria

When you need logic specific to your task, define your own with the @criterion decorator. The first parameter is always workspace: Path, and the function returns a bool or float. Zero-parameter criteria are registered automatically:

tests/check.py

import rewardkit as rk
from pathlib import Path
from rewardkit import criterion

@criterion
def has_valid_output(workspace: Path) -> bool:
    output = (workspace / "output.txt").read_text()
    return len(output.splitlines()) >= 10

Criteria can also take custom arguments. These need to be called explicitly through rk to register:

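tests/check.py

import rewardkit as rk
from pathlib import Path
from rewardkit import criterion

@criterion(description="output has at least {n} lines")
def has_n_lines(workspace: Path, n: int) -> bool:
    output = (workspace / "output.txt").read_text()
    return len(output.splitlines()) >= n

# Register the criterion twice, with different arguments and weights.
rk.has_n_lines(10, weight=2.0)
rk.has_n_lines(50, weight=1.0)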
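
Because a criterion may return a float, it can award partial credit. The sketch below shows one way such a function body could look. It is written without the decorator so that it runs on its own; in a tests file it would presumably carry @criterion like the examples above. The function name and `EXPECTED_FILES` are hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical list of files the task is expected to produce.
EXPECTED_FILES = ["output.txt", "report.md", "summary.json"]


def fraction_of_files_present(workspace: Path) -> float:
    """Return the share of expected files that exist, between 0.0 and 1.0."""
    present = sum((workspace / name).is_file() for name in EXPECTED_FILES)
    return present / len(EXPECTED_FILES)


# Local check against a throwaway directory standing in for the workspace.
with tempfile.TemporaryDirectory() as tmp:
    demo_workspace = Path(tmp)
    (demo_workspace / "output.txt").write_text("hello\n")
    (demo_workspace / "report.md").write_text("# Report\n")
    demo_score = fraction_of_files_present(demo_workspace)

print(f"{demo_score:.2f}")  # two of three files exist: 0.67
```

Trying the function against a temporary directory like this is a cheap way to catch path mistakes before running the full verifier.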

Judge criteria

Some things are hard to check with code, such as code quality, edge-case handling, and readability. For these, you can have an LLM evaluate the workspace by creating a TOML file:

tests/quality.toml

[judge]
judge = "anthropic/claude-sonnet-4-6"
files = ["/app/main.py"]

[[criterion]]
description = "Is the code correct?"
type = "binary"

[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5

Both .py and .toml files can coexist in the same directory; their scores are averaged together into one reward.

See Judge Criteria for the full TOML configuration reference including agent judges and trajectory evaluation.

Multi-reward tasks

When you want separate scores for different dimensions (correctness, structure, quality), organize criteria into subdirectories. Each subdirectory becomes its own reward:

tests/
├── test.sh
├── correctness/
│   └── check.py
├── structure/
│   └── files_exist.py
└── quality/
    └── quality.toml

This produces separate scores:

/logs/verifier/reward.json

{ "correctness": 0.75, "structure": 1.0, "quality": 0.6 }

To reuse custom criteria across subdirectories, define them in a root-level file with shared=True:

tests/criteria.py

from pathlib import Path
from rewardkit import criterion

@criterion(shared=True)
def word_count_correct(workspace: Path) -> float:
    # test logic ...
    return score

tests/correctness/check.py

import rewardkit as rk

rk.word_count_correct(weight=3.0)

Weights and scoring

Every criterion accepts an optional weight (default 1.0):

rk.file_exists("output.txt", weight=3.0)
rk.file_exists("readme.md", weight=1.0)

All criteria in a reward directory are aggregated using a weighted average. Judge criteria are first aggregated per judge; you can control this with the [scoring] section in the TOML file:

tests/quality.toml

[judge]
judge = "anthropic/claude-sonnet-4-6"
files = ["/app/main.py"]
weight = 2.0  # weight of this judge in the overall reward (default 1.0)

[[criterion]]
description = "Is the code correct?"
type = "binary"
weight = 2.0

[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5

[scoring]
aggregation = "all_pass"  # weighted_mean | all_pass | any_pass | threshold
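
To make the weighted average concrete, here is a small worked sketch. It assumes the usual definition, sum(score × weight) / sum(weight); Rewardkit's own implementation may handle edge cases such as zero weights differently, and the helper name `weighted_mean` is hypothetical.

```python
def weighted_mean(results: list[tuple[float, float]]) -> float:
    """Combine (score, weight) pairs into one weighted average."""
    total_weight = sum(weight for _, weight in results)
    if total_weight == 0:
        raise ValueError("at least one criterion needs a non-zero weight")
    return sum(score * weight for score, weight in results) / total_weight


# output.txt exists (score 1.0, weight 3.0); readme.md is missing (score 0.0, weight 1.0).
reward = weighted_mean([(1.0, 3.0), (0.0, 1.0)])
print(reward)  # 0.75
```

With the two file_exists checks from the start of this section, a missing readme.md costs a quarter of the reward, while a missing output.txt would cost three quarters.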

Isolation

Some criteria modify the workspace by running commands that make edits, create files, etc. To prevent one criterion from affecting another, you can run it in isolation. This uses overlayfs to mount the workspace as read-only with writes going to a temporary layer that gets discarded after the criterion completes.

For programmatic criteria, pass isolated=True:

rk.command_succeeds("python main.py", isolated=True)

For agent judges, set isolated = true in the [judge] section:

[judge]
judge = "claude-code"
isolated = true
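
The effect of isolation can be illustrated without overlayfs. The sketch below swaps in a plain directory copy, which is slower but shows the same guarantee: whatever the command writes lands in a scratch area that is discarded, and the original workspace is untouched. It is not how Rewardkit implements isolation, and `run_in_scratch_copy` is a hypothetical helper.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_scratch_copy(workspace: Path, command: list[str]) -> int:
    """Run a command against a throwaway copy of the workspace.

    Copy-based stand-in for the overlayfs approach: the original directory
    is never modified, and the copy is deleted when the command finishes.
    """
    with tempfile.TemporaryDirectory() as tmp:
        scratch = Path(tmp) / "workspace"
        shutil.copytree(workspace, scratch)
        completed = subprocess.run(command, cwd=scratch)
        return completed.returncode


# Demo: a script that creates a file as a side effect.
with tempfile.TemporaryDirectory() as tmp:
    demo_ws = Path(tmp)
    (demo_ws / "main.py").write_text("open('side_effect.txt', 'w').write('x')\n")
    exit_code = run_in_scratch_copy(demo_ws, [sys.executable, "main.py"])
    leaked = (demo_ws / "side_effect.txt").exists()

print(exit_code, leaked)
```

The script succeeds, but side_effect.txt never appears in the original directory, which is the property that keeps one criterion from affecting another.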

Output

Rewardkit writes two files side by side:

reward.json — per-reward scores, e.g. {"correctness": 0.75, "quality": 0.9}
reward-details.json — per-criterion results including individual scores, judge reasoning, and any errors. harbor view renders this under Verifier Logs → Rewards as a collapsible tree.

Comparing verifiers

You can pass multiple test directories to compare different verifier designs side by side:

rewardkit /tests/v1 /tests/v2

This runs each directory independently and prints a comparison table for overlapping reward names:

Comparison:
------------------------------------------
reward          v1      v2    diff
------------------------------------------
correctness  0.7500  0.9000  -0.1500
structure    1.0000  1.0000  +0.0000
------------------------------------------

The combined output is written to reward.json with namespaced keys like "v1/correctness" and "v2/correctness".
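
The namespaced keys make the combined file easy to post-process. The following sketch recomputes the diff column from such a file, assuming keys of the form "<directory>/<reward>" as described above and the v1-minus-v2 sign convention suggested by the sample table. The helper name `compare_namespaced` is hypothetical.

```python
import json
from collections import defaultdict


def compare_namespaced(rewards: dict[str, float], left: str, right: str) -> dict[str, float]:
    """Return left-minus-right differences for rewards present under both namespaces."""
    by_reward: dict[str, dict[str, float]] = defaultdict(dict)
    for key, score in rewards.items():
        # "v1/correctness" -> namespace "v1", reward name "correctness"
        namespace, _, name = key.partition("/")
        by_reward[name][namespace] = score
    return {
        name: round(scores[left] - scores[right], 4)
        for name, scores in by_reward.items()
        if left in scores and right in scores
    }


combined = json.loads(
    '{"v1/correctness": 0.75, "v2/correctness": 0.9,'
    ' "v1/structure": 1.0, "v2/structure": 1.0}'
)
diffs = compare_namespaced(combined, "v1", "v2")
print(diffs)  # {'correctness': -0.15, 'structure': 0.0}
```

Rewards that appear under only one of the two directories are skipped, mirroring the table's restriction to overlapping reward names.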

CLI

rewardkit <tests_dirs...> \
    --workspace /app \
    --output /logs/verifier/reward.json \
    --max-concurrent-programmatic 8 \
    --max-concurrent-llm 8 \
    --max-concurrent-agent 2

All flags are optional with sensible defaults. Passing multiple test directories runs each independently and prints a comparison table.

Provider routing

Judges call LiteLLM, which reads credentials from host environment variables. A few flags help you avoid pinning a rubric to one provider:

--je KEY=VALUE sets env vars for the run (repeatable). Same shape as Harbor's --ve.
--judge MODEL_OR_AGENT overrides the rubric's [judge].judge field. This can also be done by setting REWARDKIT_JUDGE in the environment, so Harbor users can pass --ve REWARDKIT_JUDGE=... to do the same thing.
--model MODEL overrides the rubric's [judge].model field when the judge is an agent (e.g. claude-code, codex). Equivalent to setting REWARDKIT_MODEL; Harbor users can pass --ve REWARDKIT_MODEL=....

rewardkit /tests \
  --judge bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0 \
  --je AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  --je AWS_REGION_NAME=us-east-1

# Agent judge with an overridden model
rewardkit /tests \
  --judge claude-code \
  --model anthropic/claude-sonnet-4-6

The LiteLLM provider docs list the env var configurations for each provider.

Subscription auth

For Anthropic LLM judge models, a Claude subscription token is used when it is the only Anthropic credential present. Create one with claude setup-token and set CLAUDE_CODE_OAUTH_TOKEN. When both ANTHROPIC_API_KEY and the token are set, the API key has priority; set REWARDKIT_FORCE_OAUTH=1 to prefer the subscription token instead.

For the codex agent judge, set CODEX_ACCESS_TOKEN to a ChatGPT access token (created by Business/Enterprise workspace admins at chatgpt.com/admin/access-tokens). Rewardkit uses it to log the CLI in before grading. As with Anthropic, OPENAI_API_KEY has priority when both are set; set REWARDKIT_FORCE_OAUTH=1 to prefer the access token instead.
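
The priority rules above can be summarized as a small decision function. This is only a sketch of the documented behavior, not Rewardkit's actual code; `pick_credential` is a hypothetical name, and it takes the environment as a plain dict so the rules can be checked without touching real credentials.

```python
from typing import Optional


def pick_credential(env: dict[str, str], api_key_var: str, token_var: str) -> Optional[str]:
    """Return the name of the variable that would be used, or None if neither is set."""
    has_key = bool(env.get(api_key_var))
    has_token = bool(env.get(token_var))
    force_oauth = env.get("REWARDKIT_FORCE_OAUTH") == "1"
    if has_key and has_token:
        # Both present: the API key wins unless the override is set.
        return token_var if force_oauth else api_key_var
    if has_key:
        return api_key_var
    if has_token:
        return token_var
    return None


both = {"ANTHROPIC_API_KEY": "key", "CLAUDE_CODE_OAUTH_TOKEN": "token"}
print(pick_credential(both, "ANTHROPIC_API_KEY", "CLAUDE_CODE_OAUTH_TOKEN"))
# ANTHROPIC_API_KEY

forced = {**both, "REWARDKIT_FORCE_OAUTH": "1"}
print(pick_credential(forced, "ANTHROPIC_API_KEY", "CLAUDE_CODE_OAUTH_TOKEN"))
# CLAUDE_CODE_OAUTH_TOKEN
```

The same function describes the codex case when called with OPENAI_API_KEY and CODEX_ACCESS_TOKEN.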

Python API

import rewardkit as rk

# Run and get scores
scores = rk.run("/tests", workspace="/app")

# Inspect discovered rewards without running them
rewards = rk.discover("/tests", workspace="/app")
