
Reward Kit

Motivation & Design

Why Reward Kit exists and its design

Motivation

When creating new tasks for Harbor, writing the verifier is often the most tedious part. Surveying verifier implementations across popular benchmarks, we found that they are often complicated scripts that are hard to read, extend, and reuse across tasks. This leads to boilerplate being copy-pasted between tasks, which in turn introduces subtle errors into verifiers during evals.

Reward Kit solves this. It packages common patterns into reusable components that can be shared across tasks, eliminating boilerplate while supporting partial credit and LLM/agent-as-a-judge evaluation. Because independent reward components are defined through the directory structure, verifiers are easy to understand at a glance without reading any code. Evaluating multiple criteria in parallel and in isolation is natively supported.
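To make the partial-credit idea concrete, here is a minimal, library-agnostic sketch; none of these function or variable names come from Reward Kit's actual API. Each criterion independently maps the agent's output to a score in [0, 1], and the final reward is a weighted mean, so a task can pass some checks while failing others instead of collapsing to all-or-nothing.

```python
# Hypothetical sketch of partial-credit scoring; not Reward Kit's API.
from typing import Callable

# A criterion maps an agent's output to a score in [0, 1].
Criterion = Callable[[str], float]

def mentions(keyword: str) -> Criterion:
    """Full credit if the keyword appears in the output."""
    return lambda output: 1.0 if keyword in output else 0.0

def length_at_least(n: int) -> Criterion:
    """Linear partial credit up to n characters."""
    return lambda output: min(len(output) / n, 1.0)

def combine(criteria: list[tuple[Criterion, float]], output: str) -> float:
    """Weighted mean of independent criterion scores."""
    total = sum(w for _, w in criteria)
    return sum(c(output) * w for c, w in criteria) / total

criteria = [(mentions("result"), 2.0), (length_at_least(100), 1.0)]
reward = combine(criteria, "result: 42")
# reward ≈ 0.7: weighted mean of 1.0 (keyword found) and 0.1 (10/100 chars)
```

Because each criterion is a self-contained function, the same list can be evaluated in parallel and the individual scores reported alongside the aggregate.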

Design principles

  • Simplicity: Verifiers are defined by directory structure, making them easy to read at a glance. Common checks are one-liners. Custom logic is a decorated function. Judges are reusable TOML files.
  • Reuse and sharing: Reward criteria are plain files in a directory. Share them across tasks and version them in git.
  • Zero boilerplate: 20+ built-in criteria for files, commands, JSON, CSV, HTTP, images, and trajectories. LLM- and agent-as-a-judge evaluation is supported natively.
  • Isolation: Criteria can be run in isolated filesystem snapshots so they can't interfere with each other or corrupt the agent's workspace.
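The isolation principle can be approximated with standard-library tools alone. The sketch below is illustrative, not Reward Kit's actual mechanism: each criterion runs against its own throwaway copy of the workspace, so even a check that deletes files cannot affect sibling checks or the agent's real workspace.

```python
# Sketch of snapshot isolation using only the standard library;
# Reward Kit's real implementation may differ.
import shutil
import tempfile
from pathlib import Path

def run_isolated(workspace: Path, check) -> float:
    """Run `check` against a temporary copy of the workspace."""
    with tempfile.TemporaryDirectory() as tmp:
        snapshot = Path(tmp) / "snapshot"
        shutil.copytree(workspace, snapshot)
        return check(snapshot)

# Example: a destructive check cannot corrupt the real workspace.
ws = Path(tempfile.mkdtemp())
(ws / "answer.txt").write_text("42")

def destructive_check(root: Path) -> float:
    score = 1.0 if (root / "answer.txt").read_text() == "42" else 0.0
    (root / "answer.txt").unlink()  # deletes only the snapshot copy
    return score

score = run_isolated(ws, destructive_check)
# The original ws / "answer.txt" still exists afterwards.
```

Copying the workspace per criterion trades disk I/O for safety; it also makes criteria order-independent, which is what allows them to run in parallel.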
