Running a dataset

What is a dataset?

In Harbor, a dataset is a collection of Harbor tasks. A task includes an instruction, environment, and test script. Use datasets to evaluate, train, or tune prompts.
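To make the three parts of a task concrete, here is a minimal sketch in Python. This is an illustration of the concepts only, not Harbor's actual API; the class and field names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical model of a Harbor task -- not Harbor's real Python API.
# Per the description above, a task bundles an instruction, an
# environment, and a test script used by the verifier.
@dataclass
class Task:
    instruction: str   # what the agent is asked to accomplish
    environment: str   # e.g., a container image or environment spec
    test_script: str   # verifier script, e.g., the contents of test.sh

# A dataset is simply a named collection of such tasks.
@dataclass
class Dataset:
    name: str
    tasks: list[Task]

dataset = Dataset(
    name="example-org/example-dataset",
    tasks=[
        Task(
            instruction="Fix the failing unit test in the repo.",
            environment="ubuntu:24.04",
            test_script="#!/bin/bash\nexit 0\n",
        )
    ],
)
```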

Viewing registered benchmarks

Harbor resolves published datasets from its package registry. harbor run -d "<org>/<dataset>" uses the configured registry.

Use --registry-path or --registry-url only for legacy registry.json setups.

To list available datasets:

harbor dataset list

Running a benchmark

Terminal-Bench:

harbor run -d terminal-bench/terminal-bench-2 -m "<model>" -a "<agent>"

Harbor resolves package metadata and downloads task artifacts as needed.

SWE-Bench Verified:

harbor run -d swe-bench/swe-bench-verified -m "<model>" -a "<agent>"

Omit the tag to use the latest dataset.

For the dataset creation, sync, and publishing workflow, see the dataset documentation.

Running a local dataset

Run a local dataset with:

harbor run -p "<path/to/dataset>" -m "<model>" -a "<agent>"
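Before running a local dataset, it can be useful to sanity-check which task directories it contains. The sketch below is an assumption-laden illustration: it treats each subdirectory that contains a `test.sh` (the verifier script referenced in the job layout) as a task. Harbor's actual on-disk dataset layout may differ.

```python
import pathlib

def list_local_tasks(dataset_path: str) -> list[str]:
    """List candidate task directories in a local dataset.

    Sketch only: assumes each task is a subdirectory containing a
    test.sh verifier script. This layout is an assumption, not a
    documented Harbor contract.
    """
    root = pathlib.Path(dataset_path)
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_dir() and (p / "test.sh").exists()
    )
```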

Analyzing results

Running harbor run creates a job, which is stored in the jobs directory by default.

The file structure looks something like this:

jobs/job-name
├── config.json               # Job config
├── result.json               # Job result
├── trial-name
│   ├── config.json           # Trial config
│   ├── result.json           # Trial result
│   ├── agent                 # Agent directory, contents depend on agent implementation
│   │   ├── recording.cast
│   │   └── trajectory.json
│   └── verifier              # Verifier directory, contents depend on test.sh implementation
│       ├── ctrf.json
│       ├── reward.txt
│       ├── test-stderr.txt
│       └── test-stdout.txt
└── ...                       # More trials
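Because each trial writes its reward to verifier/reward.txt, you can aggregate results across a job with a few lines of Python. This is a minimal sketch against the layout shown above; it assumes each reward.txt contains a single numeric value.

```python
import pathlib

def collect_rewards(job_dir: str) -> dict[str, float]:
    """Map trial name -> reward for every trial under a job directory.

    Sketch based on the job layout above: looks for
    <job_dir>/<trial-name>/verifier/reward.txt and assumes the file
    holds one number.
    """
    rewards = {}
    for reward_file in pathlib.Path(job_dir).glob("*/verifier/reward.txt"):
        trial_name = reward_file.parent.parent.name
        rewards[trial_name] = float(reward_file.read_text().strip())
    return rewards
```

For example, `collect_rewards("jobs/job-name")` returns a dict you can average to get the job's mean reward.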

Using the viewer

Harbor includes a web-based results viewer for browsing jobs, inspecting trials, and analyzing agent trajectories. To launch it, point it at your jobs directory:

harbor view jobs

This starts a local web server (default http://127.0.0.1:8080) where you can:

  • Browse jobs — Filter and search by agent, model, dataset, and date range.
  • Inspect trials — View trial results, rewards, durations, and errors for each task.
  • View trajectories — Step through the agent's execution including tool calls, observations, and multimodal content (text and images).
  • Analyze performance — See token usage breakdowns, timing metrics (environment setup, agent execution, verification), and verifier output.
  • Compare jobs — Select multiple jobs to view a side-by-side comparison matrix of task performance across agent/model combinations.
  • View artifacts — Browse files collected from the sandbox after each trial (see Artifact Collection).
  • Generate summaries — Use AI-powered summarization to analyze job failures.

The viewer supports keyboard navigation (j/k to move between rows, Enter to open, Esc to deselect).

Option        Description
--port, -p    Port or port range (e.g., 8080 or 8080-8089). Default: 8080-8089
--host        Host to bind the server to. Default: 127.0.0.1
--dev         Run the frontend in development mode with hot reloading