Evals
Running a dataset
What is a dataset?
In Harbor, a dataset is a collection of Harbor tasks. A task includes an instruction, environment, and test script. Use datasets to evaluate, train, or tune prompts.
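Those three pieces can be checked for programmatically. The sketch below is a hypothetical illustration, not Harbor's actual on-disk task format: the names `instruction.md` and `environment/` are assumptions made for this example, while `test.sh` is the verifier script name used in the job layout later in this page.

```python
from pathlib import Path


def looks_like_task(task_dir: Path) -> bool:
    """Heuristically check that a directory holds the three pieces a task
    needs: an instruction, an environment definition, and a test script.
    File names are illustrative assumptions, except test.sh."""
    return (
        (task_dir / "instruction.md").is_file()  # the instruction (assumed name)
        and (task_dir / "environment").is_dir()  # environment definition (assumed name)
        and (task_dir / "test.sh").is_file()     # verifier test script
    )
```

A helper like this is useful when assembling a local dataset by hand, before handing the directory to Harbor.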
Viewing registered benchmarks
Harbor resolves published datasets from its package registry.
harbor run -d "<org>/<dataset>" uses the configured registry.
Use --registry-path or --registry-url only for legacy registry.json setups.
To list available datasets:
harbor dataset list
Running a benchmark
Terminal-Bench:
harbor run -d terminal-bench/terminal-bench-2 -m "<model>" -a "<agent>"
Harbor resolves package metadata and downloads task artifacts as needed.
SWE-Bench Verified:
harbor run -d swe-bench/swe-bench-verified -m "<model>" -a "<agent>"
Omit the tag to use the latest dataset.
For the dataset creation, sync, and publishing workflow, see the dataset documentation.
Running a local dataset
Run a local dataset with:
harbor run -p "<path/to/dataset>" -m "<model>" -a "<agent>"
Analyzing results
Running harbor run creates a job, which by default is stored in the jobs directory.
The file structure looks something like this:
jobs/job-name
├── config.json # Job config
├── result.json # Job result
├── trial-name
│ ├── config.json # Trial config
│ ├── result.json # Trial result
│ ├── agent # Agent directory, contents depend on agent implementation
│ │ ├── recording.cast
│ │ └── trajectory.json
│ └── verifier # Verifier directory, contents depend on test.sh implementation
│ ├── ctrf.json
│ ├── reward.txt
│ ├── test-stderr.txt
│ └── test-stdout.txt
└── ... # More trials
Using the viewer
Harbor includes a web-based results viewer for browsing jobs, inspecting trials, and analyzing agent trajectories. To launch it, point it at your jobs directory:
harbor view jobs
This starts a local web server (default http://127.0.0.1:8080) where you can:
- Browse jobs — Filter and search by agent, model, dataset, and date range.
- Inspect trials — View trial results, rewards, durations, and errors for each task.
- View trajectories — Step through the agent's execution including tool calls, observations, and multimodal content (text and images).
- Analyze performance — See token usage breakdowns, timing metrics (environment setup, agent execution, verification), and verifier output.
- Compare jobs — Select multiple jobs to view a side-by-side comparison matrix of task performance across agent/model combinations.
- View artifacts — Browse files collected from the sandbox after each trial (see Artifact Collection).
- Generate summaries — Use AI-powered summarization to analyze job failures.
The viewer supports keyboard navigation (j/k to move between rows, Enter to open, Esc to deselect).
| Option | Description |
|---|---|
| --port, -p | Port or port range (e.g., 8080 or 8080-8089). Default: 8080-8089 |
| --host | Host to bind the server to. Default: 127.0.0.1 |
| --dev | Run frontend in development mode with hot reloading |