Registry

Browse the datasets available in the Harbor registry.

uvx harbor datasets list
terminal-bench
v2.0
Version 2.0 of Terminal-Bench, a benchmark for testing agents in terminal environments, with more, harder, and higher-quality tasks than 1.0.
uvx harbor run -d terminal-bench@2.0

89 tasks
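Every entry in the registry follows the same run pattern. As a sketch drawn only from the commands shown on this page (the dataset names, versions, and the `-e modal` flag all come from the entries themselves):

```shell
# List every dataset available in the registry
uvx harbor datasets list

# Run a dataset pinned to a version with -d <name>@<version>
uvx harbor run -d terminal-bench@2.0

# Datasets built for a specific environment say so in their
# description; e.g. the Modal splits below expect -e modal
uvx harbor run -d featurebench-lite-modal@1.0 -e modal
```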

swebench-verified
v1.0
A human-validated subset of 500 SWE-bench tasks.
uvx harbor run -d swebench-verified@1.0

500 tasks

researchcodebench
v1.0
ResearchCodeBench evaluates AI agents' ability to implement algorithms from academic papers. Contains 212 code implementation tasks across 20 ML/AI research problems from top-tier venues (ICLR, NeurIPS, CVPR, COLM). Tests paper comprehension, algorithm understanding, and precise code implementation skills with 1,449 lines of reference code.
uvx harbor run -d researchcodebench@1.0

212 tasks

gso
v1.0
GSO: 102 software optimization tasks focusing on performance improvement.
uvx harbor run -d gso@1.0

102 tasks

cooperbench
v1.0
CooperBench: multi-agent cooperation benchmark. 652 feature pairs across 12 repos requiring two agents to coordinate via messaging.
uvx harbor run -d cooperbench@1.0

652 tasks

legacy-bench
v1.0
A benchmark for evaluating AI agents on legacy code maintenance and modernization tasks across multiple language families including COBOL, Java 7, C, Fortran, and Assembly.
uvx harbor run -d legacy-bench@1.0

10 tasks

scale-ai/swe-atlas-tw
v1.0
SWE-Atlas Test Writing benchmark that evaluates AI agents' ability to write comprehensive unit tests.
uvx harbor run -d scale-ai/swe-atlas-tw@1.0

90 tasks

scale-ai/swe-atlas-qna
v1.0
SWE-Atlas Codebase QnA benchmark that evaluates AI agents' ability to comprehend and query existing codebases.
uvx harbor run -d scale-ai/swe-atlas-qna@1.0

124 tasks

swebench_multilingual
v1.0
SWE-bench Multilingual extends the original Python-focused SWE-bench benchmark to support multiple programming languages. Original benchmark: https://www.swebench.com/multilingual.html.
uvx harbor run -d swebench_multilingual@1.0

300 tasks

spreadsheetbench-verified
v1.0
A benchmark evaluating AI agents on real-world spreadsheet manipulation tasks (400 tasks from verified_400). Tasks involve Excel file manipulation including formula writing, data transformation, formatting, and conditional logic.
uvx harbor run -d spreadsheetbench-verified@1.0

400 tasks

pixiu
vparity
PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance. Original benchmark: https://github.com/TheFinAI/PIXIU. Adapter details: https://github.com/laude-institute/harbor/tree/main/adapters/pixiu. Total tasks: 435 across 29 financial NLP datasets.
uvx harbor run -d pixiu@parity

435 tasks

rexbench
v1.0
A benchmark to evaluate the ability of AI agents to extend existing AI research through research experiment implementation tasks. Original benchmark: https://github.com/tinlaboratory/rexbench. Website: https://rexbench.com/.
uvx harbor run -d rexbench@1.0

2 tasks

ml-dev-bench
v1.0
ML-Dev-Bench: A benchmark for testing AI agents on machine learning development tasks including model implementation, training, debugging, and optimization.
uvx harbor run -d ml-dev-bench@1.0

33 tasks

featurebench
v1.0
FeatureBench full split: 200 feature-implementation tasks across 24 Python repos. 7 tasks require Ampere+ GPU. Original benchmark: https://github.com/LiberCoders/FeatureBench. Adapter: https://github.com/harbor-framework/harbor/pull/875.
uvx harbor run -d featurebench@1.0

200 tasks

featurebench-lite-modal
v1.0
FeatureBench lite split for Modal: 30 feature-implementation tasks with gpus=1 for GPU tasks (7/30). Use with -e modal. Original benchmark: https://github.com/LiberCoders/FeatureBench.
uvx harbor run -d featurebench-lite-modal@1.0

30 tasks

featurebench-modal
v1.0
FeatureBench full split for Modal: 200 feature-implementation tasks with gpus=1 for GPU tasks (44/200). Use with -e modal. Original benchmark: https://github.com/LiberCoders/FeatureBench.
uvx harbor run -d featurebench-modal@1.0

200 tasks

featurebench-lite
v1.0
FeatureBench lite split: 30 feature-implementation tasks (26 lv1 + 4 lv2) across Python repos. Original benchmark: https://github.com/LiberCoders/FeatureBench. Adapter: https://github.com/harbor-framework/harbor/pull/875.
uvx harbor run -d featurebench-lite@1.0

30 tasks

ade-bench
v1.0
Analytics Data Engineer Bench: tasks evaluating AI agents on dbt/SQL data analytics engineering bugs. Original benchmark: https://github.com/dbt-labs/ade-bench.
uvx harbor run -d ade-bench@1.0

48 tasks

medagentbench
v1.0
MedAgentBench: 300 patient-specific clinically-derived tasks across 10 categories in a FHIR-compliant interactive healthcare environment.
uvx harbor run -d medagentbench@1.0

300 tasks

bigcodebench-hard-complete
v1.0.0
BigCodeBench-Hard (complete split) adapter for Harbor: challenging Python programming tasks with reward-based verification.
uvx harbor run -d bigcodebench-hard-complete@1.0.0

145 tasks

deveval
v1.0
DevEval benchmark: a comprehensive evaluation of LLMs across the software development lifecycle (implementation, unit testing, acceptance testing) on 21 real-world repositories spanning Python, C++, Java, and JavaScript.
uvx harbor run -d deveval@1.0

63 tasks

quixbugs
v1.0
QuixBugs is a multi-lingual program repair benchmark with 40 Python and 40 Java programs, each containing a single-line defect. Tasks cover algorithms and data structures including sorting, graph, dynamic programming, math, and string/array operations.
uvx harbor run -d quixbugs@1.0

80 tasks

qcircuitbench
v1.0
QCircuitBench evaluates agents on quantum algorithm design using quantum programming languages. Original benchmark: https://github.com/EstelYang/QCircuitBench.
uvx harbor run -d qcircuitbench@1.0

28 tasks

bfcl_parity
v1.0
BFCL parity subset: 123 stratified sampled tasks for validating Harbor adapter equivalence with original BFCL benchmark.
uvx harbor run -d bfcl_parity@1.0

123 tasks

bfcl
v1.0
Berkeley Function-Calling Leaderboard: 3,641 function calling tasks for evaluating LLM tool use capabilities across simple, multiple, parallel, and irrelevance categories.
uvx harbor run -d bfcl@1.0

3641 tasks

labbench
v1.0
LAB-Bench FigQA: 181 scientific figure reasoning tasks in biology from Future-House LAB-Bench.
uvx harbor run -d labbench@1.0

181 tasks

satbench
v1.0
SATBench is a benchmark for evaluating the logical reasoning capabilities of LLMs through logical puzzles derived from Boolean satisfiability (SAT) problems.
uvx harbor run -d satbench@1.0

2100 tasks

financeagent
vpublic
Finance Agent is a tool for financial research and analysis that leverages large language models and specialized financial tools to answer complex queries about companies, financial statements, and SEC filings. Original benchmark: https://github.com/vals-ai/finance-agent
uvx harbor run -d financeagent@public

50 tasks

bird-bench
vparity
BIRD SQL parity subset (150 tasks, seed 42). Original benchmark: https://huggingface.co/datasets/birdsql/bird_sql_dev_20251106. Adapter: https://github.com/laude-institute/harbor/tree/main/adapters/bird-bench.
uvx harbor run -d bird-bench@parity

150 tasks

kumo
v1.0
KUMO full dataset (5300 tasks; 50 instances per scenario).
uvx harbor run -d kumo@1.0

5300 tasks

kumo
vhard
KUMO (hard) split (250 tasks; 50 instances per scenario).
uvx harbor run -d kumo@hard

250 tasks

kumo
veasy
KUMO (easy) split (5050 tasks; 50 instances per scenario).
uvx harbor run -d kumo@easy

5050 tasks

kumo
vparity
KUMO parity subset (seeds 0/1; 212 tasks).
uvx harbor run -d kumo@parity

212 tasks

gaia
v1.0
GAIA (General AI Assistants): 165 validation tasks for multi-step reasoning, tool use, and multimodal question answering.
uvx harbor run -d gaia@1.0

165 tasks

simpleqa
v1.0
SimpleQA: 4,326 short, fact-seeking questions from OpenAI for evaluating language model factuality. Uses LLM-as-a-judge grading. Source: https://openai.com/index/introducing-simpleqa/
uvx harbor run -d simpleqa@1.0

4326 tasks

termigen-environments
v1.0
3,500+ verified Docker environments for training and evaluating terminal agents, spanning 11 task categories across infrastructure, data/algorithm applications, and specialized domains including software build, system administration, security, data processing, ML/MLOps, algorithms, scientific computing, and more.
uvx harbor run -d termigen-environments@1.0

3566 tasks

openthoughts-tblite
v2.0
OpenThoughts-TBLite: A difficulty-calibrated benchmark of 100 tasks for building terminal agents. By the OpenThoughts Agent team, Snorkel AI, and Bespoke Labs.
uvx harbor run -d openthoughts-tblite@2.0

100 tasks

dabstep
v1.0
DABstep: Data Agent Benchmark for Multi-step Reasoning. 450 tasks where agents analyze payment transaction data with Python/pandas to answer business questions.
uvx harbor run -d dabstep@1.0

450 tasks

code-contests
v1.0
A competitive programming benchmark from DeepMind that evaluates AI agents' ability to solve algorithmic problems, covering algorithms, data structures, and competitive programming challenges.
uvx harbor run -d code-contests@1.0

9644 tasks

binary-audit
v1.0
An open-source benchmark for evaluating AI agents' ability to find backdoors hidden in compiled binaries.
uvx harbor run -d binary-audit@1.0

46 tasks

otel-bench
v1.0
OpenTelemetry Benchmark - evaluates AI agents' ability to instrument applications with OpenTelemetry tracing across multiple languages.
uvx harbor run -d otel-bench@1.0

26 tasks

seta-env
v1.0
CAMEL SETA Environment for RL training.
uvx harbor run -d seta-env@1.0

1376 tasks

vmax-tasks
v1.0
A collection of 1,043 validated real-world bug-fixing tasks from popular open-source JavaScript projects including Vue.js, Docusaurus, Redux, and Chalk. Each task presents an authentic bug report with reproduction steps and expected behavior.
uvx harbor run -d vmax-tasks@1.0

1043 tasks

mmmlu
vparity
MMMLU (Multilingual MMLU) parity validation subset with 10 tasks per language across 15 languages (150 tasks total). Evaluates language models' subject knowledge and reasoning across multiple languages using multiple-choice questions covering 57 academic subjects.
uvx harbor run -d mmmlu@parity

150 tasks

swe-gen-js
v1.0
SWE-gen-JS: 1000 JavaScript/TypeScript bug fix tasks from 30 open-source GitHub repos, generated using SWE-gen.
uvx harbor run -d swe-gen-js@1.0

1000 tasks

reasoning-gym-easy
vparity
Reasoning Gym benchmark (easy difficulty). Original benchmark: https://github.com/open-thought/reasoning-gym
uvx harbor run -d reasoning-gym-easy@parity

288 tasks

reasoning-gym-hard
vparity
Reasoning Gym benchmark (hard difficulty). Original benchmark: https://github.com/open-thought/reasoning-gym
uvx harbor run -d reasoning-gym-hard@parity

288 tasks

terminal-bench-sample
v2.0
A sample of tasks from Terminal-Bench 2.0.
uvx harbor run -d terminal-bench-sample@2.0

10 tasks

swe-lancer-diamond
vmanager
Adapter for SWE-Lancer (https://github.com/openai/preparedness/blob/main/project/swelancer/README.md). Only the manager tasks.
uvx harbor run -d swe-lancer-diamond@manager

265 tasks

swe-lancer-diamond
vic
Adapter for SWE-Lancer (https://github.com/openai/preparedness/blob/main/project/swelancer/README.md). Only the individual contributor SWE tasks.
uvx harbor run -d swe-lancer-diamond@ic

198 tasks

swe-lancer-diamond
vall
Adapter for SWE-Lancer (https://github.com/openai/preparedness/blob/main/project/swelancer/README.md). Both manager and individual contributor tasks.
uvx harbor run -d swe-lancer-diamond@all

463 tasks

lawbench
v1.0
LawBench: Benchmarking Legal Knowledge of Large Language Models
uvx harbor run -d lawbench@1.0

1000 tasks

crustbench
v1.0
CRUST-bench: 100 C-to-safe-Rust transpilation tasks from real-world C repositories.
uvx harbor run -d crustbench@1.0

100 tasks

bixbench-cli
v1.5
bixbench-cli: a benchmark for evaluating AI agents on bioinformatics and computational biology tasks, adapted for CLI execution.
uvx harbor run -d bixbench-cli@1.5

205 tasks

spider2-dbt
v1.0
Spider 2.0-DBT is a comprehensive code-generation agent benchmark that includes 68 examples. Solving these tasks requires models to understand project code, navigate complex SQL environments, and handle long contexts, going beyond traditional text-to-SQL challenges.
uvx harbor run -d spider2-dbt@1.0

64 tasks

algotune
v1.0
AlgoTune: 154 algorithm optimization tasks focusing on speedup-based scoring from the AlgoTune benchmark.
uvx harbor run -d algotune@1.0

154 tasks

ineqmath
v1.0
This adapter brings IneqMath, the dev set of the first inequality-proof Q&A benchmark for LLMs, into Harbor, enabling standardized evaluation of models on mathematical reasoning and proof construction.
uvx harbor run -d ineqmath@1.0

100 tasks

ds-1000
vhead
DS-1000 is a code generation benchmark with 1000 realistic data science problems across seven popular Python libraries.
uvx harbor run -d ds-1000@head

1000 tasks

hello-world
v1.0
A simple example task to create a hello.txt file with 'Hello, world!' as content.
uvx harbor run -d hello-world@1.0

1 task

bixbench
v1.5
BixBench - A benchmark for evaluating AI agents on bioinformatics and computational biology tasks.
uvx harbor run -d bixbench@1.5

205 tasks

strongreject
vparity
StrongReject benchmark for evaluating LLM safety and jailbreak resistance. Parity subset with 150 tasks (50 prompts × 3 jailbreaks).
uvx harbor run -d strongreject@parity

150 tasks

arc_agi_2
v1.0
ARC-AGI-2: A benchmark measuring abstract reasoning through visual grid puzzles requiring rule inference and generalization.
uvx harbor run -d arc_agi_2@1.0

167 tasks

humanevalfix
v1.0
HumanEvalFix: 164 Python code repair tasks from HumanEvalPack.
uvx harbor run -d humanevalfix@1.0

164 tasks

mmau
v1.0
MMAU: 1000 carefully curated audio clips paired with human-annotated natural language questions and answers spanning speech, environmental sounds, and music.
uvx harbor run -d mmau@1.0

1000 tasks

swtbench-verified
v1.0
SWTBench Verified: a software testing benchmark for code generation.
uvx harbor run -d swtbench-verified@1.0

433 tasks

mlgym-bench
v1.0
Evaluates agents on ML tasks across computer vision, RL, tabular ML, and game theory.
uvx harbor run -d mlgym-bench@1.0

12 tasks

gpqa-diamond
v1.0
GPQA Diamond subset: 198 graduate-level multiple-choice questions in biology, physics, and chemistry for evaluating scientific reasoning.
uvx harbor run -d gpqa-diamond@1.0

198 tasks

replicationbench
v1.0
ReplicationBench - A benchmark for evaluating AI agents on reproducing computational results from astrophysics research papers. Adapted from Christine8888/replicationbench-release.
uvx harbor run -d replicationbench@1.0

90 tasks

aider-polyglot
v1.0
A polyglot coding benchmark that evaluates AI agents' ability to perform code editing and generation tasks across multiple programming languages.
uvx harbor run -d aider-polyglot@1.0

225 tasks

terminal-bench-pro
v1.0
Terminal-Bench Pro (Public Set) is an extended benchmark dataset for testing AI agents in real terminal environments. From compiling code to training models and setting up servers, Terminal-Bench Pro evaluates how well agents can handle real-world, end-to-end tasks autonomously.
uvx harbor run -d terminal-bench-pro@1.0

200 tasks

swesmith
v1.0
SWE-smith is a synthetically generated dataset of software engineering tasks derived from GitHub issues for training and evaluating code generation models.
uvx harbor run -d swesmith@1.0

100 tasks

swebenchpro
v1.0
SWE-bench Pro: A multi-language software engineering benchmark with 731 instances covering Python, JavaScript/TypeScript, and Go. Evaluates AI systems' ability to resolve real-world bugs and implement features across diverse production codebases. Original benchmark: https://github.com/scaleapi/SWE-bench_Pro-os. Adapter details: https://github.com/laude-institute/harbor/tree/main/adapters/swebenchpro
uvx harbor run -d swebenchpro@1.0

731 tasks

sldbench
v1.0
SLDBench: A benchmark for scaling law discovery with symbolic regression tasks.
uvx harbor run -d sldbench@1.0

8 tasks

compilebench
v1.0
Version 1.0 of CompileBench, a benchmark that tests agents on real open-source projects against dependency hell, legacy toolchains, and complex build systems.
uvx harbor run -d compilebench@1.0

15 tasks

autocodebench
vlite200
Adapter for AutoCodeBench (https://github.com/Tencent-Hunyuan/AutoCodeBenchmark).
uvx harbor run -d autocodebench@lite200

200 tasks

usaco
v2.0
USACO: 304 Python programming problems from USACO competition.
uvx harbor run -d usaco@2.0

304 tasks

aime
v1.0
American Invitational Mathematics Examination (AIME) benchmark for evaluating mathematical reasoning and problem-solving capabilities. Contains 60 competition-level mathematics problems from AIME 2024, 2025-I, and 2025-II competitions.
uvx harbor run -d aime@1.0

60 tasks

codepde
v1.0
CodePDE evaluates code generation capabilities on scientific computing tasks, specifically focusing on Partial Differential Equation (PDE) solving.
uvx harbor run -d codepde@1.0

5 tasks

evoeval
v1.0
EvoEval_difficult: 100 challenging Python programming tasks evolved from HumanEval.
uvx harbor run -d evoeval@1.0

100 tasks

livecodebench
v6.0
A subset of 100 sampled tasks from the release_v6 version of LiveCodeBench tasks.
uvx harbor run -d livecodebench@6.0

100 tasks