Submit to a leaderboard

Upload evaluation jobs to Harbor Hub and submit them to an official leaderboard

After you run a benchmark and upload the job to Harbor Hub, use harbor leaderboard submit to enter the official review queue for a leaderboard. Harbor checks your job against leaderboard rules and either accepts it as a pending submission or explains what to fix.

Available leaderboards

terminal-bench/terminal-bench-2-1 is the only leaderboard that accepts submissions through Harbor today. Additional leaderboards will be supported soon; when they launch, pass the slug published for each benchmark to --leaderboard.

Before you start

Sign in with harbor auth login, finish your eval run, and upload the job with harbor upload. You need the job id from the upload output and a metadata.yaml file that describes your agent and models.

Workflow

harbor run -> harbor upload -> harbor leaderboard submit

  1. Run the benchmark using the dataset and settings required by the leaderboard. Many leaderboards require at least five attempts per task; pass -k 5 (or higher) on harbor run when that applies.

harbor run -d terminal-bench/terminal-bench-2-1 -a claude-code -m anthropic/claude-opus-4-1 -k 5

  2. Upload the job so Harbor Hub has your config, results, and trial artifacts.

harbor upload jobs/<job_name>/

When upload finishes, note the job id in the View at link (the UUID at the end of the URL).

  3. Submit to the leaderboard with that job id, the leaderboard slug, and your metadata file.

harbor leaderboard submit \
  --leaderboard terminal-bench/terminal-bench-2-1 \
  --job-id <JOB_UUID> \
  --metadata ./metadata.yaml

If submission succeeds, the CLI prints a submission id. That submission stays pending until leaderboard admins review and publish it.
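
The workflow above takes the job id from the UUID at the end of the View at link. If you script the workflow, that step can be sketched as below. This is only an illustration: the function name job_id_from_url is hypothetical, and the exact shape of the link printed by harbor upload is an assumption (any URL whose path ends in a UUID will work).

```python
import re
import uuid

# Matches a UUID (8-4-4-4-12 hex digits) at the very end of a URL path,
# allowing one optional trailing slash.
UUID_AT_END = re.compile(
    r"([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})/?$"
)


def job_id_from_url(url: str) -> str:
    """Return the UUID that ends the URL path, or raise ValueError.

    Hypothetical helper: the real "View at" link may carry a different
    host or path prefix; only the trailing UUID matters here.
    """
    # Drop any query string or fragment before looking at the path.
    path = url.split("?", 1)[0].split("#", 1)[0].strip()
    match = UUID_AT_END.search(path)
    if match is None:
        raise ValueError(f"no UUID at the end of {url!r}")
    # uuid.UUID validates the value and normalizes it to lowercase.
    return str(uuid.UUID(match.group(1)))
```

The returned string is what you would pass to --job-id.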

Sign in

harbor auth login
harbor auth status

You must be signed in as the owner of every job you submit. Jobs created by another account cannot be attached to your submission.

Command reference

harbor leaderboard submit --help

| Flag | Short | When you need it | Description |
| --- | --- | --- | --- |
| --leaderboard | -l | Always | Leaderboard slug (for example terminal-bench/terminal-bench-2-1). |
| --job-id | -j | New submissions; adding jobs | Job id from harbor upload. Use multiple times for several jobs in one submission. |
| --metadata | -m | New submissions; changing metadata | Path to metadata.yaml. |
| --submission | -s | Updating an existing entry | Submission id from a previous successful submit. |
| --output | -o | Optional | Save a detailed validation report as JSON. |

New submission

Provide at least one job and metadata:

harbor leaderboard submit -l terminal-bench/terminal-bench-2-1 -j <JOB_UUID> -m ./metadata.yaml

Add another job to a pending submission

Use the same submission id and pass another job id. You do not need to pass metadata again unless you want to change it.

harbor leaderboard submit \
  -l terminal-bench/terminal-bench-2-1 \
  -s <SUBMISSION_UUID> \
  -j <ANOTHER_JOB_UUID>

Every job on a submission must use the same dataset version. Trial counts and coverage rules apply across all jobs on that submission together.

Update metadata only

harbor leaderboard submit -l terminal-bench/terminal-bench-2-1 -s <SUBMISSION_UUID> -m ./metadata.yaml
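
The three invocation patterns above (new submission, add a job, update metadata) differ only in which flags are present. A minimal sketch of a wrapper that assembles the argument list follows. The function build_submit_args is hypothetical and not part of Harbor; it uses only the flags listed in the command reference. The check that an update carries a job id or a metadata path is an assumption of this sketch, not documented CLI behavior.

```python
from typing import List, Optional, Sequence


def build_submit_args(
    leaderboard: str,
    job_ids: Sequence[str] = (),
    metadata: Optional[str] = None,
    submission: Optional[str] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `harbor leaderboard submit` (hypothetical helper)."""
    if not leaderboard:
        raise ValueError("--leaderboard is always required")
    if submission is None and (not job_ids or metadata is None):
        # New submission: at least one job and a metadata file.
        raise ValueError("a new submission needs --job-id and --metadata")
    if submission is not None and not job_ids and metadata is None:
        # Assumption of this sketch: an update with nothing to change is a mistake.
        raise ValueError("an update needs --job-id, --metadata, or both")

    args = ["harbor", "leaderboard", "submit", "--leaderboard", leaderboard]
    if submission is not None:
        args += ["--submission", submission]
    for job_id in job_ids:
        # --job-id may be repeated for several jobs in one submission.
        args += ["--job-id", job_id]
    if metadata is not None:
        args += ["--metadata", metadata]
    if output is not None:
        args += ["--output", output]
    return args
```

The resulting list can be handed to subprocess.run(args, check=True) on a machine where harbor is installed and you are signed in.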

metadata.yaml

Describe the agent and models you evaluated. Harbor checks the file format before submitting.

agent_url: https://github.com/example/my-agent
agent_display_name: My Agent
agent_org_display_name: My Org

models:
  - model_name: claude-opus-4-1
    model_provider: anthropic
    model_display_name: Claude Opus 4.1
    model_org_display_name: Anthropic

| Field | Description |
| --- | --- |
| agent_url | Link to your agent (repository or product page). |
| agent_display_name | Name shown on the leaderboard. |
| agent_org_display_name | Organization shown for the agent. |
| models | One or more models used in the run. Each entry needs model_name, model_provider, model_display_name, and model_org_display_name. |

The metadata file can live anywhere on disk; Harbor does not pick it up from the job folder automatically.
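
To catch obvious mistakes before you submit, you can check the parsed file against the field list above. The sketch below is not Harbor's own format check. The function metadata_problems is hypothetical, it expects the YAML to be parsed into a dict already, and it assumes every listed field must be a non-empty string.

```python
from typing import Any, Dict, List

AGENT_FIELDS = ("agent_url", "agent_display_name", "agent_org_display_name")
MODEL_FIELDS = (
    "model_name",
    "model_provider",
    "model_display_name",
    "model_org_display_name",
)


def metadata_problems(metadata: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: List[str] = []
    for field in AGENT_FIELDS:
        value = metadata.get(field)
        # Assumption: each agent field is a required, non-empty string.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field} is missing or empty")

    models = metadata.get("models")
    if not isinstance(models, list) or not models:
        problems.append("models must list at least one model")
        return problems

    for index, model in enumerate(models):
        for field in MODEL_FIELDS:
            value = model.get(field) if isinstance(model, dict) else None
            if not isinstance(value, str) or not value.strip():
                problems.append(f"models[{index}].{field} is missing or empty")
    return problems
```

One way to get the dict is a YAML parser such as PyYAML's yaml.safe_load; Harbor still runs its own check when you submit.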

Validation

Harbor validates your submission before it is accepted. Typical requirements include:

  • The leaderboard exists and your jobs belong to you.
  • Each job is uploaded with complete trial results for the leaderboard dataset.
  • Task versions match what the leaderboard dataset expects.
  • At least five trials per task (across all jobs on the submission when you attach more than one job).
  • Standard job and trial settings (no custom timeout or resource overrides).
  • Trajectories for trials that passed, when the leaderboard requires them.

If validation fails, the CLI lists what failed. Fix the underlying run or upload, then submit again.
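
The trials-per-task rule counts trials across every job on the submission, so a task can reach the minimum with three trials in one job and two in another. A rough local pre-check might look like the sketch below. The function tasks_below_minimum and the (job_id, task_name) input shape are hypothetical; how you list trials from your job folders is up to you, and the default of five comes from the typical requirement above and may differ per leaderboard.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Typical minimum from the requirements above; a leaderboard may set its own.
MIN_TRIALS_PER_TASK = 5


def tasks_below_minimum(
    expected_tasks: Iterable[str],
    trials: Iterable[Tuple[str, str]],
    minimum: int = MIN_TRIALS_PER_TASK,
) -> Dict[str, int]:
    """Map each under-covered task to its combined trial count.

    `trials` holds one (job_id, task_name) pair per trial. The job id is
    ignored when counting because coverage is evaluated over all jobs together.
    """
    counts = Counter(task for _job_id, task in trials)
    # Counter returns 0 for tasks that have no trials at all.
    return {
        task: counts[task]
        for task in sorted(set(expected_tasks))
        if counts[task] < minimum
    }
```

An empty result means every expected task meets the minimum; anything else names the tasks to rerun before you submit again.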

When validation passes, you may see an unofficial accuracy figure based on completed trials. That number is informational only; admins still review the full submission.

After static validation, Harbor Hub queues dynamic validation (an LLM analysis of the trajectories). That runs on a separate worker service, not inside the CLI. Until it completes, dynamic_status on the submission may stay pending or running.

To keep a copy of the full report:

harbor leaderboard submit -l terminal-bench/terminal-bench-2-1 -j <JOB_UUID> -m ./metadata.yaml -o ./validation-report.json

After a successful submit

  • Your job is linked to the pending submission and made public so reviewers can inspect it.
  • You can add more jobs to the same pending submission with --submission and another --job-id.
  • You generally cannot edit or delete a job after it is part of a submission. Upload corrections as a new job and attach it, or start a new submission if the leaderboard allows it.

Only pending submissions can be updated. Published or rejected submissions cannot be changed through this command.

Multiple jobs in one submission

Attach several jobs to one submission to shard a large run, rerun failed tasks, or upload incrementally:

harbor leaderboard submit -l terminal-bench/terminal-bench-2-1 -j <JOB_A> -j <JOB_B> -m ./metadata.yaml

Or attach jobs one at a time with the same --submission id. Minimum trials per task and dataset consistency are evaluated over the combined set of jobs.

Troubleshooting

| What you see | What to do |
| --- | --- |
| Not authenticated | Run harbor auth login. |
| No leaderboard matches slug | Check the slug matches Harbor Hub exactly (for example terminal-bench/terminal-bench-2-1). |
| Job not found or not accessible | Confirm the job id from your upload and that you own the job. |
| No trials uploaded | Upload the job again and ensure trials finished successfully. |
| Already linked to another pending submission | That job is already on a different open submission for this leaderboard. Finish or withdraw that submission first, or submit a different job. |
| Minimum trials per task | Run more trials per task (often at least five) on the correct dataset version, then upload and submit again. |
| Different dataset version than the submission | All jobs on one submission must use the same dataset revision. Check config.json / dataset pins on each job. |
