Submit to a leaderboard
Upload evaluation jobs to Harbor Hub and submit them to an official leaderboard
After you run a benchmark and upload the job to Harbor Hub, use harbor leaderboard submit to enter the official review queue for a leaderboard. Harbor checks your job against leaderboard rules and either accepts it as a pending submission or explains what to fix.
Available leaderboards
Only terminal-bench/terminal-bench-2-1 can be submitted through Harbor
today. Additional leaderboards will be supported soon; use --leaderboard
with the slug published for each benchmark when they launch.
Before you start
Sign in with harbor auth login, finish your eval run, and upload the job
with harbor upload. You need the job id from the upload output and a
metadata.yaml file that describes your agent and models.
Workflow
flowchart LR
RUN["harbor run"]
UP["harbor upload"]
SUB["harbor leaderboard submit"]
RUN --> UP --> SUB- Run the benchmark using the dataset and settings required by the leaderboard. Many leaderboards require at least five attempts per task; pass
-k 5(or higher) onharbor runwhen that applies.
harbor run -d terminal-bench/terminal-bench-2-1 -a claude-code -m anthropic/claude-opus-4-1 -k 5- Upload the job so Harbor Hub has your config, results, and trial artifacts.
harbor upload jobs/<job_name>/When upload finishes, note the job id in the View at link (the UUID at the end of the URL).
- Submit to the leaderboard with that job id, the leaderboard slug, and your metadata file.
harbor leaderboard submit \
--leaderboard terminal-bench/terminal-bench-2-1 \
--job-id <JOB_UUID> \
--metadata ./metadata.yamlIf submission succeeds, the CLI prints a submission id. That submission stays pending until leaderboard admins review and publish it.
Sign in
harbor auth login
harbor auth statusYou must be signed in as the owner of every job you submit. Jobs created by another account cannot be attached to your submission.
Command reference
harbor leaderboard submit --help| Flag | Short | When you need it | Description |
|---|---|---|---|
--leaderboard | -l | Always | Leaderboard slug (for example terminal-bench/terminal-bench-2-1). |
--job-id | -j | New submissions; adding jobs | Job id from harbor upload. Use multiple times for several jobs in one submission. |
--metadata | -m | New submissions; changing metadata | Path to metadata.yaml. |
--submission | -s | Updating an existing entry | Submission id from a previous successful submit. |
--output | -o | Optional | Save a detailed validation report as JSON. |
New submission
Provide at least one job and metadata:
harbor leaderboard submit -l terminal-bench/terminal-bench-2-1 -j <JOB_UUID> -m ./metadata.yamlAdd another job to a pending submission
Use the same submission id and pass another job id. You do not need to pass metadata again unless you want to change it.
harbor leaderboard submit \
-l terminal-bench/terminal-bench-2-1 \
-s <SUBMISSION_UUID> \
-j <ANOTHER_JOB_UUID>Every job on a submission must use the same dataset version. Trial counts and coverage rules apply across all jobs on that submission together.
Update metadata only
harbor leaderboard submit -l terminal-bench/terminal-bench-2-1 -s <SUBMISSION_UUID> -m ./metadata.yamlmetadata.yaml
Describe the agent and models you evaluated. Harbor checks the file format before submitting.
agent_url: https://github.com/example/my-agent
agent_display_name: My Agent
agent_org_display_name: My Org
models:
- model_name: claude-opus-4-1
model_provider: anthropic
model_display_name: Claude Opus 4.1
model_org_display_name: Anthropic| Field | Description |
|---|---|
agent_url | Link to your agent (repository or product page). |
agent_display_name | Name shown on the leaderboard. |
agent_org_display_name | Organization shown for the agent. |
models | One or more models used in the run. Each entry needs model_name, model_provider, model_display_name, and model_org_display_name. |
The metadata file can live anywhere on disk; Harbor does not pick it up from the job folder automatically.
Validation
Harbor validates your submission before it is accepted. Typical requirements include:
- The leaderboard exists and your jobs belong to you.
- Each job is uploaded with complete trial results for the leaderboard dataset.
- Task versions match what the leaderboard dataset expects.
- At least five trials per task (across all jobs on the submission when you attach more than one job).
- Standard job and trial settings (no custom timeout or resource overrides).
- Trajectories for trials that passed, when the leaderboard requires them.
If validation fails, the CLI lists what failed. Fix the underlying run or upload, then submit again.
When validation passes, you may see an unofficial accuracy figure based on completed trials. That number is informational only; admins still review the full submission.
After static validation, Harbor Hub queues dynamic validation (LLM analyze of trajectories). That runs on a separate worker service, not inside the CLI. Until it completes, dynamic_status on the submission may stay pending or running.
To keep a copy of the full report:
harbor leaderboard submit -l terminal-bench/terminal-bench-2-1 -j <JOB_UUID> -m ./metadata.yaml -o ./validation-report.jsonAfter a successful submit
- Your job is linked to the pending submission and made public so reviewers can inspect it.
- You can add more jobs to the same pending submission with
--submissionand another--job-id. - You generally cannot edit or delete a job after it is part of a submission. Upload corrections as a new job and attach it, or start a new submission if the leaderboard allows it.
Only pending submissions can be updated. Published or rejected submissions cannot be changed through this command.
Multiple jobs in one submission
Shard a large run, rerun failed tasks, or upload incrementally:
harbor leaderboard submit -l terminal-bench/terminal-bench-2-1 -j <JOB_A> -j <JOB_B> -m ./metadata.yamlOr attach jobs one at a time with the same --submission id. Minimum trials per task and dataset consistency are evaluated over the combined set of jobs.
Troubleshooting
| What you see | What to do |
|---|---|
| Not authenticated | Run harbor auth login. |
| No leaderboard matches slug | Check the slug matches Harbor Hub exactly (for example terminal-bench/terminal-bench-2-1). |
| Job not found or not accessible | Confirm the job id from your upload and that you own the job. |
| No trials uploaded | Upload the job again and ensure trials finished successfully. |
| Already linked to another pending submission | That job is already on a different open submission for this leaderboard. Finish or withdraw that submission first, or submit a different job. |
| Minimum trials per task | Run more trials per task (often at least five) on the correct dataset version, then upload and submit again. |
| Different dataset version than the submission | All jobs on one submission must use the same dataset revision. Check config.json / dataset pins on each job. |