Task Structure

The Harbor task format is designed to be maximally flexible while still being intuitive to implement. The differences between the Harbor task format and the Terminal-Bench task format are documented here.

Agent skill available

Want your coding agent to guide you through task creation? Install the create-task skill:

npx skills add harbor-framework/harbor --skill create-task

Creating a task

To create a task, you can use the following command:

harbor init --task "<org>/<name>"

This will create a new task directory with the following structure:

instruction.md
task.toml
environment/
├── Dockerfile
└── ...
solution/
├── solve.sh
└── ...
tests/
├── test.sh
└── ...

You can then populate the files with your task's content.
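Before running an agent, it can help to confirm that the directory still contains the files the rest of this page relies on. The following is a minimal sketch, not a Harbor command: check_task_layout is a hypothetical helper, and it assumes a Linux task. It checks only instruction.md, task.toml, and tests/test.sh, because the solution is optional and the files required under environment/ depend on the environment type.

```shell
#!/bin/bash

# check_task_layout is a hypothetical helper (not part of the Harbor CLI).
# It prints one line per missing file and returns non-zero if any file
# named below is absent from the given task directory.
check_task_layout() {
  local task_dir="$1" missing=0 f
  for f in instruction.md task.toml tests/test.sh; do
    if [ ! -f "$task_dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling check_task_layout "<path/to/task>" prints nothing and returns 0 when all three files exist.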

Multi-step tasks

Harbor also supports tasks split into sequential steps, each with its own instruction, tests, and optional pre-agent setup. See Multi-step tasks.

To evaluate an agent on your task, you can use the following command:

harbor run -p "<path/to/task>" -a "<agent>" -m "<model>"

Task files

Instruction

The instruction.md file is a Markdown file containing the instruction given to the agent for the task.

Configuration & Metadata

The task.toml file contains the task's configuration and metadata. Metadata is arbitrary and can consist of any information a task developer wants. Configuration parameters are grouped into their respective sections (such as [verifier], [agent], and [environment]) rather than listed flat.

An example is shown below:

schema_version = "1.1"

[task]
name = "<org>/<name>"
description = "A short description of the task"
authors = [{ name = "Steve Jobs", email = "steve@apple.com" }]
keywords = ["trivial", "programming"]

[metadata]
difficulty_explanation = "Trivial task for demonstration"
category = "programming"

[verifier]
timeout_sec = 120.0
env = { API_KEY = "sk-test-123" }
user = "root"  # optional: run the verifier as this OS user

[agent]
timeout_sec = 120.0
user = "agent"  # optional: run the agent as this OS user

[solution]
env = { API_KEY = "sk-test-123" }

[environment]
build_timeout_sec = 600.0
docker_image = "some-org/some-name:some-tag"
os = "linux"  # or "windows" to target Windows containers
cpus = 1
memory_mb = 2048
storage_mb = 10240
gpus = 0
gpu_types = ["H100", "A100"]
allow_internet = true
env = { SOME_ENV_VAR = "${SOME_ENV_VAR}" } # harbor run requests approval from the user for these env vars

[[environment.mcp_servers]]
name = "mcp-server"
transport = "streamable-http"
url = "http://mcp-server:8000/mcp"

[environment.healthcheck]
command = "curl -f http://localhost:8080/health"
interval_sec = 5.0
timeout_sec = 30.0
retries = 3

The configuration parameters are shown below:

Prop	Type

During task creation, you can pass a --metadata-template flag with a path to a TOML file to pre-populate task.toml with metadata fields and config defaults:

harbor tasks init "<task-name>" --metadata-template task-template.toml

Sections in the template override Harbor's built-in defaults. Anything not specified falls back to the defaults listed above.

Environment

The environment definition is placed in an environment/ folder. Harbor itself does not require any specific file in that directory; the required files depend on the environment type used for the evaluation. For example, to use --env docker, the DockerEnvironment class checks that an environment/Dockerfile or environment/docker-compose.yaml is present. Other environment types may require different files (e.g. an Apptainer environment could check for an image.def file). Most cloud sandbox providers support only Dockerfile-defined environments, not Docker Compose.

The target container OS is declared via [environment].os in task.toml. It defaults to "linux"; set it to "windows" to target Windows containers (see Windows tasks for details). Container-side paths, file transfer, command execution, and script discovery all adapt to this value automatically.

There are a few special paths in the environment's filesystem (Linux paths shown; Windows containers use the C: drive equivalents — C:/logs, C:/tests, C:/solution):

Path	Description
`/logs/verifier/`	Contains the reward file and other verifier logs.
`/logs/agent/`	A directory agents can use to store logs from their runs.
`/solution/`	The solution folder is copied here by the Oracle agent at runtime and executed from the working directory.
`/tests/`	The tests folder is copied here by the Harbor harness at runtime and executed from the working directory.

The /logs/ directory is downloaded to the host after the agent/verifier run, and its contents are often useful for debugging and analysis.

Solution (Optional)

The solution folder must contain a solution/solve.sh script (or solve.bat for Windows tasks; the right extension is selected based on [environment].os). Other dependencies are allowed. This folder is copied to /solution by the Oracle agent at runtime and executed from the working directory.

If no solution is provided, the Oracle agent cannot be used to sanity check the task.

Tests

The tests folder must contain a tests/test.sh script (or test.bat for Windows tasks). The test script should install test dependencies and verify the agent completed the instruction. In Terminal-Bench, this was done by running a pytest command, but this is now left to the task developer.

Other dependencies are allowed in the tests/ folder. This folder is copied to /tests by the Harbor harness at runtime and executed from the working directory. For example, bash /tests/test.sh is executed from /app in many cases.

We recommend using absolute paths in your test script to avoid relative path issues.

Importantly, the test script must produce a reward file in the /logs/verifier/ directory. This is the file that the verifier will read to determine if the task was successful.

There are two ways to produce a reward file:

Reward File	Format	Description
`/logs/verifier/reward.txt`	Plain text (e.g. `1`)	A plain text file containing a single integer or float value, typically `1` for success or `0` for failure.
`/logs/verifier/reward.json`	JSON (e.g. `{ "runtime_sec": 1.23, "accuracy": 0.95, ... }`)	A JSON file that can define multiple metrics as rewards, but they must be floats or integers.

You may use either reward.txt or reward.json as the output of your test script. Harbor will read reward.json by default and fall back to reward.txt.
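As an illustration of the JSON format, the sketch below writes a reward.json containing two numeric metrics. The write_reward_json helper and its arguments are hypothetical, and the metric names are taken from the example in the table above. The target directory is a parameter so the function can be tried outside a container; in a real test script it would be /logs/verifier.

```shell
#!/bin/bash

# write_reward_json DIR ACCURACY RUNTIME_SEC
# Hypothetical helper: writes DIR/reward.json with two numeric metrics.
# The values are inserted unquoted so they stay JSON numbers, not strings.
write_reward_json() {
  local dir="$1" accuracy="$2" runtime_sec="$3"
  mkdir -p "$dir"
  printf '{ "accuracy": %s, "runtime_sec": %s }\n' \
    "$accuracy" "$runtime_sec" > "$dir/reward.json"
}

# In a test script this would typically be called as:
#   write_reward_json /logs/verifier 0.95 1.23
```

The helper does not validate its inputs, so the caller is responsible for passing integers or floats.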

For verifiers with multiple criteria, score aggregation, and LLM judging, see Rewardkit.

Often, the reward can be determined from the exit code of a unit test command.

tests/test.sh

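#!/bin/bash

# Run the tests; pytest's exit code decides the reward.
if uvx pytest /tests/test.py; then
  echo 1 > /logs/verifier/reward.txt  # all tests passed
else
  echo 0 > /logs/verifier/reward.txt  # at least one test failed or errored
fi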
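Because the reward file must exist for the verifier to read a result, one defensive variation is to write a failing reward first and overwrite it only on success. This is a sketch of that pattern, not something Harbor requires: run_and_score is a hypothetical helper, and the reward directory is passed as an argument so the function can be exercised outside a container.

```shell
#!/bin/bash

# run_and_score REWARD_DIR COMMAND [ARGS...]
# Hypothetical helper: writes 0 to REWARD_DIR/reward.txt up front, then
# replaces it with 1 only if COMMAND exits with status 0. If the script is
# interrupted while COMMAND runs, the pessimistic 0 is already on disk.
run_and_score() {
  local reward_dir="$1"
  shift
  mkdir -p "$reward_dir"
  echo 0 > "$reward_dir/reward.txt"
  if "$@"; then
    echo 1 > "$reward_dir/reward.txt"
  fi
}

# In a test script this might be called as:
#   run_and_score /logs/verifier uvx pytest /tests/test.py
```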

Verifier environment (Shared vs Separate)

By default, the verifier runs inside the same container as the agent — it can see the workdir, any tools the agent installed, the agent's environment variables, etc. For tasks that need an isolated grading environment — proprietary grading code that the agent must not see, a different OS for grading, a clean image with verifier-only dependencies — declare a dedicated separate verifier environment.

You opt in by adding [verifier.environment] to task.toml:

[verifier]
environment_mode = "separate"  # optional when [verifier.environment] is present

[verifier.environment]
docker_image = "my-org/grading-image:latest"
cpus = 2
memory_mb = 1024

The two new fields under [verifier]:

environment_mode: "shared" (default) or "separate".
environment: optional sub-section using the same schema as the top-level [environment] block.

Resolution rules when fields are omitted:

`environment_mode`	`[verifier.environment]`	Result
omitted	omitted	`"shared"`
omitted	present	`"separate"`
`"shared"`	omitted	`"shared"`
`"shared"`	present	validation error
`"separate"`	omitted	`"separate"` (uses a fresh copy of the top-level `[environment]`)
`"separate"`	present	`"separate"` (uses the verifier-specific environment)
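The table above can be read as a small decision function. The sketch below encodes it in shell purely for illustration; resolve_verifier_mode is a hypothetical name, not Harbor's implementation. An empty first argument stands for an omitted environment_mode, and the second argument is "yes" or "no" depending on whether [verifier.environment] is present.

```shell
#!/bin/bash

# resolve_verifier_mode MODE HAS_ENV
#   MODE:    "", "shared", or "separate" ("" means environment_mode omitted)
#   HAS_ENV: "yes" if [verifier.environment] is present, otherwise "no"
# Prints the resolved mode, or returns 1 for the invalid combination.
resolve_verifier_mode() {
  local mode="$1" has_env="$2"
  case "$mode:$has_env" in
    ":no" | "shared:no")
      echo "shared" ;;
    ":yes" | "separate:no" | "separate:yes")
      echo "separate" ;;
    "shared:yes")
      echo "validation error: shared mode with [verifier.environment]" >&2
      return 1 ;;
    *)
      echo "unrecognized input: $mode / $has_env" >&2
      return 2 ;;
  esac
}
```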

For separate verifier environments, Harbor builds the verifier image from one of the task's tests/ directories — the step's own steps/<name>/tests/ when it provides an OS-appropriate test script, otherwise the task's top-level tests/. The image must provide /tests/test.sh (Linux) or /tests/test.bat (Windows) on its own; Harbor does not upload tests/ at runtime. A typical layout:

Task directory with separate verifier env

my-task/
├── task.toml
├── instruction.md
├── environment/
│   └── Dockerfile        # agent's environment
└── tests/
    ├── Dockerfile        # verifier's environment (builds /tests/test.sh into the image)
    ├── test.sh
    └── grader.py
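The rule for choosing which tests/ directory becomes the verifier image's build context can be sketched as follows. This is an illustrative approximation under stated assumptions, not Harbor's code: pick_tests_dir is a hypothetical helper, and it treats "provides an OS-appropriate test script" as "a file named test.sh (Linux) or test.bat (Windows) exists in the step's tests/ directory".

```shell
#!/bin/bash

# pick_tests_dir TASK_DIR STEP_NAME [OS]
# Hypothetical helper: prints the step's tests/ directory if it contains the
# test script for the given OS, otherwise the task's top-level tests/.
# OS defaults to "linux"; pass "" as STEP_NAME for a single-step task.
pick_tests_dir() {
  local task_dir="$1" step="$2" os="${3:-linux}" script
  case "$os" in
    windows) script="test.bat" ;;
    *)       script="test.sh" ;;
  esac
  if [ -n "$step" ] && [ -f "$task_dir/steps/$step/tests/$script" ]; then
    echo "$task_dir/steps/$step/tests"
  else
    echo "$task_dir/tests"
  fi
}
```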

What gets transferred from the agent env to the verifier env

When a separate verifier env runs, Harbor copies these inputs from the agent env into the verifier env at the same paths:

/logs/artifacts/ (the agent's "publish" directory — files the agent intentionally produced for grading).
Every artifact listed in the task-level artifacts = field, the trial-level artifacts, and the current step's artifacts = field.

/logs/agent/ and /logs/verifier/ are not transferred implicitly. However, if you explicitly declare them as configured artifacts, they will be transferred — this is the canonical pattern for a trajectory-grading verifier:

artifacts = ["/logs/agent/trajectory.json"]

That single line makes the agent's trajectory file available to the separate verifier container at the same path.
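To make "at the same paths" concrete, the following local simulation uses two host directories as stand-ins for the agent and verifier filesystem roots. It is only a sketch of the path-preserving behavior described above: transfer_artifacts is a hypothetical helper, it handles regular files only, and it says nothing about how Harbor actually moves data between containers.

```shell
#!/bin/bash

# transfer_artifacts SRC_ROOT DST_ROOT PATH [PATH...]
# Hypothetical helper: copies each absolute PATH (a regular file) from under
# SRC_ROOT to the same absolute PATH under DST_ROOT. Missing files are skipped.
transfer_artifacts() {
  local src_root="$1" dst_root="$2" path
  shift 2
  for path in "$@"; do
    if [ -f "$src_root$path" ]; then
      mkdir -p "$dst_root$(dirname "$path")"
      cp "$src_root$path" "$dst_root$path"
    fi
  done
}
```

For example, transfer_artifacts ./agent-root ./verifier-root /logs/agent/trajectory.json places the file at ./verifier-root/logs/agent/trajectory.json.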

Per-step verifier environments (multi-step tasks)

Each step can override the trial-level verifier mode under [steps.verifier]. Mixed shared/separate is supported — for example, an early "build" step that uses the agent env for fast feedback, plus a final "grade" step that uses a separate locked-down grading image:

task.toml (multi-step mixed mode)

[[steps]]
name = "build"
# Inherits the trial-level mode (shared by default).

[[steps]]
name = "grade"
[steps.verifier.environment]
docker_image = "my-org/grading-image:latest"

Step-level resolution:

[steps.verifier].environment_mode (when set) wins.
[steps.verifier.environment] present + mode omitted → implies "separate".
Otherwise the step inherits the trial-level resolution.
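The three step-level rules above apply in order, which the sketch below spells out. resolve_step_mode is a hypothetical helper written for illustration; it takes the already-resolved trial-level mode as its third argument and does not model any validation Harbor may perform on conflicting settings.

```shell
#!/bin/bash

# resolve_step_mode STEP_MODE STEP_HAS_ENV TRIAL_MODE
#   STEP_MODE:    "", "shared", or "separate" ("" means not set on the step)
#   STEP_HAS_ENV: "yes" if [steps.verifier.environment] is present
#   TRIAL_MODE:   the resolved trial-level mode ("shared" or "separate")
resolve_step_mode() {
  local step_mode="$1" step_has_env="$2" trial_mode="$3"
  if [ -n "$step_mode" ]; then
    echo "$step_mode"        # rule 1: an explicit step-level mode wins
  elif [ "$step_has_env" = "yes" ]; then
    echo "separate"          # rule 2: a step environment implies separate
  else
    echo "$trial_mode"       # rule 3: inherit the trial-level resolution
  fi
}
```

Applied to the mixed-mode example above, the "build" step resolves to the trial-level mode and the "grade" step resolves to "separate".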

Tests for each step are validated against the OS of that step's effective verifier environment, not always the top-level [environment].os. So a Linux agent can be graded by a Windows verifier env (and vice versa) — Harbor checks that the corresponding test.bat / test.sh exists in the step's tests/ dir at task-load time.
