Multi-step Tasks

Splitting a task into sequential steps with per-step instructions, tests, and setup hooks

A multi-step task runs an agent through a sequence of ordered steps against a single, shared environment. Each step has its own instruction, tests, and optional setup hook. Steps share an environment, run sequentially, and produce per-step verifier results that roll up into a single trial-level reward.

Multi-step tasks are useful for implementing long-horizon tasks with early-stopping conditions, for testing continual-learning methods such as memory, and for observing an agent's ability to build on its prior work.

Agent skill available

Want your coding agent to guide you through creating a multi-step task? Install the create-task skill:

npx skills add harbor-framework/harbor --skill create-task

Directory layout

A multi-step task replaces the single-step instruction.md, tests/, and solution/ at the task root with a steps/ directory containing one sub-directory per step:

task.toml
environment/
  Dockerfile            # shared environment, built once for all steps
tests/                  # optional: shared helpers and a fallback test.sh
steps/
  scaffold/
    instruction.md
    tests/
      test.sh
    workdir/            # staged into WORKDIR before this step's agent runs
      setup.sh
  implement/
    instruction.md
    tests/
      test.sh
  document/
    instruction.md
    tests/
      test.sh

The task-level environment/ directory (with the Dockerfile and shared environment assets) still lives at the task root — the environment is built once and shared across all steps. An optional task-level tests/ directory provides shared test helpers (and optionally a fallback test.sh) that every step can use. It's uploaded to /tests for each step's verification; any step-level tests/ files are uploaded afterward and override same-named files, so the step's own test.sh wins when present.
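The layering can be pictured as a dict merge. This is an illustrative sketch only (file contents as strings, not Harbor's actual upload mechanics): task-level files land in /tests first, and step-level files are uploaded on top.

```python
# Sketch of test-file layering (illustrative, not Harbor's code):
# the task-level tests/ directory is uploaded to /tests first, then the
# step-level tests/ is uploaded afterward, overriding same-named files.
task_tests = {"helpers.sh": "shared helpers", "test.sh": "fallback test"}
step_tests = {"test.sh": "step-specific test"}

uploaded = {**task_tests, **step_tests}  # the later upload wins

print(uploaded["test.sh"])   # the step's own test.sh takes effect
print(sorted(uploaded))      # shared helpers remain available
```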

Configuration

Declare steps in task.toml using [[steps]] array-of-tables entries. Their order in the file determines execution order.

schema_version = "1.1"

[task]
name = "harbor/example-multi-step"
description = "A three-step example task"

[environment]
build_timeout_sec = 600.0
workdir = "/app"

[[steps]]
name = "scaffold"
# Abort the trial if this step's reward falls below 1.0 —
# later steps depend on a working scaffold.
min_reward = 1.0

[steps.agent]
timeout_sec = 60.0

[steps.verifier]
timeout_sec = 30.0

[[steps]]
name = "implement"
min_reward = 0.5

[steps.agent]
timeout_sec = 120.0

[steps.verifier]
timeout_sec = 30.0

[[steps]]
name = "document"

[steps.agent]
timeout_sec = 60.0

[steps.verifier]
timeout_sec = 30.0

Step fields

  • name (string) — the step's identifier; it matches the step's steps/{name}/ sub-directory.
  • min_reward (float or dict) — optional early-stop threshold(s); see Early stopping with min_reward below.
  • agent.timeout_sec (float) — time limit for the step's agent phase.
  • verifier.timeout_sec (float) — time limit for the step's verification.
  • healthcheck — optional per-step healthcheck; see below.
  • artifacts (list of strings) — optional extra paths snapshotted after the step's verification.
The per-step healthcheck runs after the step's workdir/setup.sh (if any) completes and before the agent starts. It supplements the top-level environment healthcheck rather than replacing it; a failure aborts the step and the trial.

The workdir/ directory

Anything you place under steps/{name}/workdir/ is uploaded to the container's WORKDIR before the agent runs for that step. This is the mechanism for staging step-specific files — fixtures, configs, seed data — into the location the agent will work from.

Files persist across steps

The container filesystem is shared across all steps in a trial. Files left in WORKDIR by step N are visible to step N+1 (and to its agent and verifier). This is intentional — it's how multi-step tasks compose work — but means step N's workdir/ uploads can clobber files that an earlier step's agent created. If that matters, either pick non-colliding filenames or have an earlier step's setup.sh preserve/rename state.

workdir/setup.sh (reserved)

If steps/{name}/workdir/setup.sh exists, it is executed after the step's workdir/ contents are uploaded and before the agent runs. Use it for per-step prep that the agent shouldn't have to do itself: seeding a database, installing a one-off dependency, resetting file permissions, or running a pre-flight check.

Execution contract:

  • cwd is WORKDIR (so relative paths to sibling workdir/ files work naturally).
  • User is the step's agent user (same as the agent will run under).
  • Non-zero exit aborts the step: exception_info is recorded on the step result, the agent and verifier do not run for this step, and remaining steps are skipped.
  • The script stays in WORKDIR after execution — it's uploaded like any other workdir/ file. If the agent shouldn't see it, have the script remove itself on its last line:
#!/usr/bin/env bash
set -euo pipefail

# ... your setup work ...

rm -- "$0"

Early stopping with min_reward

min_reward gates whether remaining steps run based on the current step's verifier result. Two shapes are supported:

  • Scalar: min_reward = 1.0 — gates on rewards["reward"] (the 1D convention written by reward.txt). Abort if below.
  • Dict: min_reward = { correctness = 0.8, style = 0.5 } — gates on each declared key. Abort if any key falls below its threshold or is missing from the rewards dict. Use this for multi-dim rewards or when you want to gate on a non-default key.

The trial-level reward is computed from whatever steps did run. Edge cases for the threshold check:

  • Missing reward key or no verifier_result at all → treated as -inf (aborts).
  • When config.verifier.disable is true at the trial level, min_reward is ignored (there's nothing to compare against) — a debug log notes the skip.

Exceptions during a step (agent crash, setup failure) abort the trial independently of min_reward; the threshold check is in addition to, not in place of, the exception path.
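The threshold check can be sketched as follows. This is a minimal illustration of the semantics described above, not Harbor's implementation; the function name and signature are invented for this example:

```python
import math

def should_abort(min_reward, rewards):
    """Illustrative min_reward gate (not Harbor's actual code).

    min_reward: float, dict of key -> threshold, or None (no gate).
    rewards: the step's reward dict, or None if there is no verifier_result.
    """
    if min_reward is None:
        return False
    if isinstance(min_reward, dict):
        gates = min_reward.items()
    else:
        gates = [("reward", min_reward)]  # scalar gates on the default key
    for key, threshold in gates:
        # A missing key (or no verifier_result at all) counts as -inf,
        # so it always trips the gate.
        value = (rewards or {}).get(key, -math.inf)
        if value < threshold:
            return True
    return False
```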

Trial-level reward: multi_step_reward_strategy

After all steps that will run have completed, Harbor derives a single trial-level verifier_result from the per-step results. The optional multi_step_reward_strategy field at the task root selects how (defaults to "mean" when unset):
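The field is set at the task root of task.toml; the value below is illustrative:

```toml
# task.toml (task root)
multi_step_reward_strategy = "final"   # defaults to "mean" when unset
```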

  • "mean" (default) — the trial-level reward averages the per-step rewards across the steps that ran.
  • "final" — the trial-level verifier_result is that of the last step that ran.
"final" is the right choice when the last step is an end-to-end verifier whose reward dict already represents the full-task signal.

Early stops and `final`

If a step's min_reward triggers an abort, "final" uses the aborted step's verifier_result, not the step the task author thought of as the final one. Keep this in mind when designing thresholds alongside the "final" strategy.
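A sketch of how the two strategies roll up per-step results, including the early-stop interaction. This is illustrative only; the function and variable names are invented:

```python
import math

def trial_reward(step_rewards, strategy="mean"):
    """Roll per-step 'reward' values into one trial-level reward (sketch).

    step_rewards: reward dicts for the steps that actually ran, in order.
    """
    values = [r.get("reward", -math.inf) for r in step_rewards]
    if strategy == "final":
        # After a min_reward abort, the last entry is the aborted step's
        # result, not the task's nominal final step.
        return values[-1]
    return sum(values) / len(values)  # "mean", the default

# Three steps planned, but the second step aborted the trial:
ran = [{"reward": 1.0}, {"reward": 0.0}]
print(trial_reward(ran))           # mean over the steps that ran
print(trial_reward(ran, "final"))  # the aborted step's reward
```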

Artifacts per step

Artifacts snapshot paths out of the environment into the trial directory. For a multi-step trial, collection runs once per step into steps/{name}/artifacts/ after that step's verification. The list of paths collected at each pass is the concatenation, in order, of:

  1. Task-level artifacts (task.toml root)
  2. Trial-level artifacts (passed via TrialConfig)
  3. Step-level steps[].artifacts
For example, in task.toml:

# Snapshotted after every step (e.g., the script that evolves across steps).
artifacts = ["/app/greet.sh"]

[[steps]]
name = "document"
# Only the document step writes README.md, so only collect it here.
artifacts = ["/app/README.md"]

After this trial, steps/document/artifacts/ contains both greet.sh (task-level) and README.md (step-level); steps/scaffold/artifacts/ contains only greet.sh.
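The collection order can be sketched as list concatenation, using the paths from the example above (illustrative only, not Harbor's code):

```python
# Per-step artifact collection (sketch): the paths collected after each
# step's verification are the in-order concatenation of the task-level,
# trial-level, and step-level lists.
task_artifacts = ["/app/greet.sh"]   # task.toml root
trial_artifacts = []                 # passed via TrialConfig (none here)
step_artifacts = {
    "scaffold": [],
    "implement": [],
    "document": ["/app/README.md"],
}

def collected(step):
    return task_artifacts + trial_artifacts + step_artifacts[step]

print(collected("scaffold"))   # only the task-level path
print(collected("document"))   # task-level path plus README.md
```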

Example

A comprehensive worked example lives at examples/tasks/hello-multi-step-advanced/ in the Harbor repo. It exercises per-step instructions, per-step workdir/ uploads, per-step verifier environment variables, per-step healthchecks, min_reward early-stopping, and artifact collection per step (including a step-level artifact on the final step).
