Starflow Harness Design

This page defines the critical path for designing Spacecraft agent behavior as Starflow graphs. Starflow Designer (starflow.yaml, the sf CLI, triggers) wraps node execution in Starflow Engine (@celestial/starflow-engine), while Spacecraft supplies the harness blueprint and tools.

  • We already have a working instance path (sc blueprint -> sc launch --dry-run).
  • @celestial/starflow-engine already supports harness-relevant node types:
    • llm-call
    • context-check
    • summarize
    • hitl-gate
    • session-read / session-write
    • stream-event
  • Starflow Designer gives a repeatable execution shell (pipelines, CI-shaped steps) around those behaviors; Starflow Engine executes the graph with durability primitives (wait / resumeWait; see ADR 024).
  • Local baseline:
npm run spacecraft:harness:design
  • Through Starflow:
npm run sf:spacecraft-harness-design

The baseline does two things:

  1. Runs targeted harness behavior tests in @celestial/starflow-engine.
  2. Re-runs the Spacecraft instance smoke flow.

Harness behavior model (inspired by OpenCode and Kilo)

1) Mode boundary (build and plan modes)

Adopt a mode boundary in harness design:

  • build mode: full execution graph (tools, shell, side effects).
  • plan mode: read-only or no-op side-effect paths.

Implementation suggestion:

  • Represent mode as a context variable (mode=build|plan).
  • Gate side-effect nodes with condition expressions and flag-gate nodes.
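
As a rough illustration of the gating idea, the sketch below models mode as a context variable and filters side-effect nodes with a condition function. The `HarnessNode`, `HarnessContext`, `shouldRun`, and `planExecution` names are hypothetical and are not part of @celestial/starflow-engine; the real condition expressions and flag-gate nodes may look different.

```typescript
// Hypothetical shapes for illustration; not the @celestial/starflow-engine API.
type Mode = "build" | "plan";

interface HarnessNode {
  id: string;
  sideEffect: boolean; // true for tools, shell, file writes
}

interface HarnessContext {
  mode: Mode;
}

// Condition expression: side-effect nodes run only in build mode.
function shouldRun(node: HarnessNode, ctx: HarnessContext): boolean {
  return !node.sideEffect || ctx.mode === "build";
}

// Returns the ids of nodes that would execute; gated nodes are skipped.
function planExecution(nodes: HarnessNode[], ctx: HarnessContext): string[] {
  return nodes.filter((node) => shouldRun(node, ctx)).map((node) => node.id);
}

const nodes: HarnessNode[] = [
  { id: "read-files", sideEffect: false },
  { id: "run-shell", sideEffect: true },
  { id: "summarize", sideEffect: false },
];

console.log(planExecution(nodes, { mode: "plan" }));
console.log(planExecution(nodes, { mode: "build" }));
```

Keeping the gate in one predicate means plan mode cannot reach a side effect through a path that forgot to check the mode.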

2) Specialist subflows (OpenCode subagent pattern)

Split long chains into focused graph segments:

  • explore segment (read-only discovery)
  • implement segment (edits / execution)
  • review segment (validation / diagnostics)

Implementation suggestion:

  • Use named nodes and explicit edges to keep transitions inspectable.
  • Emit stream-event updates between segments for observability.
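
The segment split can be sketched as a small graph of named nodes with explicit edges, where each crossing between segments records an event. This is only a model of the idea: the `GraphNode` and `StreamEvent` shapes and the `walk` function are assumptions made for illustration, not the engine's schema or its stream-event node.

```typescript
// Hypothetical graph shape for illustration; not the engine's real schema.
type Segment = "explore" | "implement" | "review";

interface GraphNode {
  id: string;
  segment: Segment;
  next?: string; // explicit edge to the following node
}

interface StreamEvent {
  type: "segment-change";
  from: Segment;
  to: Segment;
}

// Follows explicit edges from startId and records one event per segment boundary.
function walk(
  graph: Map<string, GraphNode>,
  startId: string,
): { path: string[]; events: StreamEvent[] } {
  const path: string[] = [];
  const events: StreamEvent[] = [];
  const seen = new Set<string>(); // guards against cycles
  let current: GraphNode | undefined = graph.get(startId);
  while (current && !seen.has(current.id)) {
    seen.add(current.id);
    path.push(current.id);
    const next: GraphNode | undefined = current.next ? graph.get(current.next) : undefined;
    if (next && next.segment !== current.segment) {
      events.push({ type: "segment-change", from: current.segment, to: next.segment });
    }
    current = next;
  }
  return { path, events };
}

const nodeList: GraphNode[] = [
  { id: "scan-repo", segment: "explore", next: "draft-patch" },
  { id: "draft-patch", segment: "implement", next: "apply-patch" },
  { id: "apply-patch", segment: "implement", next: "run-checks" },
  { id: "run-checks", segment: "review" },
];
const graph = new Map<string, GraphNode>(
  nodeList.map((node): [string, GraphNode] => [node.id, node]),
);

console.log(walk(graph, "scan-repo"));
```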

3) Isolated parallel variants (Kilo worktree pattern)

Kilo’s worktree parallelism maps cleanly to Starflow’s parallel step model.

Implementation suggestion:

  • Run multiple harness variants in parallel steps with different variables/models.
  • Compare outputs in a downstream synthesis node.
  • Keep merge/apply as an explicit approval step (the policy-level equivalent of a hitl-gate node).
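
A minimal sketch of the pattern above, assuming a stand-in runner in place of a real model call: variants run concurrently, a synthesis step picks a winner, and applying the result stays behind an explicit approval flag. `Variant`, `runVariant`, `synthesize`, and `applyIfApproved` are hypothetical names, and scoring by output length is a placeholder for a real comparison.

```typescript
// Hypothetical types and stand-in runner; a real harness would call a model here.
interface Variant {
  name: string;
  model: string;
}

interface VariantResult {
  variant: string;
  output: string;
  score: number;
}

// Stand-in for one isolated harness run (placeholder score: output length).
async function runVariant(variant: Variant, prompt: string): Promise<VariantResult> {
  const output = `[${variant.model}] ${prompt}`;
  return { variant: variant.name, output, score: output.length };
}

// Downstream synthesis: highest score wins; ties keep the earlier variant.
function synthesize(results: VariantResult[]): VariantResult {
  if (results.length === 0) throw new Error("no variant results to compare");
  return results.reduce((best, candidate) => (candidate.score > best.score ? candidate : best));
}

// Merge/apply happens only with explicit approval.
function applyIfApproved(winner: VariantResult, approved: boolean): string {
  return approved ? `applied:${winner.variant}` : `pending-approval:${winner.variant}`;
}

// Runs all variants concurrently, then hands the results to synthesis.
async function compareVariants(variants: Variant[], prompt: string): Promise<VariantResult> {
  const results = await Promise.all(variants.map((variant) => runVariant(variant, prompt)));
  return synthesize(results);
}
```

Separating `synthesize` from `applyIfApproved` keeps the comparison automatic while the side effect waits on a human decision.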

4) Run-script ergonomics (Kilo run button pattern)

Keep a single command to exercise the whole design baseline:

  • npm run spacecraft:harness:design

This mirrors Kilo’s “run script per worktree” ergonomics: one reliable path for repeated validation.

Next steps:

  1. Add a canonical “agent harness behavior” graph fixture in @celestial/starflow-engine.
  2. Add a Starflow pipeline that executes two model/strategy variants in parallel.
  3. Add telemetry assertions for loop count, branch selection, and pause/resume transitions.
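
The telemetry assertions in step 3 could take roughly the following form. The `TelemetryEvent` shape and `summarizeTelemetry` helper are assumptions made for this sketch; the events the engine actually emits may be structured differently.

```typescript
// Hypothetical telemetry event shape; the engine's real events may differ.
type TelemetryEvent =
  | { kind: "loop-iteration"; nodeId: string }
  | { kind: "branch"; nodeId: string; taken: string }
  | { kind: "pause"; nodeId: string }
  | { kind: "resume"; nodeId: string };

interface TelemetrySummary {
  loopCount: number;
  branches: Record<string, string>; // nodeId -> last branch taken
  pauseResumeBalanced: boolean;
}

function summarizeTelemetry(events: TelemetryEvent[]): TelemetrySummary {
  let loopCount = 0;
  let openPauses = 0; // pauses not yet resumed
  let balanced = true;
  const branches: Record<string, string> = {};
  for (const event of events) {
    switch (event.kind) {
      case "loop-iteration":
        loopCount += 1;
        break;
      case "branch":
        branches[event.nodeId] = event.taken;
        break;
      case "pause":
        openPauses += 1;
        break;
      case "resume":
        // A resume with no matching pause is an invalid transition.
        if (openPauses === 0) balanced = false;
        else openPauses -= 1;
        break;
    }
  }
  return { loopCount, branches, pauseResumeBalanced: balanced && openPauses === 0 };
}

const sampleEvents: TelemetryEvent[] = [
  { kind: "loop-iteration", nodeId: "llm-call" },
  { kind: "branch", nodeId: "context-check", taken: "summarize" },
  { kind: "loop-iteration", nodeId: "llm-call" },
  { kind: "pause", nodeId: "hitl-gate" },
  { kind: "resume", nodeId: "hitl-gate" },
];

console.log(summarizeTelemetry(sampleEvents));
```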

Generate a starter harness package (blueprint + Starflow config):

npm run spacecraft:harness:scaffold -- my-harness

Or generate from Starflow CLI directly:

npm exec -w @celestial/starflow-cli -- starflow harness create my-harness --template=tool-agent

This creates:

  • .starsystem/harnesses/my-harness/blueprint.yaml
  • .starsystem/harnesses/my-harness/starflow.yaml
  • .starsystem/harnesses/my-harness/harness.yaml
  • .starsystem/harnesses/my-harness/checksums.txt

Then run the scaffolded Starflow config on a dedicated port:

npm exec -w @celestial/starflow-cli -- starflow server --config=".starsystem/harnesses/my-harness/starflow.yaml" --port=7712

And trigger either mode:

npm exec -w @celestial/starflow-cli -- starflow run harness-plan --port=7712
npm exec -w @celestial/starflow-cli -- starflow run harness-build --port=7712

Run the same prompt suite through two harnesses and capture side-by-side logs:

npm run spacecraft:harness:ab-eval -- demo-harness demo-harness-tooling

Optionally pass a custom prompt file (one prompt per line):

npm run spacecraft:harness:ab-eval -- demo-harness demo-harness-tooling .starsystem/harness-evals/my-prompts.txt

The script writes a timestamped report under .starsystem/harness-evals/. It also writes deterministic scoring artifacts in the same run directory:

  • scores.json for CI/machine consumption
  • scores.md for human review

To re-score an existing run directory:

npm run spacecraft:harness:score -- .starsystem/harness-evals/<run-id>

Use a custom rubric file (JSON) to tune weights/patterns without editing code:

npm run spacecraft:harness:score -- .starsystem/harness-evals/<run-id> --rubric ./scripts/harness-score-rubric.json
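
To show how weights and patterns can drive deterministic scoring, here is a sketch under an assumed rubric shape. The schema of scripts/harness-score-rubric.json is not documented on this page, so `RubricRule`, its fields, and `scoreOutput` are hypothetical and only illustrate the general approach.

```typescript
// Hypothetical rubric shape; the real rubric file schema is not shown on this page.
interface RubricRule {
  name: string;
  pattern: string; // regular expression source, matched case-insensitively
  weight: number; // added to the total when the pattern matches
}

interface ScoreBreakdown {
  total: number;
  matched: string[];
}

// Same output and rubric always yield the same score: no randomness, no model calls.
function scoreOutput(output: string, rules: RubricRule[]): ScoreBreakdown {
  const matched: string[] = [];
  let total = 0;
  for (const rule of rules) {
    if (new RegExp(rule.pattern, "i").test(output)) {
      matched.push(rule.name);
      total += rule.weight;
    }
  }
  return { total, matched };
}

const exampleRubric: RubricRule[] = [
  { name: "reports-tests", pattern: "tests? passed", weight: 2 },
  { name: "mentions-error", pattern: "\\berror\\b", weight: -1 },
];

console.log(scoreOutput("3 tests passed", exampleRubric));
console.log(scoreOutput("All tests passed with no error", exampleRubric));
```

Because the rules are plain data, tuning a weight or pattern changes the scores without touching the scoring code.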

spacecraft:harness:ab-eval automatically uses scripts/harness-score-rubric.json when present. To override it for one run:

RUBRIC_FILE=./my-rubric.json npm run spacecraft:harness:ab-eval -- demo-harness demo-harness-tooling

Lint the rubric config before running an eval:

npm run spacecraft:harness:rubric:lint -- ./scripts/harness-score-rubric.json