Starflow Harness Design

This page defines the critical path for designing Spacecraft agent behavior as Starflow graphs. Starflow Designer (starflow.yaml, the sf CLI, triggers) wraps node execution in Starflow Engine (@celestial/starflow-engine), while Spacecraft supplies the harness blueprint and tools.

Related

- Local registry operator guide: ~/.celestial, hangar, optional hosted publish
- Starflow Designer vs Engine
- Gateway + personal agent
- Sandbox adapters
- VPN adapters
- Starflow Guided RFP Smoke
- Starflow CLI Export

Why this path

We already have a working instance path (sc blueprint -> sc launch --dry-run).
@celestial/starflow-engine already supports harness-relevant node types:
- llm-call
- context-check
- summarize
- hitl-gate
- session-read / session-write
- stream-event
Starflow Designer gives a repeatable execution shell (pipelines, CI-shaped steps) around those behaviors; Starflow Engine executes the graph with durability primitives (wait / resumeWait; see ADR 024).

Critical baseline commands

Local baseline:

npm run spacecraft:harness:design

Through Starflow:

npm run sf:spacecraft-harness-design

The baseline does two things:

- Runs targeted harness behavior tests in @celestial/starflow-engine.
- Re-runs the Spacecraft instance smoke flow.

Harness behavior model (inspired by OpenCode and Kilo)

1) Explicit modes (OpenCode-style)

Adopt a mode boundary in harness design:

- build mode: full execution graph (tools, shell, side effects).
- plan mode: read-only or no-op side-effect paths.

Implementation suggestion:

- Represent mode as a context variable (mode=build|plan).
- Gate side-effect nodes with condition expressions and flag-gate nodes.
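
As a rough illustration of the gating idea, the sketch below filters a node list by mode. The HarnessNode and RunContext shapes and the shouldRun helper are hypothetical names invented for this example; they are not part of the @celestial/starflow-engine API, where the same rule would presumably live in a condition expression or flag-gate node.

```typescript
// Hypothetical shapes for illustration only; not the engine's real schema.
type HarnessMode = "build" | "plan";

interface HarnessNode {
  id: string;
  // True when the node writes files, runs shell commands, or calls external tools.
  sideEffect: boolean;
}

interface RunContext {
  mode: HarnessMode;
}

// The condition a side-effect node would carry: it runs only in build mode.
function shouldRun(node: HarnessNode, ctx: RunContext): boolean {
  return !node.sideEffect || ctx.mode === "build";
}

// Returns the ids of the nodes that would execute under the given context.
function runnableNodes(nodes: HarnessNode[], ctx: RunContext): string[] {
  return nodes.filter((n) => shouldRun(n, ctx)).map((n) => n.id);
}

const demoNodes: HarnessNode[] = [
  { id: "read-repo", sideEffect: false },
  { id: "apply-patch", sideEffect: true },
  { id: "summarize", sideEffect: false },
];

console.log(runnableNodes(demoNodes, { mode: "plan" }).join(", "));  // read-repo, summarize
console.log(runnableNodes(demoNodes, { mode: "build" }).join(", ")); // read-repo, apply-patch, summarize
```

Keeping the rule in one predicate means plan mode cannot reach a side-effect node by accident: a new node only has to declare whether it has side effects.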

2) Specialist subflows (OpenCode subagent pattern)

Split long chains into focused graph segments:

- explore segment (read-only discovery)
- implement segment (edits / execution)
- review segment (validation / diagnostics)

Implementation suggestion:

- Use named nodes and explicit edges to keep transitions inspectable.
- Emit stream-event updates between segments for observability.
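
One way to picture named nodes, explicit edges, and events at segment boundaries is the sketch below. It walks a linear chain and records an event each time an edge crosses from one segment to another. All type and function names here (SegmentNode, GraphEdge, TransitionEvent, collectTransitions) are assumptions made for the example, not the engine's graph schema.

```typescript
// Hypothetical graph shape for illustration; the engine's real schema may differ.
type SegmentName = "explore" | "implement" | "review";

interface SegmentNode {
  id: string;
  segment: SegmentName;
}

interface GraphEdge {
  from: string;
  to: string;
}

interface TransitionEvent {
  type: "stream-event";
  from: SegmentName;
  to: SegmentName;
  enteringNode: string;
}

// Follows edges from startId and records an event whenever an edge crosses a segment boundary.
function collectTransitions(nodes: SegmentNode[], edges: GraphEdge[], startId: string): TransitionEvent[] {
  const byId: Record<string, SegmentNode | undefined> = {};
  for (const n of nodes) byId[n.id] = n;
  const nextOf: Record<string, string | undefined> = {};
  for (const e of edges) nextOf[e.from] = e.to;

  const emitted: TransitionEvent[] = [];
  const seen: Record<string, boolean> = {};
  let current: SegmentNode | undefined = byId[startId];
  // The seen map stops the walk if the edges form a cycle.
  while (current !== undefined && seen[current.id] !== true) {
    seen[current.id] = true;
    const nextId: string | undefined = nextOf[current.id];
    const following: SegmentNode | undefined = nextId === undefined ? undefined : byId[nextId];
    if (following !== undefined && following.segment !== current.segment) {
      emitted.push({ type: "stream-event", from: current.segment, to: following.segment, enteringNode: following.id });
    }
    current = following;
  }
  return emitted;
}

const demoSegmentNodes: SegmentNode[] = [
  { id: "scan", segment: "explore" },
  { id: "read", segment: "explore" },
  { id: "edit", segment: "implement" },
  { id: "verify", segment: "review" },
];
const demoEdges: GraphEdge[] = [
  { from: "scan", to: "read" },
  { from: "read", to: "edit" },
  { from: "edit", to: "verify" },
];

for (const t of collectTransitions(demoSegmentNodes, demoEdges, "scan")) {
  console.log(t.from + " -> " + t.to + " at " + t.enteringNode);
}
```

Because every transition is an explicit edge between named nodes, the boundary events can be derived from the graph itself rather than added by hand in each node.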

3) Isolated parallel variants (Kilo worktree pattern)

Kilo’s worktree parallelism maps cleanly to Starflow’s parallel step model.

Implementation suggestion:

- Run multiple harness variants in parallel steps with different variables/models.
- Compare outputs in a downstream synthesis node.
- Keep merge/apply as an explicit approval step (the policy-level equivalent of a hitl-gate node).
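
The downstream half of this pattern, comparing variant outputs and holding the apply step behind approval, could look roughly like the sketch below. It assumes the parallel steps have already produced one result per variant. The VariantResult fields, the pass-rate comparison, and the function names are illustrative assumptions, not a documented synthesis strategy.

```typescript
// Hypothetical result shape; a real harness variant would report whatever its checks produce.
interface VariantResult {
  variant: string;      // label for the model or strategy under test
  passedChecks: number;
  totalChecks: number;
  patch: string;        // proposed change, left unapplied until approved
}

interface ApplyDecision {
  winner: string;
  applied: boolean;
  reason: string;
}

function passRate(r: VariantResult): number {
  return r.totalChecks === 0 ? 0 : r.passedChecks / r.totalChecks;
}

// Synthesis step: pick the variant with the highest pass rate; ties keep the earlier variant.
function synthesize(results: VariantResult[]): VariantResult {
  if (results.length === 0) throw new Error("no variant results to compare");
  return results.reduce((best, r) => (passRate(r) > passRate(best) ? r : best));
}

// Approval gate: the winning patch is marked as applied only when a reviewer has approved it.
function decideApply(winner: VariantResult, approved: boolean): ApplyDecision {
  return {
    winner: winner.variant,
    applied: approved,
    reason: approved ? "approved by reviewer" : "awaiting approval",
  };
}

const demoResults: VariantResult[] = [
  { variant: "fast-model", passedChecks: 3, totalChecks: 4, patch: "diff A" },
  { variant: "careful-model", passedChecks: 4, totalChecks: 4, patch: "diff B" },
];

const demoWinner = synthesize(demoResults);
console.log(demoWinner.variant);                          // careful-model
console.log(decideApply(demoWinner, false).reason);       // awaiting approval
```

Separating synthesis from the apply decision keeps the comparison automatic while the side effect stays behind a human decision.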

4) Run-script ergonomics (Kilo run button pattern)

Keep a single command to exercise the whole design baseline:

npm run spacecraft:harness:design

This mirrors Kilo’s “run script per worktree” ergonomics: one reliable path for repeated validation.

Immediate next design milestones

- Add a canonical “agent harness behavior” graph fixture in @celestial/starflow-engine.
- Add a Starflow pipeline that executes two model/strategy variants in parallel.
- Add telemetry assertions for loop count, branch selection, and pause/resume transitions.
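
To make the third milestone more concrete, the sketch below reduces a recorded event list to the three quantities a test would assert on. The TelemetryEvent union and summarizeTelemetry are hypothetical; the events the engine actually emits may be named and shaped differently.

```typescript
// Hypothetical event shapes for illustration; not the engine's telemetry format.
type TelemetryEvent =
  | { kind: "loop-iteration"; nodeId: string }
  | { kind: "branch-selected"; nodeId: string; branch: string }
  | { kind: "paused"; nodeId: string }
  | { kind: "resumed"; nodeId: string };

interface TelemetrySummary {
  loopCount: number;
  branches: string[];
  // True when every pause was later resumed and no resume arrived without a pause.
  pauseResumeBalanced: boolean;
}

function summarizeTelemetry(events: TelemetryEvent[]): TelemetrySummary {
  let loopCount = 0;
  const branches: string[] = [];
  let openPauses = 0;
  let balanced = true;
  for (const e of events) {
    switch (e.kind) {
      case "loop-iteration":
        loopCount += 1;
        break;
      case "branch-selected":
        branches.push(e.branch);
        break;
      case "paused":
        openPauses += 1;
        break;
      case "resumed":
        if (openPauses === 0) balanced = false;
        else openPauses -= 1;
        break;
    }
  }
  return { loopCount: loopCount, branches: branches, pauseResumeBalanced: balanced && openPauses === 0 };
}

const demoTelemetry: TelemetryEvent[] = [
  { kind: "loop-iteration", nodeId: "agent-loop" },
  { kind: "branch-selected", nodeId: "context-check", branch: "retry" },
  { kind: "paused", nodeId: "hitl-gate" },
  { kind: "resumed", nodeId: "hitl-gate" },
  { kind: "loop-iteration", nodeId: "agent-loop" },
  { kind: "branch-selected", nodeId: "context-check", branch: "done" },
];

const demoSummary = summarizeTelemetry(demoTelemetry);
console.log(demoSummary.loopCount, demoSummary.branches.join(","), demoSummary.pauseResumeBalanced);
```

A fixture test could then compare the summary against expected values instead of matching the raw event stream line by line.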

Starter scaffold preset

Generate a starter harness package (blueprint + Starflow config):

npm run spacecraft:harness:scaffold -- my-harness

Or generate from Starflow CLI directly:

npm exec -w @celestial/starflow-cli -- starflow harness create my-harness --template=tool-agent

This creates:

- .starsystem/harnesses/my-harness/blueprint.yaml
- .starsystem/harnesses/my-harness/starflow.yaml
- .starsystem/harnesses/my-harness/harness.yaml
- .starsystem/harnesses/my-harness/checksums.txt

Then run the scaffolded Starflow config on a dedicated port:

npm exec -w @celestial/starflow-cli -- starflow server --config=".starsystem/harnesses/my-harness/starflow.yaml" --port=7712

And trigger either mode:

npm exec -w @celestial/starflow-cli -- starflow run harness-plan --port=7712
npm exec -w @celestial/starflow-cli -- starflow run harness-build --port=7712

A/B harness evaluation

Run the same prompt suite through two harnesses and capture side-by-side logs:

npm run spacecraft:harness:ab-eval -- demo-harness demo-harness-tooling

Optional custom prompt file (one prompt per line):

npm run spacecraft:harness:ab-eval -- demo-harness demo-harness-tooling .starsystem/harness-evals/my-prompts.txt

The script writes a timestamped report under .starsystem/harness-evals/. It also writes deterministic scoring artifacts in the same run directory:

- scores.json for CI/machine consumption
- scores.md for human review

To re-score an existing run directory:

npm run spacecraft:harness:score -- .starsystem/harness-evals/<run-id>

Use a custom rubric file (JSON) to tune weights/patterns without editing code:

npm run spacecraft:harness:score -- .starsystem/harness-evals/<run-id> --rubric ./scripts/harness-score-rubric.json
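
For intuition about how a weights/patterns rubric can produce deterministic scores, here is a minimal sketch. The rubric schema shown (a rules array with name, pattern, and weight) is an assumption made for this example; the actual format of scripts/harness-score-rubric.json is not specified on this page and may differ.

```typescript
// Hypothetical rubric schema for illustration; check the real rubric file for its actual fields.
interface RubricRule {
  name: string;
  pattern: string; // regular expression source, matched case-insensitively
  weight: number;
}

interface Rubric {
  rules: RubricRule[];
}

// Returns the fraction of total weight earned by rules whose pattern matches the output.
// The same output and rubric always give the same score, since no randomness or model call is involved.
function scoreOutput(output: string, rubric: Rubric): number {
  let earned = 0;
  let possible = 0;
  for (const rule of rubric.rules) {
    possible += rule.weight;
    if (new RegExp(rule.pattern, "i").test(output)) earned += rule.weight;
  }
  return possible === 0 ? 0 : earned / possible;
}

const demoRubric: Rubric = {
  rules: [
    { name: "names-a-source-file", pattern: "\\.ts\\b", weight: 3 },
    { name: "mentions-tests", pattern: "\\btests?\\b", weight: 1 },
  ],
};

console.log(scoreOutput("Updated src/index.ts and ran the linter.", demoRubric)); // 0.75
console.log(scoreOutput("Updated src/index.ts and added tests.", demoRubric));    // 1
```

Tuning a weight or pattern in the rubric file changes the score without touching the scoring code, which is the property the --rubric flag relies on.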

spacecraft:harness:ab-eval automatically uses scripts/harness-score-rubric.json when present. To override it for one run:

RUBRIC_FILE=./my-rubric.json npm run spacecraft:harness:ab-eval -- demo-harness demo-harness-tooling

Lint the rubric config before running an eval:

npm run spacecraft:harness:rubric:lint -- ./scripts/harness-score-rubric.json
