Starflow Harness Design
This page defines the critical path for designing Spacecraft agent behavior as Starflow graphs: Starflow Designer (starflow.yaml, sf CLI, triggers) wraps Starflow Engine (@celestial/starflow-engine) node execution, while Spacecraft supplies the harness blueprint and tools.
Related
- Local registry operator guide: ~/.celestial, hangar, optional hosted publish
- Starflow Designer vs Engine
- Gateway + personal agent
- Sandbox adapters
- VPN adapters
- Starflow Guided RFP Smoke
- Starflow CLI Export
Why this path
- We already have a working instance path (sc blueprint -> sc launch --dry-run).
- @celestial/starflow-engine already supports harness-relevant node types:
  - llm-call
  - context-check
  - summarize
  - hitl-gate
  - session-read / session-write
  - stream-event
- Starflow Designer gives a repeatable execution shell (pipelines, CI-shaped steps) around those behaviors; Starflow Engine executes the graph with durability primitives (wait / resumeWait; see ADR 024).
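As a rough sketch of how those node types might compose, the following TypeScript describes a small harness graph as plain data. The HarnessGraph, GraphNode, and GraphEdge shapes, the node ids, and the danglingEdges helper are assumptions made for illustration; they are not the @celestial/starflow-engine schema.

```typescript
// Hypothetical graph shape for illustration only; not the engine's real schema.
type NodeType =
  | "llm-call"
  | "context-check"
  | "summarize"
  | "hitl-gate"
  | "session-read"
  | "session-write"
  | "stream-event";

interface GraphNode {
  id: string;
  type: NodeType;
}

interface GraphEdge {
  from: string;
  to: string;
}

interface HarnessGraph {
  nodes: GraphNode[];
  edges: GraphEdge[];
}

// A small agent turn: load the session, check context size, optionally
// summarize, call the model, ask for approval, persist, then notify.
const harnessGraph: HarnessGraph = {
  nodes: [
    { id: "load", type: "session-read" },
    { id: "check", type: "context-check" },
    { id: "compact", type: "summarize" },
    { id: "think", type: "llm-call" },
    { id: "approve", type: "hitl-gate" },
    { id: "save", type: "session-write" },
    { id: "notify", type: "stream-event" },
  ],
  edges: [
    { from: "load", to: "check" },
    { from: "check", to: "compact" },
    { from: "check", to: "think" },
    { from: "compact", to: "think" },
    { from: "think", to: "approve" },
    { from: "approve", to: "save" },
    { from: "save", to: "notify" },
  ],
};

// Returns every edge that points at a node id the graph does not declare.
function danglingEdges(graph: HarnessGraph): GraphEdge[] {
  const ids = new Set(graph.nodes.map((n) => n.id));
  return graph.edges.filter((e) => !ids.has(e.from) || !ids.has(e.to));
}
```

Keeping the graph as plain data makes a structural check such as danglingEdges cheap to run before any node executes.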
Critical baseline commands
- Local baseline:
    npm run spacecraft:harness:design
- Through Starflow:
    npm run sf:spacecraft-harness-design
The baseline does two things:
- Runs targeted harness behavior tests in @celestial/starflow-engine.
- Re-runs the Spacecraft instance smoke flow.
Harness behavior model (inspired by OpenCode and Kilo)
1) Explicit modes (OpenCode-style)
Adopt a mode boundary in harness design:
- build mode: full execution graph (tools, shell, side effects).
- plan mode: read-only or no-op side-effect paths.
Implementation suggestion:
- Represent mode as a context variable (mode=build|plan).
- Gate side-effect nodes with condition expressions and flag-gate nodes.
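The mode boundary can be pictured with a minimal TypeScript sketch. The HarnessContext and HarnessNode shapes, the sideEffect flag, and the node ids are hypothetical, and the condition is modelled as a plain function rather than the engine's actual condition expression syntax.

```typescript
type Mode = "build" | "plan";

interface HarnessContext {
  mode: Mode;
}

// Hypothetical node shape: `condition` stands in for a condition expression.
interface HarnessNode {
  id: string;
  sideEffect: boolean;
  condition?: (ctx: HarnessContext) => boolean;
}

// Gate shared by every node that mutates files or runs commands.
const buildOnly = (ctx: HarnessContext): boolean => ctx.mode === "build";

const harnessNodes: HarnessNode[] = [
  { id: "read-files", sideEffect: false },
  { id: "draft-plan", sideEffect: false },
  { id: "apply-edits", sideEffect: true, condition: buildOnly },
  { id: "run-shell", sideEffect: true, condition: buildOnly },
];

// Ids of the nodes allowed to run under the given context.
function runnableNodeIds(all: HarnessNode[], ctx: HarnessContext): string[] {
  return all
    .filter((n) => (n.condition ? n.condition(ctx) : true))
    .map((n) => n.id);
}

// Design-time guard: side-effect nodes that were left without a gate.
function ungatedSideEffects(all: HarnessNode[]): string[] {
  return all.filter((n) => n.sideEffect && !n.condition).map((n) => n.id);
}

const planIds = runnableNodeIds(harnessNodes, { mode: "plan" });
const buildIds = runnableNodeIds(harnessNodes, { mode: "build" });
```

Here planIds contains only read-files and draft-plan, while buildIds contains all four nodes. A check like ungatedSideEffects can fail a harness test when a new side-effect node is added without a gate.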
2) Specialist subflows (OpenCode subagent pattern)
Split long chains into focused graph segments:
- explore segment (read-only discovery)
- implement segment (edits / execution)
- review segment (validation / diagnostics)
Implementation suggestion:
- Use named nodes and explicit edges to keep transitions inspectable.
- Emit stream-event updates between segments for observability.
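One way to picture the segment split is a small runner that executes named segments in order and emits an event on each side of every segment. This is only a sketch: HarnessSegment, SegmentEvent, and runSegments are hypothetical names, and the emit callback stands in for stream-event nodes placed between segments.

```typescript
type SegmentName = "explore" | "implement" | "review";

interface SegmentEvent {
  kind: "segment-start" | "segment-end";
  segment: SegmentName;
}

// Hypothetical segment shape; `run` stands in for a focused subgraph.
interface HarnessSegment {
  name: SegmentName;
  readOnly: boolean;
  run: (input: string) => string;
}

// Runs segments in order, emitting an event before and after each one so the
// transitions between segments stay observable.
function runSegments(
  segments: HarnessSegment[],
  input: string,
  emit: (e: SegmentEvent) => void
): string {
  let carry = input;
  for (const segment of segments) {
    emit({ kind: "segment-start", segment: segment.name });
    carry = segment.run(carry);
    emit({ kind: "segment-end", segment: segment.name });
  }
  return carry;
}

const harnessSegments: HarnessSegment[] = [
  { name: "explore", readOnly: true, run: (s) => `${s} > explored` },
  { name: "implement", readOnly: false, run: (s) => `${s} > implemented` },
  { name: "review", readOnly: true, run: (s) => `${s} > reviewed` },
];

const segmentEvents: SegmentEvent[] = [];
const finalOutput = runSegments(harnessSegments, "task", (e) => {
  segmentEvents.push(e);
});
```

After the run, finalOutput is "task > explored > implemented > reviewed" and segmentEvents holds six entries, a start and end pair per segment.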
3) Isolated parallel variants (Kilo worktree pattern)
Kilo’s worktree parallelism maps cleanly to Starflow’s parallel step model.
Implementation suggestion:
- Run multiple harness variants in parallel steps with different variables/models.
- Compare outputs in a downstream synthesis node.
- Keep merge/apply as an explicit approval step (hitl-gate equivalent in policy).
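The three suggestions above can be sketched in TypeScript as a fan-out, a synthesis step, and an approval gate. Everything here is an assumption for illustration: the Variant and VariantOutput shapes, the numeric score, and the helper names are hypothetical, and Promise.all stands in for Starflow's parallel steps.

```typescript
interface Variant {
  id: string;
  model: string; // hypothetical model label
}

interface VariantOutput {
  variantId: string;
  text: string;
  score: number; // hypothetical quality score, higher is better
}

type RunVariant = (variant: Variant, prompt: string) => Promise<VariantOutput>;

// Fan out: every variant gets the same prompt and runs independently.
// Promise.all keeps the results in the same order as the input variants.
async function runVariantsInParallel(
  variants: Variant[],
  prompt: string,
  run: RunVariant
): Promise<VariantOutput[]> {
  return Promise.all(variants.map((v) => run(v, prompt)));
}

// Synthesis: pick the highest-scoring output; ties go to the earliest variant.
function synthesize(outputs: VariantOutput[]): VariantOutput | undefined {
  let best: VariantOutput | undefined;
  for (const output of outputs) {
    if (best === undefined || output.score > best.score) {
      best = output;
    }
  }
  return best;
}

// Approval gate: nothing is applied unless the approver explicitly says yes.
function applyIfApproved(
  winner: VariantOutput | undefined,
  approve: (output: VariantOutput) => boolean
): string | null {
  if (winner === undefined || !approve(winner)) {
    return null;
  }
  return winner.text;
}
```

Separating synthesize from applyIfApproved keeps the comparison automatic while the merge/apply decision stays with a human, which is the role the hitl-gate plays in the graph.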
4) Run-script ergonomics (Kilo run button pattern)
Keep a single command to exercise the whole design baseline:
    npm run spacecraft:harness:design
This mirrors Kilo’s “run script per worktree” ergonomics: one reliable path for repeated validation.
Immediate next design milestones
- Add a canonical “agent harness behavior” graph fixture in @celestial/starflow-engine.
- Add a Starflow pipeline that executes two model/strategy variants in parallel.
- Add telemetry assertions for loop count, branch selection, and pause/resume transitions.
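For the telemetry milestone, a minimal sketch of what such assertions could check is shown below. The TelemetryEvent kinds and the summarizeTelemetry helper are hypothetical; the real engine's telemetry format may look quite different.

```typescript
// Hypothetical telemetry events; the names are assumptions for illustration.
type TelemetryEvent =
  | { kind: "loop-iteration"; nodeId: string }
  | { kind: "branch-selected"; nodeId: string; branch: string }
  | { kind: "paused"; nodeId: string }
  | { kind: "resumed"; nodeId: string };

interface TelemetrySummary {
  loopCount: number;
  branches: string[];
  pauseResumeBalanced: boolean;
}

// Folds a trace into the three facts the milestone wants to assert on:
// how often the loop ran, which branches were taken, and whether every
// pause was followed by a matching resume.
function summarizeTelemetry(trace: TelemetryEvent[]): TelemetrySummary {
  let loopCount = 0;
  const branches: string[] = [];
  let openPauses = 0;
  let balanced = true;
  for (const e of trace) {
    switch (e.kind) {
      case "loop-iteration":
        loopCount += 1;
        break;
      case "branch-selected":
        branches.push(e.branch);
        break;
      case "paused":
        openPauses += 1;
        break;
      case "resumed":
        if (openPauses === 0) {
          balanced = false; // resume without a preceding pause
        } else {
          openPauses -= 1;
        }
        break;
    }
  }
  return {
    loopCount,
    branches,
    pauseResumeBalanced: balanced && openPauses === 0,
  };
}

const sampleTrace: TelemetryEvent[] = [
  { kind: "loop-iteration", nodeId: "think" },
  { kind: "branch-selected", nodeId: "check", branch: "summarize" },
  { kind: "loop-iteration", nodeId: "think" },
  { kind: "paused", nodeId: "approve" },
  { kind: "resumed", nodeId: "approve" },
];

const sampleSummary = summarizeTelemetry(sampleTrace);
```

For sampleTrace the summary reports a loop count of 2, a single "summarize" branch, and balanced pause/resume transitions, so a fixture test can compare those three values against what the graph is expected to do.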
Starter scaffold preset
Generate a starter harness package (blueprint + Starflow config):
    npm run spacecraft:harness:scaffold -- my-harness
Or generate from Starflow CLI directly:
    npm exec -w @celestial/starflow-cli -- starflow harness create my-harness --template=tool-agent
This creates:
- .starsystem/harnesses/my-harness/blueprint.yaml
- .starsystem/harnesses/my-harness/starflow.yaml
- .starsystem/harnesses/my-harness/harness.yaml
- .starsystem/harnesses/my-harness/checksums.txt
Then run the scaffolded Starflow config on a dedicated port:
    npm exec -w @celestial/starflow-cli -- starflow server --config=".starsystem/harnesses/my-harness/starflow.yaml" --port=7712
And trigger either mode:
    npm exec -w @celestial/starflow-cli -- starflow run harness-plan --port=7712
    npm exec -w @celestial/starflow-cli -- starflow run harness-build --port=7712
A/B harness evaluation
Run the same prompt suite through two harnesses and capture side-by-side logs:
    npm run spacecraft:harness:ab-eval -- demo-harness demo-harness-tooling
Optional custom prompt file (one prompt per line):
    npm run spacecraft:harness:ab-eval -- demo-harness demo-harness-tooling .starsystem/harness-evals/my-prompts.txt
The script writes a timestamped report under .starsystem/harness-evals/.
It also writes deterministic scoring artifacts in the same run directory:
- scores.json for CI/machine consumption
- scores.md for human review
To re-score an existing run directory:
    npm run spacecraft:harness:score -- .starsystem/harness-evals/<run-id>
Use a custom rubric file (JSON) to tune weights/patterns without editing code:
    npm run spacecraft:harness:score -- .starsystem/harness-evals/<run-id> --rubric ./scripts/harness-score-rubric.json
spacecraft:harness:ab-eval automatically uses scripts/harness-score-rubric.json when present.
To override it for one run:
    RUBRIC_FILE=./my-rubric.json npm run spacecraft:harness:ab-eval -- demo-harness demo-harness-tooling
Lint rubric config before running eval:
    npm run spacecraft:harness:rubric:lint -- ./scripts/harness-score-rubric.json
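To make the idea of weight/pattern rubrics concrete, here is a minimal TypeScript sketch of deterministic scoring and linting. The Rubric and RubricRule shapes, the rule ids, and both helper functions are assumptions for illustration; the actual format of scripts/harness-score-rubric.json and the behavior of the scoring and lint scripts are not shown in this page.

```typescript
// Hypothetical rubric shape; the real JSON format may differ.
interface RubricRule {
  id: string;
  pattern: string; // regular expression source, matched case-insensitively
  weight: number; // added to the score when the pattern matches
}

interface Rubric {
  rules: RubricRule[];
}

// Lint: report duplicate ids, non-finite weights, and patterns that do not compile.
function lintRubric(rubric: Rubric): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const rule of rubric.rules) {
    if (seen.has(rule.id)) problems.push(`duplicate rule id: ${rule.id}`);
    seen.add(rule.id);
    if (!isFinite(rule.weight)) problems.push(`non-finite weight: ${rule.id}`);
    try {
      new RegExp(rule.pattern);
    } catch {
      problems.push(`invalid pattern: ${rule.id}`);
    }
  }
  return problems;
}

// Score: sum the weights of all rules whose pattern matches the transcript.
// The same rubric and transcript always produce the same number.
function scoreTranscript(rubric: Rubric, transcript: string): number {
  let total = 0;
  for (const rule of rubric.rules) {
    if (new RegExp(rule.pattern, "i").test(transcript)) {
      total += rule.weight;
    }
  }
  return total;
}

const sampleRubric: Rubric = {
  rules: [
    { id: "used-tool", pattern: "tool_call", weight: 2 },
    { id: "asked-approval", pattern: "approval", weight: 1 },
    { id: "hit-error", pattern: "error|exception", weight: -3 },
  ],
};

const scoreA = scoreTranscript(sampleRubric, "tool_call: read_file; requesting approval"); // 3
const scoreB = scoreTranscript(sampleRubric, "Exception: timeout"); // -3
```

Because scoring is a pure function of rubric and transcript, re-scoring an existing run directory with a tuned rubric yields comparable numbers without re-running either harness.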