Pipeforge
Personal Claude Code orchestration plugin and PostgreSQL MCP backend. Built from a personal need to coordinate work across multiple machines and parallel worktrees; it has since evolved into the place where I prototype the context-engineering patterns I apply in client work.
This is not a product. It is the infrastructure I use daily. Below: real screenshots from real runs, in execution order. Tracker IDs from a public open-source project (`POE-###`) are kept visible — they are already public on GitHub. Tracker IDs from a proprietary client project have been replaced with `[TSK-000]` placeholders.
What it does, briefly
Two orchestration patterns from different context-window eras: `/pipe` — orchestrator clears, MCP holds state, sub-agents do the work (built to survive the compact-prone 200k era). `/deliver` — orchestrator keeps full context, dispatches from there (viable once 1M removed compact pressure). A configurable `context_window_size` drives chunk sizing independently of pattern.
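A minimal sketch of how that could look, assuming a simplified config shape (every field name besides `context_window_size` is illustrative):

```typescript
// Hypothetical sketch: chunk sizing driven by the configured context window,
// independent of whether /pipe or /deliver is orchestrating.
interface PipelineConfig {
  context_window_size: number; // e.g. 200_000 or 1_000_000 tokens
  reserve_fraction: number;    // headroom kept for tool results and review output
}

interface Task {
  id: string;
  estimated_tokens: number; // rough per-task budget from the planner
}

// Greedily pack tasks into chunks that fit inside the usable window.
function planChunks(tasks: Task[], cfg: PipelineConfig): Task[][] {
  const budget = cfg.context_window_size * (1 - cfg.reserve_fraction);
  const chunks: Task[][] = [];
  let current: Task[] = [];
  let used = 0;
  for (const task of tasks) {
    if (current.length > 0 && used + task.estimated_tokens > budget) {
      chunks.push(current);
      current = [];
      used = 0;
    }
    current.push(task);
    used += task.estimated_tokens;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```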
Layered defence: structured state over fuzzy retrieval, source citations on KB entries, sub-agent adversarial review, verification before completion, heavy-review (up to 5 parallel reviewer agents, derived from the open-source pr-review-toolkit), and hook-level enforcement (e.g. `cd` blocking to prevent worktree drift).
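The `cd` block, for instance, can live in a small PreToolUse hook. A sketch of one plausible shape, assuming Claude Code's hook contract (event JSON on stdin; exit code 2 blocks the call and surfaces stderr to the agent):

```typescript
// Hypothetical PreToolUse hook: reject Bash commands that change directory,
// so agents stay pinned to their assigned worktree.
import { stdin } from "node:process";

let raw = "";
for await (const chunk of stdin) raw += chunk;

const event = JSON.parse(raw) as {
  tool_name: string;
  tool_input: { command?: string };
};

const command = event.tool_input.command ?? "";
// Match `cd` as a standalone command (start of line or after ; && ||).
if (event.tool_name === "Bash" && /(^|[;&|]\s*)cd(\s|$)/.test(command)) {
  console.error("cd is blocked: use absolute paths so the worktree stays fixed.");
  process.exit(2); // exit code 2 blocks the tool call
}
process.exit(0);
```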
Planner output for an epic. Chunk 1 (backend) is the foundation; Chunk 2 (backend-test) waits for it; Chunks 3 / 4 / 6 (frontend, comparator backend, desktop/OCR) can run in parallel after Chunk 1; Chunk 5 (comparator frontend) waits for Chunk 4. The numbered list translates the diagram into execution rules the orchestrator follows.
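As data, that plan is just a dependency graph. A sketch with hypothetical field names:

```typescript
// Hypothetical encoding of the plan above: each chunk lists the chunks it waits on.
interface Chunk {
  id: number;
  area: string;
  dependsOn: number[];
}

const plan: Chunk[] = [
  { id: 1, area: "backend",             dependsOn: [] },  // foundation
  { id: 2, area: "backend-test",        dependsOn: [1] },
  { id: 3, area: "frontend",            dependsOn: [1] },
  { id: 4, area: "comparator-backend",  dependsOn: [1] },
  { id: 5, area: "comparator-frontend", dependsOn: [4] },
  { id: 6, area: "desktop-ocr",         dependsOn: [1] },
];
```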
Lightweight workspace-state visualiser. Three worktrees (main / second / third), each with its active route, current in-progress task, and progress bar (2/6, 0/1, 5/5 done). Plus per-route next-step suggestions. Visual proof of the "parallel worktrees" workflow claim.
Route-discovery output across one workspace. A fresh epic (Desktop App v2, 0/4 with LOW conflict risk and 4 ordered subtasks), a 3-task actionable bundle (Standalone v1.0), and an in-flight resume option (11 fixed, ~16 remain). Status pills (●/●/●) signal readiness; conflict risk surfaces upfront so the operator picks safely.
After single-task mode is confirmed, the plan-reviewer agent dispatches and immediately pulls context from MCP — Pipeforge's `getConfig` and `getDiscoveries`, plus the external tracker integration (YouTrack issue lookup). Layered context loading in action: orchestrator passes only a task ID; the sub-agent self-loads everything it needs.
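A sketch of that hand-off, assuming simplified client bindings (`getConfig` and `getDiscoveries` are the tool names visible in the run; the signatures here are illustrative):

```typescript
// Hypothetical self-loading sequence: the orchestrator passes only a task ID;
// the sub-agent pulls everything else from MCP and the tracker itself.
interface McpClient {
  call(tool: string, args: Record<string, unknown>): Promise<unknown>;
}

async function loadContext(pipeforge: McpClient, tracker: McpClient, taskId: string) {
  const [config, discoveries, issue] = await Promise.all([
    pipeforge.call("getConfig", {}),              // pipeline-wide settings
    pipeforge.call("getDiscoveries", { taskId }), // prior findings scoped to this task
    tracker.call("getIssue", { id: taskId }),     // external tracker lookup (e.g. YouTrack)
  ]);
  return { config, discoveries, issue };
}
```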
Review completed in 1m 10s (10 tool uses, 33.1k tokens). Each finding carries a one-line summary AND a concrete fix recommendation. The orchestrator then offers 4 paths: address all / highs only / proceed anyway / skip task. Implementation gate that catches design issues *before* code is written.
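The gate itself reduces to a small decision surface. A sketch with hypothetical types:

```typescript
// Hypothetical shape of the pre-implementation gate: every finding carries a
// summary and a concrete fix, and the operator picks one of four paths.
interface PlanFinding {
  severity: "HIGH" | "MEDIUM" | "LOW";
  summary: string;
  fix: string;
}

type GateDecision = "address-all" | "address-highs" | "proceed" | "skip-task";

function findingsToAddress(findings: PlanFinding[], decision: GateDecision): PlanFinding[] {
  switch (decision) {
    case "address-all":   return findings;
    case "address-highs": return findings.filter((f) => f.severity === "HIGH");
    case "proceed":       return []; // implement as planned, findings logged
    case "skip-task":     return []; // task parked, nothing dispatched
  }
}
```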
Same project as the dependency graph above, now running. Group 1 backend (POE-65 backend engine) completed — 86 tool uses, 104.5k tokens. The orchestrator then dispatches Group 2 (chunks 2 / 3 / 4 / 6, all depending on chunk 1) plus a light review for chunk 1 — 5 concurrent agents, each tagged with its specialisation (general, backend-test, frontend, backend, general). Colour-coded labels make agent type visually obvious.
Mid-pipeline orchestration. Group 1 (4 implementer chunks: data layer, color API, CSS styles, preferences) reports per-agent token + tool-use accounting. The orchestrator immediately dispatches Group 2 implementers (4 agents) alongside light reviewers for Group 1 (4 more) — dependency-aware planning combined with parallel scheduling.
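Dependency-aware scheduling of this kind is essentially topological batching. A sketch, reusing the hypothetical `Chunk` shape from the planner example above:

```typescript
// Hypothetical wave scheduler: a chunk becomes dispatchable once every chunk
// it depends on has completed; everything dispatchable runs in parallel.
function nextWave(plan: Chunk[], done: Set<number>): Chunk[] {
  return plan.filter(
    (c) => !done.has(c.id) && c.dependsOn.every((dep) => done.has(dep))
  );
}

// With the plan above: wave 1 = [1]; once 1 completes, wave 2 = [2, 3, 4, 6]
// (dispatched alongside a light review of chunk 1); once 4 completes, wave 3 = [5].
```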
Heavy-review output. Five specialised agents finished in parallel (code-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer, type-design-analyzer) with token / tool-use transparency per agent. Findings classified Critical / Important / Suggestion / Deferred per agent, with a module breakdown (`backend 21+1 deferred · backend-test 13+1 deferred`). 0 Critical = merge gate clears. The pattern is derived from the open-source `pr-review-toolkit`, adapted to the Pipeforge agent model.
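The merge gate itself is deliberately blunt. A sketch, assuming findings arrive already tagged per agent and module:

```typescript
// Hypothetical rollup: the merge gate clears only when no reviewer
// reported a Critical finding.
type ReviewSeverity = "Critical" | "Important" | "Suggestion" | "Deferred";

interface ReviewFinding {
  agent: string;    // e.g. "silent-failure-hunter"
  module: string;   // e.g. "backend"
  severity: ReviewSeverity;
}

function mergeGateClears(findings: ReviewFinding[]): boolean {
  return !findings.some((f) => f.severity === "Critical");
}
```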
The auditor's voice on a real refactor. 17 behavioural atoms preserved across both handler moves, 0 missing. Two LOW-severity deltas surfaced with rationale: a string prefix shortening that's *more* correct given the new caller path, and a log message rename that's purely cosmetic. Calibrated, doesn't cry wolf.
The "why this matters" example. During a refactor, an `\InvalidArgumentException` flash key changed in a way that silently merged two distinct user-facing flows into one generic error message. The auditor surfaces: original key, new key, file:line range, the exact UX consequence, and a one-line revert fix. Verdict: 20/21 OK, 1 CRITICAL DRIFT.
Full test-audit output. PHPStan clean, Unit 2674/2674 (8539 assertions), Integration 459/459, Functional 699/699, Smoke 201/201, E2E 157/157 (1 intentional skip). Plus 259 learnings extracted across backend / e2e / frontend / pattern / gotcha — top categories called out for the next `/integrate-agent-learnings` pass.
Two stories the screenshots back up
1 — Caught an Opus 4.7 regression with my own pipeline telemetry
After upgrading to Opus 4.7, identical pipeline shapes — same task graphs, same skills, only the model changed — ran ~2× wall-clock and ~2.3× tokens vs the Opus 4.6 baseline. Eight+ runs on real tasks ruled out variance.
Two reproducible root causes:
- Disk round-tripping of in-memory MCP results. Sub-agents materialised MCP tool responses to the tool-results spool and looped `Read` + `jq` over them instead of using the in-context value. Same file re-read 10+ times inside a single agent run, multiplied across every implement / heavy-review fan-out.
- Literal-instruction adherence overriding session-level config. Per-session overrides ("don't post deferred items to the tracker") were acknowledged and then violated — the agent reverted to the written spec, silently, with plausible-looking output.
Shipped two revertible spec fixes — config-honoured deferred mode and DRY-extracted per-phase metrics capture — and filed the reproducible investigation upstream: anthropics/claude-code#6354.
2 — The fidelity-auditor exists because of a near-miss
During a docs migration, the model silently summarised 491 lines of behavioural specs into 122. Type-checks passed. Tests passed. The result was structurally fine but operationally lossy — the kind of regression that doesn't surface until months later, in production, as "wait, I thought we did X here?"
The fidelity-auditor is the structural fix. On every move / split / extract, it classifies behavioural atoms as Missing / Simplified / Evolved / Preserved, prints severity per row, and demands explicit reasoning on every LOW. It refuses to ship a verdict it can't justify — see the two fidelity-audit panels above (a clean handler refactor with calibrated LOWs, and a real regression caught) for what that looks like in practice.
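A sketch of what that verdict model could look like, with hypothetical type names:

```typescript
// Hypothetical fidelity-audit model: every behavioural atom gets a
// classification and a severity, and no LOW ships without written rationale.
type AtomStatus = "Missing" | "Simplified" | "Evolved" | "Preserved";
type AtomSeverity = "OK" | "LOW" | "CRITICAL";

interface AtomVerdict {
  atom: string; // one behavioural atom, e.g. a flash key or log message
  status: AtomStatus;
  severity: AtomSeverity;
  rationale?: string; // mandatory for anything below OK
}

function verdictIsJustified(rows: AtomVerdict[]): boolean {
  // Refuse to ship a verdict with an unexplained deviation.
  return rows.every((r) => r.severity === "OK" || !!r.rationale?.trim());
}
```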
Appendix — more artefacts
Four additional screenshots covering more of the same pipeline surface. Same redaction policy as above.