Working on the Harness
A few nights ago I sat down to fix some broken data sources in my research project. I spun up a multi-agent session using NTM to explore the bugs and spec out fixes, splitting work across agents with beads tracking dependencies so nothing gets duplicated or dropped.
Pretty quickly I realized the data sources weren't the real problem. The agents kept approaching each source differently, re-inventing the download and processing patterns every time. So I started building skills (reusable prompt templates that get loaded into agents automatically) to standardize the work. Project-level skills for how to pull and process each type of data source. Global orchestration skills for how to coordinate with agent mail and beads consistently, so agents would communicate and claim work the same way without me repeating myself every session.
An hour in, I noticed how annoying it was to export data and context from one repo to the next. So I refactored the whole project into a monorepo.
Shortly after that I rewrote the AGENTS.md so every agent spawned into the project would understand the layout, conventions, and constraints from the first message. An agent with a well-written AGENTS.md will orient itself in seconds. Without one, you spend your first five prompts explaining the same things over and over. More Claude Codes get opened.
A couple hours later I was debugging why I couldn't copy text from my remote server's tmux panes to my local clipboard. Turned out to be a protocol issue with how I was connecting. Switched connection methods and it worked immediately. More Claude Codes get opened.
By 10pm I was wiring Discord channels into project-specific communication pathways for Captain Rusty, my OpenClaw bot for remotely controlling NTM sessions. It runs sandboxed in Docker to protect the VPS from prompt injection. Once the channel routing clicked, I could manage agents from anywhere, including my phone. More Claude Codes get opened.
An hour later I was building an orchestrator agent bootstrapper that wired together all the pieces: NTM session creation, agent spawning with the right AGENTS.md context, palette prompt loading, and beads task graph initialization. Before this, I was doing each of those steps manually every time I started a session.
By midnight all three monitors were covered in Claude Code instances, and of course I had to stop and make sure they were all perfectly tiled before I could keep going. I still hadn't fixed a single data source. I wasn't pressed about it. Every piece of harness I was building would carry over to every future project, not just this one.
I fixed the data sources the next morning. Took about 30 minutes with the stronger harness.
Current Stack
Most of this comes from the community around Jeffrey Emanuel's Agentic Flywheel. Anthropic and OpenAI have been formalizing this kind of work as harness engineering. The community independently discovered and shipped working implementations of several key patterns months before the platforms published their frameworks.
Coordination and task tracking:
- Beads for git-backed issue tracking with dependency awareness. Exposes ready work, supports atomic task claims so ownership is explicit.
- Agent Mail for persistent agent identities, inbox/outbox threads, and file-reservation leases. Agent A is editing config.ts, agent B knows not to touch it.
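To make "ready work", "atomic claims", and "file-reservation leases" concrete, here is a minimal in-memory sketch. It is not how Beads or Agent Mail are implemented (Beads is git-backed, and none of these names come from either tool). `Task`, `Tracker`, `claim`, and `reserve` are hypothetical names I made up to show the shape of the idea.

```python
import threading
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    """Hypothetical stand-in for a tracked issue; not the Beads schema."""
    id: str
    deps: set = field(default_factory=set)
    status: str = "open"  # open -> claimed -> closed
    owner: Optional[str] = None


class Tracker:
    """Toy in-memory tracker; a real one would persist state somewhere shared."""

    def __init__(self, tasks):
        self.tasks = {t.id: t for t in tasks}
        self.leases = {}  # path -> (agent, expiry timestamp)
        self._lock = threading.Lock()

    def ready(self):
        """Open tasks whose dependencies are all closed."""
        return sorted(
            t.id for t in self.tasks.values()
            if t.status == "open"
            and all(self.tasks[d].status == "closed" for d in t.deps)
        )

    def claim(self, task_id, agent):
        """Check-and-set under a lock so only one agent can win the claim."""
        with self._lock:
            task = self.tasks[task_id]
            if task.status != "open":
                return False
            task.status, task.owner = "claimed", agent
            return True

    def close(self, task_id):
        with self._lock:
            self.tasks[task_id].status = "closed"

    def reserve(self, path, agent, ttl=600.0, now=None):
        """Take a time-limited lease on a path; expired leases can be retaken."""
        now = time.time() if now is None else now
        with self._lock:
            holder = self.leases.get(path)
            if holder is not None and holder[0] != agent and holder[1] > now:
                return False
            self.leases[path] = (agent, now + ttl)
            return True


tracker = Tracker([Task("fetch"), Task("parse", deps={"fetch"})])
print(tracker.ready())                    # ['fetch']
print(tracker.claim("fetch", "agent-a"))  # True
print(tracker.claim("fetch", "agent-b"))  # False
print(tracker.reserve("config.ts", "agent-a", ttl=60, now=0))  # True
print(tracker.reserve("config.ts", "agent-b", now=30))         # False
```

The lease expiry is the part that matters in practice: if an agent dies mid-edit, the reservation times out instead of blocking everyone forever.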
Orchestration:
- NTM is the command center. Tmux-based orchestration with a prompt palette (F6: onboard, plan, decompose, swarm_beads, check_mail, status, wrap_up, and more). Where I spend most of my time.
- Yazi and beads viewer as supporting tools for navigation and task visibility within NTM sessions.
Safety:
- UBS for static analysis
- DCG hooks to block dangerous shell commands before they execute
- SLB as a two-person rule for anything irreversible
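The last two items are easy to picture in code. This is a rough sketch of the ideas, not DCG's or SLB's actual rules or interfaces. The deny-list patterns and the `check_command` and `approved` helpers are assumptions of mine for illustration.

```python
import re

# Illustrative deny-list only; these patterns are assumptions, not DCG's rules.
DANGEROUS_PATTERNS = [
    (re.compile(r"\brm\s+-\w*(rf|fr)\w*", re.IGNORECASE), "recursive force delete"),
    (re.compile(r"\bgit\s+reset\s+--hard\b"), "hard reset discards local work"),
    (re.compile(r"\bgit\s+push\b.*(--force\b|\s-f\b)"), "force push rewrites history"),
]


def check_command(command: str):
    """Return (allowed, reason). Blocks on the first matching pattern."""
    for pattern, reason in DANGEROUS_PATTERNS:
        if pattern.search(command):
            return False, reason
    return True, "ok"


def approved(requester: str, approvals: set) -> bool:
    """Two-person rule: at least one approver who is not the requester."""
    return bool(approvals - {requester})


print(check_command("ls -la"))                   # (True, 'ok')
print(check_command("rm -rf /"))                 # (False, 'recursive force delete')
print(approved("agent-a", {"agent-a"}))          # False
print(approved("agent-a", {"agent-a", "human"})) # True
```

A regex deny-list is deliberately blunt. It will block some harmless commands, and for a hook that runs before execution that is the right failure mode.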
Built on top:
- Fly Orchestration Kit adds a deterministic FSM layer with lease-based concurrency and time-bucketed risk budgets. Hard guardrails, not just vibes.
- fly-prompts manages the prompt and skill library powering my palette.
- APR control-plane supports the APR spec refinement loop, which iterates on specs until they're tight enough for agents to execute cleanly.
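One way to picture a time-bucketed risk budget, as a minimal sketch: every risky action costs points, and each time window has a fixed cap. The `RiskBudget` class, the point costs, and the one-hour bucket are all made up for illustration and are not the kit's actual API.

```python
class RiskBudget:
    """Hypothetical budget: a cap on risk points spent per fixed time bucket."""

    def __init__(self, limit: int, bucket_seconds: int = 3600):
        self.limit = limit
        self.bucket_seconds = bucket_seconds
        self.spent = {}  # bucket index -> points spent in that bucket

    def try_spend(self, cost: int, now: float) -> bool:
        """Deterministic check: allow only if this bucket's cap isn't exceeded."""
        bucket = int(now // self.bucket_seconds)
        used = self.spent.get(bucket, 0)
        if used + cost > self.limit:
            return False
        self.spent[bucket] = used + cost
        return True


budget = RiskBudget(limit=5)
print(budget.try_spend(3, now=0))     # True
print(budget.try_spend(3, now=10))    # False, same hour and over the cap
print(budget.try_spend(3, now=3600))  # True, new bucket
```

Passing `now` in explicitly keeps the check deterministic and easy to test, which is the point of having a hard guardrail in the first place.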
The platforms are catching up. Claude Code's agent teams are basically NTM-style tmux orchestration built natively, and their internal task lists are basically beads.
The Loop
Day-to-day:
- Preflight. Pin config and prompt palette so every agent starts with the same constraints.
- Attach session. Create or join an NTM session. Lock objectives, check active beads to avoid duplicate ownership.
- Spec and decompose. Write the spec, break it into beads with dependencies and acceptance criteria, reserve file paths.
- Execute. One agent by default. Swarm only across truly parallelizable surfaces.
- Gate. Run safety and quality checks (UBS, DCG, SLB) before integration.
- Close. Block or resolve beads with reasons, release reservations, hand off.
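The steps above can be read as a small state machine. Here is a minimal sketch of that reading. The `Step` enum and `advance` function are hypothetical names, and the edge from gate back to execute is my assumption about what happens when a check fails.

```python
from enum import Enum


class Step(Enum):
    PREFLIGHT = "preflight"
    ATTACH = "attach"
    SPEC = "spec"
    EXECUTE = "execute"
    GATE = "gate"
    CLOSE = "close"


# Allowed transitions between steps.
ALLOWED = {
    Step.PREFLIGHT: {Step.ATTACH},
    Step.ATTACH: {Step.SPEC},
    Step.SPEC: {Step.EXECUTE},
    Step.EXECUTE: {Step.GATE},
    Step.GATE: {Step.CLOSE, Step.EXECUTE},  # assumption: a failed gate sends work back
    Step.CLOSE: set(),
}


def advance(current: Step, target: Step) -> Step:
    """Move to the target step, or raise if the loop doesn't allow it."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


step = Step.PREFLIGHT
for nxt in (Step.ATTACH, Step.SPEC, Step.EXECUTE, Step.GATE, Step.CLOSE):
    step = advance(step, nxt)
print(step.value)  # close
```

Skipping the gate (say, jumping from spec straight to close) raises instead of silently succeeding, which is the behavior you want when agents are the ones driving.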
Coordination overhead is real. A single focused agent with good context outperforms three tripping over shared state. For a given project I'm usually running 2-5 top-level agents, each spawning its own subagents for research, implementation, and testing.
Most of my active time goes into the spec. Getting it right, nailing the decomposition, making sure acceptance criteria are clear before any agent touches code. It's way easier to debug in planning space than in implementation space. Once the spec is solid, implementation is mostly passive: agents execute, I monitor convergence and course-correct. The better the spec, the less I intervene.
Specs only work if agents can hold onto them though. Context windows fill up, agents compact and quietly drop details. Across sessions nothing carries over unless you build the infrastructure. The agents that night kept re-inventing patterns because they had no memory of already solving the same problem. Skills, AGENTS.md, beads, the prompt palette. All of it exists because agents don't remember. The harness brings them to life.