Plan mode is a crutch

The standard plan-approve-execute cycle ignores four critical properties of the human-agent optimization problem. We derive these properties and show how to design for them.

The bottleneck is not the model. The model can generate, explore, and propose faster than you can read its output. The bottleneck is you.

Every coding agent knows this, which is why they all start the same way: make a plan, get approval, then execute. Front-load the human. Get alignment early. Minimize surprises.

This is wrong.

The noisy channel

You have a vague idea for a feature. It could become a thousand different implementations. A more detailed description narrows the possibilities. A spec narrows them further. Code narrows them to exactly one. Each step from idea to spec to code eliminates alternatives.

Shannon called this entropy, the uncertainty over possible outcomes. A vague idea has high entropy. Working code has zero. A coding agent's job is to get from one to the other.

But the agent can't close the gap alone. Which database? What should happen on timeout? How does this interact with the billing module? Some of those answers live in the human's head. Some don't live anywhere yet. The human's intent is partially formed, full of ambiguity they haven't confronted.

Entropy can only be reduced through communication between the human and the agent.

This communication flows both ways. The human sends specs, answers questions ("use JWT, not sessions"), reviews code ("this should retry, not fail"). Each answer eliminates possibilities. The agent sends proposals, diffs, and elicitation questions ("should this retry or fail?"). Each proposal forces the human to articulate intent they hadn't pinned down yet. The channel doesn't just transmit information. It generates it. An agent's question crystallizes a decision the human hadn't made.

The channel has constraints. The human can only process so many decisions per unit time. Call this capacity $C$. Context switching between unrelated conversations degrades it: part of your attention stays on the previous topic even after you've moved on. Effective capacity:

$$C_{\text{eff}} = \frac{C}{1 + \alpha \cdot s}$$

where $\alpha$ is how often you switch contexts and $s$ is the recovery cost per switch. Three auth questions in a row are cheaper than alternating auth, database, auth.
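To make the formula concrete, a toy calculation (the numbers are illustrative assumptions, not measurements):

```python
# Toy illustration of effective capacity under context switching.
# C = 10 decisions/hour and s = 0.5 are made-up numbers for illustration.
def effective_capacity(C: float, alpha: float, s: float) -> float:
    """C_eff = C / (1 + alpha * s)."""
    return C / (1 + alpha * s)

# Three auth questions in a row: no switching between them.
print(effective_capacity(C=10, alpha=0.0, s=0.5))  # 10.0
# Alternating auth, database, auth: a switch before nearly every question.
print(effective_capacity(C=10, alpha=1.0, s=0.5))  # ~6.67
```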

Now combine the pieces. The agent needs to reduce entropy. The only way to reduce it is through the channel. The channel has finite, degradable capacity. The question becomes: what protocol maximizes entropy reduction per unit of human time? Call this rate $\Phi$, where $H(A)$ is the remaining uncertainty over the final artifact:

$$\Phi = -\frac{dH(A)}{dt}$$

Every approach to human-agent collaboration (plan mode, spec-driven development, flat multi-agent parallelism) is a protocol for solving this. Some are better than others.

The limits

Shannon proved a hard result: reliable communication through a noisy channel is possible up to the channel's capacity, and impossible above it. No protocol can exceed $C_{\text{eff}}$. But most protocols don't come close. They waste capacity through poor encoding. The goal isn't to break the limit. It's to stop leaving bandwidth on the table.

What counts as poor encoding? Consider what happens when you compress intent into a spec or a plan. The vaguer the description, the more possible implementations satisfy it, and the higher the distortion between what the human meant and what the agent builds. Rate-distortion theory quantifies the tradeoff: reducing distortion requires more bits through the channel. You can't get a precise implementation from a vague description without further communication.
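For reference, this is the standard rate-distortion function from information theory, with $X$ the human's intent and $\hat{X}$ the implementation (the mapping onto specs and code is this post's framing, not part of the theory):

$$R(D) = \min_{p(\hat{x} \mid x)\,:\;\mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X})$$

$R(D)$ is nonincreasing in $D$: every bit of rate you withhold (a vaguer spec) is paid for in distortion (an implementation further from intent).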

A plan written before implementation is maximum-compression, minimum-context. It captures what the human knew before touching the code. Ambiguities, edge cases, and interactions with existing modules only surface during execution, when the channel is idle. The distortion is baked in at planning time and paid for at implementation time.

This is the failure mode of plan-then-execute. It front-loads all human-to-agent communication into one burst, then goes silent. During planning, the channel is overloaded: every design decision at once, with minimal implementation context. During execution, the channel is idle. Nothing to steer, nothing to decide. The encoding is bursty: it saturates the channel when the human has the least context, then wastes it when the human could contribute the most.

Plan-then-execute isn't wrong. It optimizes for something real: coherence. One context, sequential decisions, no conflicting edits. But it's a local maximum. It preserves coherence at the expense of channel utilization. The question is what a better encoding looks like.

A better encoding

An orchestrator breaks your feature into three tasks. Agent A starts executing task 1. While it works, the orchestrator researches task 2 (reads the relevant code, tests assumptions, identifies the specific point of uncertainty) and asks you a design question. "Should auth use JWT or session tokens?" You answer in three words. The orchestrator has what it needs. When task 1 merges, task 2 is already human-approved and starts immediately.

Under plan-then-execute, you'd have answered that same question during planning, before any code existed, alongside twenty other decisions, with less context. Same question. Worse answer. Higher cost. And the channel would have been idle for the entire execution phase.

The difference: questions arrive during execution, when the human has the most context to answer them and the channel would otherwise be wasted. There is no phase boundary between research, planning, and execution. The channel never idles.
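A minimal sketch of the pipeline in Python, with stand-in coroutines (`execute`, `research_then_ask`, and `ask_human` are hypothetical placeholders for real agent work):

```python
import asyncio

async def ask_human(question: str) -> str:
    print(question)
    return "JWT"                       # stand-in for a three-word answer

async def execute(task: str) -> str:
    await asyncio.sleep(2)             # stand-in for implementation work
    return f"{task} merged"

async def research_then_ask(task: str) -> str:
    await asyncio.sleep(1)             # stand-in for reading code, testing assumptions
    return await ask_human(f"{task}: JWT or session tokens?")

async def main():
    # Task 1 executes while task 2 is researched and human-approved.
    merged, decision = await asyncio.gather(
        execute("task-1"),
        research_then_ask("task-2"),
    )
    print(merged, "| task-2 design:", decision)   # task 2 can start immediately

asyncio.run(main())
```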

Now scale up. You have three concurrent workstreams: auth, database schema, and frontend. Each generates questions. Under naive scheduling, they arrive in whatever order they're produced: auth question, database question, frontend question, auth question. Each switch costs you. You're holding auth context, then dropping it for database, then rebuilding it two questions later.

Instead: exhaust all auth questions before touching database. Three auth decisions in a row while you're already thinking about auth. Then one switch to database. Then one switch to frontend. The same number of questions, a fraction of the context switches.

This is disk scheduling applied to human attention. Elevator algorithms minimize seek time by ordering I/O requests by physical proximity. Question scheduling minimizes cognitive switching cost by ordering decisions by domain proximity.
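A sketch of that scheduler, assuming questions arrive as (domain, question) pairs (the representation is hypothetical):

```python
from collections import defaultdict

def batch_by_domain(questions: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Exhaust each domain before moving to the next: the number of
    context switches drops to the number of domains, not questions."""
    buckets: dict[str, list[str]] = defaultdict(list)
    order: list[str] = []                      # first-seen order of domains
    for domain, q in questions:
        if domain not in buckets:
            order.append(domain)
        buckets[domain].append(q)
    return [(d, q) for d in order for q in buckets[d]]

naive = [("auth", "JWT or sessions?"), ("db", "Postgres or SQLite?"),
         ("frontend", "SSR or SPA?"), ("auth", "token TTL?")]
print(batch_by_domain(naive))
# auth, auth, db, frontend: two switches instead of three
```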

These aren't tricks. They follow from the channel model. Pipelining keeps $C_{\text{eff}}$ utilized: no idle gaps between planning and execution. Locality batching minimizes $\alpha$: fewer context switches per decision. Both increase $\Phi$: more entropy reduced per unit of human time.

Forking the channel

Pipelining keeps the channel busy within a single workstream. But a single workstream still has dead spots. While the agent writes code, there is nothing to ask. The channel idles until the next uncertainty surfaces.

The fix is to multiply the sources. Three agents working concurrently on auth, database schema, and frontend are each independently cycling through research, design, and execution. When auth is deep in implementation, with no questions to ask, the database agent is in its design phase, surfacing decisions. When database goes quiet, frontend has questions ready.

This is statistical multiplexing applied to human attention. A single workstream is bursty: high demand during design, zero during execution. $k$ independent workstreams, staggered in phase, produce a smoother aggregate demand curve. Each stream generates questions at a bursty rate $\lambda_i(t)$. The aggregate rate $\Lambda(t) = \sum_i \lambda_i(t)$ has lower variance relative to its mean, by the law of large numbers. As $k$ grows, the human channel approaches steady-state utilization. The same principle that makes packet-switched networks more efficient than circuit-switched ones: independent bursty sources, when multiplexed, smooth each other out.
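A toy simulation of the smoothing effect (the burst shape and numbers are invented for illustration):

```python
import random

def bursty_source(steps: int, period: int = 10, burst: int = 5) -> list[int]:
    """One workstream: a burst of questions during design, silence during execution."""
    offset = random.randrange(period)          # streams are staggered in phase
    return [burst if (t + offset) % period < 3 else 0 for t in range(steps)]

def cv(xs: list[int]) -> float:
    """Coefficient of variation: std / mean, i.e. variance relative to mean."""
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5 / m

random.seed(0)
for k in (1, 3, 10):
    streams = [bursty_source(1000) for _ in range(k)]
    aggregate = [sum(col) for col in zip(*streams)]
    print(k, round(cv(aggregate), 2))          # demand smooths as k grows
```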

But multiplexing introduces a tension. More sources means more potential context switches. The human is answering auth, then database, then frontend, then auth again. Exactly the interleaving problem from the previous section.

Forking resolves this structurally.

Each fork owns a single domain and asks its own questions directly. There is no relay through a parent. The parent handles only cross-cutting questions that span multiple forks. So the question stream from each fork is domain-coherent by construction. The human answers a cluster of auth questions, then a cluster of database questions. Multiplexing increases utilization. Domain ownership preserves locality.

The tree structure does something stronger: it minimizes the context distance between consecutive questions. In a flat agent pool, any agent can ask about any topic, so context switches are random. In a tree, questions from agents under the same fork are topically adjacent by construction. The maximum context switch between consecutive questions is bounded by twice the depth: up to the nearest common ancestor, then back down. Most consecutive questions share a recent ancestor, so the typical switch cost is near zero. The deeper the tree, the more fine-grained the domain partitioning, and the smaller the average distance between questions.
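The bound is easy to see in code. A sketch, assuming each question is tagged with the path of the fork that asked it (the tagging scheme is hypothetical):

```python
def tree_distance(a: tuple[str, ...], b: tuple[str, ...]) -> int:
    """Context distance between two fork paths: up to the nearest
    common ancestor, then back down. Bounded by len(a) + len(b),
    i.e. twice the depth."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return (len(a) - shared) + (len(b) - shared)

print(tree_distance(("auth", "tokens"), ("auth", "middleware")))  # 2: siblings
print(tree_distance(("auth", "tokens"), ("db", "schema")))        # 4: across the root
```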

This is the tree analogue of the elevator algorithm. Flat scheduling minimizes seek time by reordering requests. Hierarchical decomposition minimizes it by generating requests that are already spatially local. The tree topology guarantees clustering without an explicit scheduling pass.

Forking also solves a second problem: the orchestrator's own coordination ceiling. A single agent managing $N$ tasks is itself a channel with finite capacity. Its context window fills. It loses track of dependencies. Recursive decomposition (one concern per fork, fork again if the concern is still multi-faceted, stop when a fork owns exactly one) turns the coordination problem into a tree where each node manages a bounded number of children.

The parent's view is a coarse DAG: three to five fork nodes, plus integration nodes that verify the forks compose correctly after they complete. Each fork's view is a fine-grained DAG: ten to fifteen nodes covering research, design, implementation, and review for its single concern. The parent never tracks leaf workers. The fork never worries about sibling forks. This is rate-distortion at every level of the hierarchy: each level accepts less detail in exchange for fewer items to track. Cognitive load stays bounded at every level because each level compresses away everything below it.
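A sketch of the decomposition rule, with a hypothetical Fork type (the fan-out limit of five follows the three-to-five figure above):

```python
from dataclasses import dataclass, field

MAX_CHILDREN = 5                        # each node coordinates a bounded set

@dataclass
class Fork:
    concern: str
    children: list["Fork"] = field(default_factory=list)

    def split(self, concerns: list[str]) -> None:
        """Fork again if the concern is still multi-faceted; stop when
        a fork owns exactly one concern."""
        assert len(concerns) <= MAX_CHILDREN
        self.children = [Fork(c) for c in concerns]

root = Fork("feature")
root.split(["auth", "db-schema", "frontend"])     # parent's coarse view
root.children[0].split(["tokens", "middleware"])  # auth is still multi-faceted
# The parent never tracks these grandchildren; auth compresses them away.
```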

Living state

A plan is a snapshot. It captures what the human and agent knew at one moment, before any code was written, before any edge case surfaced, before any assumption was tested. It decays immediately.

Replace the snapshot with a state machine: a directed acyclic graph where each node is a unit of work (research a question, design a component, implement a module, review a plan) and edges are dependencies. Each node has a status: pending, active, or done. The graph is the protocol's working memory.

When research completes on one facet, the downstream design node unblocks. The agent starts designing before sibling research nodes finish. When a design answer arrives from the human, the agent synthesizes a plan and delegates implementation, even while other design questions are still open. Completed work marks nodes done and unblocks what depends on them. New information adds nodes. The graph evolves as work happens.
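A minimal sketch of that working memory (the node names and three-status scheme come from the description above; the class itself is illustrative):

```python
class LivingDAG:
    def __init__(self) -> None:
        self.deps: dict[str, set[str]] = {}
        self.status: dict[str, str] = {}            # pending | active | done

    def add(self, node: str, deps: set[str] | None = None) -> None:
        self.deps[node] = set(deps or ())
        self.status[node] = "pending"

    def ready(self) -> list[str]:
        """Pending nodes whose dependencies are all done."""
        return [n for n, d in self.deps.items()
                if self.status[n] == "pending"
                and all(self.status[m] == "done" for m in d)]

    def finish(self, node: str) -> None:
        self.status[node] = "done"                  # unblocks dependents

g = LivingDAG()
g.add("research-auth")
g.add("design-auth", {"research-auth"})
g.add("implement-auth", {"design-auth"})
print(g.ready())                 # ['research-auth']
g.finish("research-auth")
print(g.ready())                 # ['design-auth']: unblocked immediately
g.add("research-edge-case")      # new information adds nodes mid-flight
print(g.ready())                 # ['design-auth', 'research-edge-case']
```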

There are no phases. Research, planning, and execution overlap within each concern. The first implementation can start before the last design question is answered. This eliminates the plan-then-execute phase boundary not just at the top level, but at every level of the hierarchy. Each fork maintains its own DAG, each DAG runs its own concurrent lifecycle, and each node transition (pending to active to done) is an entropy reduction event.

The graph never idles because it never waits for a complete picture before acting on the picture it has.

The degenerate case

Plan mode is the degenerate protocol: one source, one burst, one idle phase, one level, zero multiplexing. The human's channel is overloaded during planning, every decision at once with minimal context, then wasted during execution. One orchestrator, no forking, no hierarchy, no concurrent question streams. A snapshot plan that decays from the moment it is approved.

The channel model says the bottleneck is the human. The protocol's job is to maximize entropy reduction per unit of human time. Pipelining eliminates idle gaps. Locality batching minimizes switching cost. Statistical multiplexing via forking smooths the demand curve. Hierarchical decomposition bounds coordination overhead and generates domain-local questions by construction. Living DAGs eliminate phase boundaries.

These are not independent design decisions. They are consequences of taking the channel model seriously.