Simon Willison says your job is to deliver code you have proven to work.
This is even more important when agents are writing most of the code. If you oversee an agent that changes the code, and you don't necessarily read every line, it's imperative that you can prove the change works. Do not hand someone a thousand-line patch written by Claude and ask them to discover whether it works.
Martin Fowler recently called my attention to Margaret-Anne Storey's writing about debt: technical, cognitive, and intent debt. Technical debt lives in code, cognitive debt lives in people, and intent debt lives in artifacts. Those artifacts frequently go stale, and that causes intent drift. The system stops reflecting what we meant to build because the goals and constraints were poorly captured or poorly maintained as the system changed.
I've been thinking a lot about these ideas because I am doing a lot of coding with agents in complex greenfield projects. As much as possible, I am using agents to plan, write, and test code. But I've found that coding agents can prove the code works and still change the system into something I did not intend.
I think other people are bumping into this too. This problem is relevant to anyone working on software now, but it's most relevant for the teams with the smallest human-to-code ratio. Cognitive debt for these teams leads to intent debt later, and intent debt seems to be the worst kind.
I think that cognitive debt will continue to grow, because the "how" will be very hard to keep in your head. But the systems will break down when they move past a single human's ownership of intent - otherwise there is nothing to verify against.
Coding with agents
The standard agent loop is something like request, plan, code, test, review. This catches classic issues like syntax errors, missing imports, broken edge cases, and type mismatches. With targeted prompting, it can verify that a given feature "works" end to end. And with good tests, it catches a surprising amount of functional wrongness.
It does not reliably catch prior intent being overwritten by reasonable new code.
Agents are pretty good at executing the current request. They seem to be quite bad at preserving the plans and decisions you have executed against before. They are especially bad when they encounter previous half-finished work or lazy design, because they latch onto it and extend it, codifying bad abstractions and awkward implementations.
None of these changes is catastrophic on its own. That is why they survive. The code works, the tests pass, and "everything is green". But the system moves a few inches away from representing what you thought you had built, what you expected to remain.
When you write code yourself, some of this shows up as friction. Without agents, the need to refactor, redesign, or reorganize becomes obvious very quickly, because the pain of working around a past design shortcut makes even a small task slow to finish. The code pushes back on its own misimplementation. This is still true when a system written and maintained by humans takes on a lot of cognitive or technical debt - the friction of continuing is greater than the relief of fixing it immediately.
But when agents write most of the code, that feedback disappears. You still direct the work: you spec out a plan, watch the implementation, maybe review parts of the code, and drive the agent toward "what can we prove to work" via normal testing. But the awkwardness is now hidden by how good agents are at making tests pass.
This actually gets worse with the best models I've used, because they're increasingly able to make things work according to the general plan while disrupting abstraction boundaries, introducing complexity, and extending bad patterns. And code review, testing, or specs alone don't seem to be enough, because none of them gives you anything to verify intent against.
So intent needs another place to live. I've developed some principles and a process that tries to solve that problem with promising early results, and I'm calling it Intent-Driven Development (IDD).
A process for intent-driven development
IDD aims to let agents and humans each work according to their strengths.
Agents are very good at repetitive coding and verification; IDD protects against one of their weaknesses, preserving intent.
Humans are good at synthesizing information and injecting taste, and not very good at enjoying repetitive, rote processes like code review or writing tests.
TL;DR
Maintain a human-owned English spec of the system, called SYSTEM.md. Before every change, write a spec diff that says what changes, what doesn't, and how to verify both. Review both the plan and the code implementation of the plan against the spec diff. In addition to the code, review the blast radius of the implementation, and (adversarially) whether the plan turned out to be good after implementation. Fix any issues the reviewers find. Save the process artifacts in a change folder. Update the main English spec only after the implementation proves, against the verification criteria, that it deserves to be the new baseline.
A note on scale: My ideas for IDD are not for big teams with huge surfaces. A single spec is likely still a bad idea for these teams. This is for teams where the intent of the system (but not the implementation details) can be managed by one person or one clear owner per subsystem. I think this will become increasingly common.
What follows is more detail on the process.
SYSTEM.md
Intent-Driven Development starts with a file called SYSTEM.md.
SYSTEM.md is the current English model of the system. It says what the system does, how it works, what boundaries of data ownership or abstraction matter, and what should not quietly change.
This part of the system is owned by the human, and it is the thing against which agents can verify intent. It is the plain-English statement of how the system works. A line belongs in SYSTEM.md if you would care that an agent violated it.
That filter keeps the file honest. A small app might need one page. A database proxy with leases, placement, routing, token fallback, and multiple wire protocols might need more. The length is less important than the specificity, though the length does matter - simpler is better. What matters is whether the file contains the intent you actually want preserved.
The standard is something like, "everything that is necessary, nothing that isn't". If nobody would ever review against a sentence, cut the sentence.
There are lots of reasons why a single file is bad. The biggest is that humans are lazy and people will forget to update it. But agents give us something we did not have before: cheap bidirectional verification. The overall spec can be quickly and cheaply verified against the code. This is a big upgrade from before.
Spec diffs
Every meaningful change starts with a spec diff.
Before changing the code, write the change to the English model. The spec diff says how the intended system is supposed to shift.
The smallest useful spec diff has three parts:
What changes:
- ...
What does not change:
- ...
How we will verify it:
- ...
Here is a made-up example from the kind of infrastructure work I have been doing:
What changes:
- Engine leases return redirect envelopes when another engine owns the lease.
What does not change:
- The proxy does not forward across engines.
- Metadata fast path behavior stays the same.
- Legacy token fallback stays the same.
How we will verify it:
- The proxy re-dials an engine at most once.
- Redirect envelopes include a routable host and port.
- If the engine is unreachable, the old fallback still happens.
Most planning documents describe the new thing. That is useful, but it is not enough. Agents are good at adding the new thing. They seem to be very bad at preserving the old things you forgot to mention, especially in a large system.
The spec diff makes those old things more visible.
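Because the format is so regular, a spec diff is easy to lint mechanically before an agent ever sees it. Here is a minimal sketch in Python; the section names come from the template above, while the placeholder check is my own assumption about what counts as "empty":

```python
# Minimal spec-diff lint: require the three sections from the template,
# each with at least one concrete (non-placeholder) bullet.

REQUIRED_SECTIONS = [
    "What changes:",
    "What does not change:",
    "How we will verify it:",
]

def lint_spec_diff(text: str) -> list[str]:
    """Return a list of problems; an empty list means the diff is well-formed."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if section not in text:
            problems.append(f"missing section: {section!r}")
            continue
        # Take the text after this header, stopping at the next header.
        body = text.split(section, 1)[1]
        for other in REQUIRED_SECTIONS:
            if other != section and other in body:
                body = body.split(other, 1)[0]
        bullets = [line.strip() for line in body.splitlines()
                   if line.strip().startswith("-")]
        if not bullets or all(b in ("-", "- ...") for b in bullets):
            problems.append(f"section {section!r} has no concrete bullets")
    return problems

example = """\
What changes:
- Engine leases return redirect envelopes when another engine owns the lease.
What does not change:
- The proxy does not forward across engines.
How we will verify it:
- The proxy re-dials an engine at most once.
"""
print(lint_spec_diff(example))  # → []
```

A check this dumb still catches the "Make routing better" failure mode before any code is written.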
The loop
The workflow is simple.
First, check that SYSTEM.md still matches the codebase. If the baseline is stale, fix it before planning new work. A stale spec is worse than no spec because it gives the agent a wrong target with confidence.
Then write the spec diff. The human owns this. A human will of course get help from an agent in drafting it, but the human confirms it. This is where taste, product judgment, and design intent enter the process.
Then have an agent turn the spec diff into an implementation plan like normal. The plan should say what files will change, what tests will prove the change, what areas should not be touched, and what assumptions could be wrong.
Then execute a review step: review the plan against the spec diff. This is the first intent checkpoint. If the plan changes the meaning of the diff, stop. Either the plan is wrong or the spec diff needs to change. Do not let the plan quietly redefine the intent.
Then let the agent implement. It should prove the change works in Simon's sense: run the commands, write the tests, show the output, and make the evidence easy to inspect.
Then review the implementation against the spec diff. This is the second intent checkpoint. The question this review answers is whether the code expresses the intended change without rewriting nearby intent. It's become very easy to get agents to write code that works. But they need to verify that it matches the intent.
If it conforms, update SYSTEM.md. The system now has a new baseline. Then save the old SYSTEM.md alongside the new one and write a commits.txt that links the commits to the change records.
These last steps are important. SYSTEM.md is not a wishlist. It is the current state of intent. It has to move after the implementation proves it deserves to become the new baseline.
Evidence and judgment
Tests are evidence. Logs are evidence. Screenshots are evidence. A failing-before, passing-after test is evidence. A commit link is evidence. But a model saying "this looks correct" is judgment.
You need both, but they are not the same thing. Evidence shows what happened. Judgment decides whether what happened matches the intent.
This distinction matters because agents are very good at producing confident judgment. They can tell you a change conforms when what they really have is a plausible story about why it conforms.
A good conformance review should be annoying. It should ask whether the tests prove the verification conditions. It should look for files outside the expected change radius. It should notice new dependencies, changed defaults, renamed concepts, and comments that claim more than the code proves.
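The blast-radius part of that review can be partially mechanized. A sketch, assuming the plan lists allowed path prefixes and you are working in a git repo; the `changed_files` helper and the prefix convention are my own assumptions, not part of any existing tool:

```python
import subprocess

def changed_files(base: str = "main") -> list[str]:
    """Files touched since `base`, as reported by git."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def outside_blast_radius(files: list[str],
                         allowed_prefixes: list[str]) -> list[str]:
    """Return the files that fall outside every allowed path prefix."""
    return [f for f in files
            if not any(f.startswith(p) for p in allowed_prefixes)]

# Example: the plan said only the proxy's routing code and its tests change.
allowed = ["proxy/routing/", "proxy/tests/routing/"]
suspicious = outside_blast_radius(
    ["proxy/routing/leases.go", "engine/placement.go"], allowed)
print(suspicious)  # → ['engine/placement.go']
```

A non-empty result does not prove drift; it tells the reviewer (human or agent) exactly where to look.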
When I am being careful, I ask for three reads.
One read makes the positive case that the plan or implementation matches the spec diff. One read looks around for side effects. One read is hostile and tries to prove the implementation drifted.
I'm calling these review steps:
1. Positive conformance review
2. Negative conformance review
3. Adversarial review
This sounds heavy, but agents are cheap. Intent drift is expensive, and it compounds quickly because every new session will take inspiration from bad code.
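The three reads are just three prompts over the same two artifacts. A sketch of how I structure them; the wording and the section markers are assumptions, and how you actually deliver a prompt to an agent depends on your tooling:

```python
# One instruction per read; each prompt gets the same spec diff and code diff.

READS = {
    "positive": "Argue that this implementation matches the spec diff. "
                "Cite specific code for each 'What changes' bullet.",
    "negative": "Look only for side effects: files, defaults, dependencies, "
                "or behaviors changed beyond the spec diff's blast radius.",
    "adversarial": "Assume the implementation drifted from intent. "
                   "Try to prove it, starting from 'What does not change'.",
}

def review_prompts(spec_diff: str, code_diff: str) -> dict[str, str]:
    """Build one self-contained prompt per read."""
    return {
        name: f"{instruction}\n\n--- SPEC DIFF ---\n{spec_diff}"
              f"\n\n--- CODE DIFF ---\n{code_diff}"
        for name, instruction in READS.items()
    }

prompts = review_prompts("What changes:\n- ...", "diff --git ...")
print(sorted(prompts))  # → ['adversarial', 'negative', 'positive']
```

Keeping the three instructions separate matters: one agent asked to do all three reads at once tends to collapse them into a single agreeable summary.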
Tooling
You can turn this into a lot of tooling.
You can link spec diffs to commits, link commits to tests, or link tests to evidence. You can index the codebase and map acceptance criteria to symbols. Some of that will probably be useful.
I want the smallest version first.
A .intent/ directory, a SYSTEM.md, one file per spec diff, and a commits.txt file gets pretty far.
.intent/
SYSTEM.md
phases/
001-redirect-leases/
spec-diff.md
plan.md
reviews.md
commits.txt
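A few lines of Python are enough to scaffold that layout. The folder and file names below come from the tree above; everything else, including the numbering scheme, is an assumption:

```python
import pathlib, tempfile

def new_phase(root: pathlib.Path, number: int, name: str) -> pathlib.Path:
    """Scaffold one change folder under .intent/phases/ with empty artifacts."""
    phase = root / ".intent" / "phases" / f"{number:03d}-{name}"
    phase.mkdir(parents=True, exist_ok=True)
    for artifact in ("spec-diff.md", "plan.md", "reviews.md", "commits.txt"):
        (phase / artifact).touch()
    return phase

# In a real project, root would be the repo root; a temp dir keeps the demo clean.
phase = new_phase(pathlib.Path(tempfile.mkdtemp()), 1, "redirect-leases")
print(phase.name)  # → 001-redirect-leases
```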
I have started with text files, but I might add tooling when the text files hurt.
The test for a new artifact is whether it reduces ambiguity enough to justify being maintained. If it mostly creates another thing to keep in sync, it is probably making the problem worse. These artifacts have been good to me so far.
Where this breaks
Intent-driven development breaks when SYSTEM.md becomes fiction.
The worst version is when an agent implements something slightly wrong and then updates SYSTEM.md to describe what it built. That is not preserving intent. That is laundering drift into the spec.
It also breaks when the spec diff is vague. "Make routing better" is not a spec diff. A spec diff should be sharp enough that an agent can be wrong against it.
It breaks when a cross-cutting change is too large to review as one move. The answer is not to ask an agent for a more heroic review. The answer is to split the change into smaller semantic diffs.
It breaks when you skip the English step because you are excited to ship. The history gets a gap, and gaps compound.
None of this removes the need for taste. It gives taste a place to act before the code exists.
In fact, test-driven development and spec-driven development fit well into this process.
The point
Agents make the planning, coding, and verification parts of coding cheap.
What is getting more expensive is deciding what the system is, proving that changes match that decision, and keeping the decision alive long enough for the next change.
Technical debt will always exist - even great coders take shortcuts. Cognitive debt will never go away: the amount of code each human is responsible for will continue to grow.
That is why I think intent needs to become a stronger part of the development loop as a small English model of the system, changed deliberately, checked against the code, and updated only after the implementation proves it deserves to be the new baseline.
As we move ever closer to agents writing 100% of our code, software development will become about managing the controlled evolution of intent in English.
That is my goal for Intent-Driven Development.