The architecture review is the engagement.

Why half of the work in fixing a system is reading it carefully before changing anything — and why the read is the part that pays for itself.

The first engagement we ever ran ended with us not writing a single line of code. The team had a billing service that produced wrong invoices about twice a week. Their best engineers had spent three months on it. We were brought in to help fix the service. For ten days we read instead.

We read the codebase, the support tickets, the post-mortems. We talked to four people for an hour each. On the eleventh day we sent a four-page document that said, in effect: the billing service is fine. The wrong invoices come from a warehouse job that silently truncates decimals before billing ever sees them.

They fixed it the same week. We were paid for two weeks and went home. It was the most useful engagement we did that year.

Reading is the unglamorous part

Most of the systems we are asked to fix do not need new architecture. They need someone to read what is already there carefully enough to write down what it actually does. That work does not produce a diagram of a future state. It produces a paragraph.

If you cannot write down the system in a paragraph, the system is not done.

— Kernwise · Engineering notes 026

The discipline we have settled on is to spend the first two weeks of every engagement reading and writing, and to refuse, gently, every suggestion that we should just start. The cost of starting before you have read is small in week one. By week six it is the whole engagement.[1]

What we actually do in those two weeks

  • Read the codebase, beginning at the entry point of the surface you care about.
  • Read the last six months of incident write-ups and what changed afterward.
  • Talk to four to six engineers individually. Ask each the same three questions.
  • Write a single document that says, in the team's own voice, what the system is.
  • Show the document back to the team and let them mark it up.

The decimal bug was a one-line fix once we found it. Finding it meant tracing a value across two services that did not log it. Here is the change that shipped, reduced to its essence:

warehouse/export.ts
export function exportLine(qty: number, unitPrice: number): Line {
  const amount = qty * unitPrice;
- return { amount: Math.trunc(amount) };
+ return { amount: Math.round(amount * 100) / 100 };
}
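
To make the difference concrete, here is a minimal sketch that puts the old and new behavior side by side. The function names, the single-field Line shape, and the integer-cents variant are our assumptions for illustration; they are not taken from the team's code.

```typescript
// Hypothetical shape: the diff above only shows a Line with an `amount` field.
interface Line {
  amount: number;
}

// Behavior before the fix: Math.trunc drops the fractional part entirely,
// so 9.50 leaves the warehouse as 9.
function exportLineTruncated(qty: number, unitPrice: number): Line {
  return { amount: Math.trunc(qty * unitPrice) };
}

// Behavior after the fix: round to two decimal places.
function exportLineRounded(qty: number, unitPrice: number): Line {
  return { amount: Math.round(qty * unitPrice * 100) / 100 };
}

// How much a single line loses to truncation, relative to the rounded amount.
function lostToTruncation(qty: number, unitPrice: number): number {
  const rounded = exportLineRounded(qty, unitPrice).amount;
  const truncated = exportLineTruncated(qty, unitPrice).amount;
  return rounded - truncated;
}

// Assumed alternative, not what shipped: carry money as integer cents so that
// no fractional value crosses the service boundary at all.
function exportLineCents(qty: number, unitPriceCents: number): { amountCents: number } {
  return { amountCents: qty * unitPriceCents };
}

// Two units at 4.75 each: the old export reports 9, the new one 9.5.
const before = exportLineTruncated(2, 4.75).amount;
const after = exportLineRounded(2, 4.75).amount;
const lost = lostToTruncation(2, 4.75);
```

One caveat if you borrow the pattern: multiplying by 100 and rounding still runs on binary floating point, so an input such as 1.005 can round down rather than up. Where that matters, something like the integer-cents variant is probably the safer choice.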

Why this works

It works because the model of the system the team holds, after two years of shipping, is almost never the model the system actually obeys. There is no malice in this; it is the nature of systems built under pressure. The map and the territory diverge.

A two-week read by someone with no prior context produces a new map. The team reads it and either agrees, in which case the next steps are obvious, or disagrees, in which case the conversation that follows is the most useful one they have had in a quarter.

[Figure: nodes labeled order, warehouse, billing, invoice]
Fig 1. The same request: billing's model on the left, the system's actual path on the right.

When it does not work

It does not work when the team already knows what the system does but leadership does not. In that case the document is not for engineering; it is for the people above them, and the work shifts from finding the problem to making it legible.

And it does not work when the actual problem is the team, not the system. We do not do org work. When we find that, we say so, and we stop.

Footnotes

  1. The four-page document mentioned in the opening came out of the engagement that taught us this. We have used the same format on every engagement since.