Fix-it engagements

A fix-it engagement starts with a sentence that names a real failure. The p99 on checkout went from 240 ms to 1.8 s three weeks ago and nobody changed checkout. One worker in fifty crashes on the same input and we cannot tell which input. The monthly bill on one service tripled and the traffic did not. A nightly job has been writing rows with a balance that does not reconcile, and it has been doing it quietly for longer than anyone wants to admit.

These are not architecture problems. They are bugs — specific, reproducible eventually, fixable. The team usually knows roughly where it lives. What they have run out of is uninterrupted time and a fresh pair of eyes that has not already formed a theory about the cause.

What this is, and what it is not

The architecture review is reading. We spend two weeks understanding a system and hand back a map. Nobody's code changes. A fix-it engagement is the other shape entirely. We come in with one broken thing, we put our hands in the system, and we leave with that thing fixed and merged.

Because the scope is one problem, the engagement is short and the outcome is binary. Either the latency is back under budget, the worker stops crashing, the bill comes down, the rows reconcile — or it does not, and we say so plainly. There is no version of this work that ends in a slide reading further investigation recommended. That sentence is what you came to us to avoid.

How we work a single fault

The method is narrowing. We start with the whole system as the suspect and cut the space of possible causes roughly in half at each step, until what is left is small enough to read line by line. The expensive part is almost never the fix. It is building a reproduction you can trust, because a fault you cannot summon on demand is a fault you cannot prove you have fixed.

Fig 1A fix-it engagement is binary search on the cause. Each measurement halves what is left, until the remaining space is small enough to read directly.

In practice the week tends to go like this.

Reproduce the failure on a machine we control, even if it takes two days. Nothing else starts until this works.
Add the measurement that was missing — the timing, the counter, the log line on the path nobody instrumented.
Bisect: by commit if it appeared on a date, by input if it appeared on a payload, by load if it appeared under traffic.
Read the small remaining region until the cause is named in one sentence, not a list of candidates.
Fix it, with a test that fails before the change and passes after, so the fix is provable and not a coincidence.

The latency regressions are usually the cleanest, because latency leaves a trace. A flame profile under realistic load almost always shows the truth directly: one frame is wide that should be thin, and the work it is doing is work that used to be cached, or used to be batched, or used to not happen at all on that path.

Fig 2A flame profile of the slow path. One frame is far wider than it should be — that is the regression, sitting under a function nobody suspected.

A worker that crashes one time in fifty needs a different instrument: capture the exact input that triggered it, not a summary of it, because the bug is almost always in the one byte the summary threw away. Here is the change that usually ends that class of engagement — log the payload before the parse, not after the crash.

 def handle(message: bytes) -> None:
-    record = parse(message)
-    process(record)
+    try:
+        record = parse(message)
+    except ParseError:
+        # Keep the exact bytes that broke us. A reproduction is worth
+        # more than a stack trace; the trace tells you where, the bytes
+        # tell you why.
+        deadletter.put(message)
+        raise
+    process(record)

The crash stops being random the moment its input stops being lost. After that, fixing the parser is an afternoon.

The decision we make in the first two days

Not every problem that arrives as a fix-it engagement is one. Some are reproductions of a design that is doing exactly what it was built to do, just at a scale it was never built for. When we find that in the first two days, we stop and say so, because the honest answer changes what you should buy.

This is the one judgement call that matters in this kind of work. A team that has lived inside a problem for weeks is, understandably, ready to accept any change that makes the symptom go away. Part of what you are paying for is someone who has not lived inside it, who can tell the difference between a fix and a reprieve, and who has no incentive to call the second one the first.

A bug you cannot reproduce on demand is not a bug you have fixed. It is a bug that has gone quiet.

— Kernwise · Engineering notes 029

What you keep

Three things, all of which outlast us.

The first is the fix itself, merged, with a test that holds the line. The second is the reproduction — the script, the recorded input, the load profile — checked into your repository, so the next time something in that neighbourhood breaks, the team starts from a working harness instead of from zero. The third is a short note, usually under a page, that says what was actually wrong in plain language: not the symptom you reported, but the cause we found, and why it had been invisible.1

That note is the part teams are surprised to value most. The fix solves today's problem. The note is what stops the same misunderstanding from generating next quarter's.

Start a conversation →

Footnotes

Often the gap between the two is the whole story. The reported symptom — slow checkout — and the real cause — an unbounded config reload on the hot path — are far enough apart that no amount of staring at checkout would have closed the distance. ↑