Cutting p95 checkout latency to 340 ms
The team was sure the database was the problem. It was not. The trace said so in the first afternoon, and the fix touched four lines of application code and one index.
- Client
- [example]
- Team
- Two from Kernwise, four from the platform team
- Outcome
- p95 1.4 s → 340 ms
- Engagement
- Six weeks · fixed scope
The checkout worked. Orders went through, money moved, nothing was on fire. But the last step — the one between the customer pressing pay and the confirmation screen — took 1.4 seconds at p95, and on mobile, over a real network, that was long enough that a measurable fraction of customers tapped twice or gave up. The team had spent two sprints adding read replicas and tuning Postgres, and p95 had moved from 1.5 seconds to 1.4. They brought us in to make the database faster.
We did not make the database faster. The database was not the problem.
The problem as we found it
The first thing we asked for was a trace. Not an average, not a dashboard — one real checkout request, captured end to end, with every span. They had the instrumentation; they had simply never looked at a single request, only at aggregates. The aggregate said "checkout is slow." The single request said exactly where.
Four spans dominated. Auth was 14 milliseconds, fine. The payment submit was 70 milliseconds, fine. Between them sat a cart re-fetch at 820 milliseconds and a tax computation at 1,100 milliseconds that overlapped it. Together those two spans were 93 percent of the wall-clock time, and neither of them touched the primary database in the way the team assumed.
The constraint we worked under
This was a fixed-scope engagement: six weeks, no new services, no schema migration that required downtime. The checkout could not regress in correctness — a single wrong total is worse than a slow one — and the team needed to own the change after we left, so nothing clever that only we could maintain.
The approach
The trace made the order of work obvious: fix the two slow spans, in the order of how much they cost, and measure after each change rather than batching them.
The changes that shipped
1 — Collapse the cart re-fetch
The cart was read once to render the page and re-read at submit "to be safe." The re-read pulled every line item with a correlated subquery per row. It was safety theatre: the cart had not changed between render and submit in any real flow. We removed the re-fetch and trusted the cart the client already held, validating only the total server-side.
-- on submit, per line item:
- SELECT *, (SELECT price FROM prices WHERE sku = li.sku
- ORDER BY effective_at DESC LIMIT 1) AS unit_price
- FROM line_items li WHERE li.cart_id = $1;
+ SELECT total_cents FROM carts WHERE id = $1; -- validate the held total onlyThat one change took the cart span from 820 milliseconds to 24.
2 — Precompute the tax bracket
Tax was computed by walking every jurisdiction rule at request time. The rules changed roughly monthly; the request path recomputed them on every checkout. We moved the bracket lookup behind a table that a nightly job refreshes, so the request reads one row instead of evaluating a ruleset. The application change was a single call site swapped to a taxBracketFor(zip) lookup.
3 — Add the index the re-fetch had been hiding
With the re-fetch gone, the remaining cart read had no supporting index because the old query had masked the need for one. A single composite index on carts (id, updated_at) took the last 40 milliseconds off.
4 — Stop awaiting the analytics write
The confirmation response awaited a write to the analytics pipeline that the customer never sees. We moved it off the response path so the confirmation returns as soon as the order is durable.
The outcome
Each change was measured on its own before the next went in, so the attribution is real, not a guess.
| Metric | Before | After | Δ |
|---|---|---|---|
| p50 checkout | 910 ms | 180 ms | −80% |
| p95 checkout | 1,400 ms | 340 ms | −76% |
| p99 checkout | 2,100 ms | 520 ms | −75% |
| Net new services | — | 0 | — |
| Net new dependencies | — | 0 | — |
The fastest request is the one you do not make. Most of this engagement was deleting work, not adding it.
What we took from it
The lesson was not about Postgres, and it was not about checkout.1 It was that the team had been optimising the thing they could see on the dashboard rather than the thing a single trace would have shown them in an afternoon.
- Capture one real request before you tune anything. The aggregate tells you there is a problem; the trace tells you where.
- The slowest span is rarely where the team's attention already is.2
- Deleting work beats speeding it up. Two of the four changes removed code.
- Measure after each change, not after the batch, or you will never know which one mattered.