A backend is two things wearing one name. There is the data model — the tables, the columns, the relationships someone drew when the product was younger and smaller. And there is the way the application actually reads and writes that data, which nobody drew, because it grew one endpoint at a time. For the first year or two these two things stay close. Then they drift apart, quietly, and the gap between them is where almost every backend problem you will have actually lives.
You feel the drift before you can name it. A page that used to load in eighty milliseconds now takes four hundred. A nightly job that finished by 2am now finishes by 6am, then by 9am, then it is still running when people log in. A service that was added to make one feature faster is now in the path of every request and nobody remembers why. None of these are bugs. The code is correct. The model and the usage have simply moved apart, and the cost of that gap is now large enough to notice.
What organic growth does to a data model
Every data model starts as a fair description of the problem. A user has orders. An order has line items. A line item points at a product. You can hold the whole thing in your head, and the queries write themselves.
Then the product changes faster than the model can be redrawn. Someone needs the order total on a list view, so it gets denormalised onto the order row to avoid a join. Someone needs to soft-delete instead of delete, so a nullable deleted_at appears and now every query carries a WHERE deleted_at IS NULL that the indexes were never designed for. Someone needs a status, so a status column arrives as free text, and eighteen months later there are nine spellings of "cancelled" in production.
None of these decisions were wrong on the day they were made. Each one solved a real problem with the smallest change available. The model did not decay because anyone was careless. It decayed because it was used, and use leaves marks. By the time it hurts, the model encodes a dozen of these local fixes, and the shape that made sense for the original problem is buried under the shape that accreted around how the product was actually run.
Tracing the path a real request takes
We do not start with the schema. We start with the requests that matter — the three or four read paths and the one or two write paths that account for most of the load — and we follow each one all the way down, through every service it touches and every query it issues, including the queries nobody mentions because they are hidden behind an ORM or a cache that misses more often than anyone thinks.
- Pick the endpoints by load and by pain, not by how interesting they are.
- Capture the real SQL each one issues, against production-shaped data, not a clean test database.
- Count the round trips — the N+1 that looks like one call in the code and is four hundred in the database.
- Read the query plans on the actual indexes, at the actual row counts, not the ones from launch.
- Follow each write to every place that is supposed to keep a derived value in sync, and find the place that does not.
it is probably the database is what teams tell us in the first conversation, and they are usually half right. The database is where the pain shows up. It is rarely where the pain starts. The pain starts in a query path that fans a single user action across more services and more round trips than anyone intended, and the database is simply the last component honest enough to be slow about it.
By the end of the first week we can usually draw the diagram above for your real traffic, with real numbers on each edge. The picture is almost never what the team expected. The service everyone worried about is fine. The service nobody thinks about — the audit log, the pricing lookup, the thing that was supposed to be cheap — is the one in the hot path issuing the round trip that does not need to exist.
Why 10x held and 100x does not
A query path that fans across four services and issues a handful of extra round trips is invisible at low volume. At ten times your launch load it is still fine, because the absolute numbers are small and the database has headroom. The cost is linear, the slope is gentle, and nothing in the dashboards complains.
The reason 100x feels like a wall rather than a slope is that the costs you ignored were never linear. A missing index turns a lookup from logarithmic into linear, and linear against a table that grew a hundredfold is a hundred times slower, not a little slower. An N+1 that issued forty queries per request now issues forty against a connection pool that has run out of connections, so requests queue, and the queue is where latency goes from annoying to fatal. A lock that was held for two milliseconds and never contended is now contended on every write, and contention does not add — it multiplies.
| What the path does | At 10x | At 100x | Why it breaks |
|---|---|---|---|
| N+1 read on the list view | 40 queries · 60ms | 40 queries · pool exhausted | Connections, not rows |
| Unindexed status filter | 12ms scan | 1,100ms scan | Linear on row count |
| Lock on the counter row | 2ms, uncontended | Serialised writes | Contention multiplies |
| Stale denormalised total | Wrong in 0.1% of rows | Wrong in 0.1% of rows | Not load — correctness |
The last row is the one that catches teams out. Some of what you find under load is not a performance problem at all — it is a correctness problem that load merely made visible, because the only reason nobody noticed the wrong total before was that nobody looked at enough rows. Scale does not create these faults. It stops hiding them.1
The smallest change that holds
Once the path is on paper, the fix is usually smaller than the team feared and more specific than they hoped. We are not looking for the best possible backend. We are looking for the one change that buys you the most headroom for the least disruption, so you can ship it this quarter instead of betting the year on a rewrite.
More often than not, the change is a composite index that matches the filter the hot path actually uses, plus a single query rewritten to fetch its children in one round trip instead of forty. That is it. The model does not need to be redrawn; it needs one or two of its accidents undone where they sit in the path that carries the load.
-- The list view filters on (account_id, status) and orders by created_at.
-- The only index was on account_id, so every list scanned and re-sorted
-- the account's full order history. This matches the real access pattern.
CREATE INDEX CONCURRENTLY idx_orders_account_status_created
ON orders (account_id, status, created_at DESC)
WHERE deleted_at IS NULL;When the change is in the application rather than the schema, it is usually collapsing a fan-out the ORM hid. The diff is small; the effect is not.
def list_for_account(account_id: str) -> list[Order]:
- orders = Order.query.filter_by(account_id=account_id).all()
- for order in orders:
- order.lines = Line.query.filter_by(order_id=order.id).all()
- return orders
+ return (
+ Order.query
+ .filter_by(account_id=account_id, deleted_at=None)
+ .options(selectinload(Order.lines))
+ .all()
+ )What you get is the same shape as everything else we do: a written document. The hot paths, drawn with real numbers. The places where the model and the usage have drifted apart. And a short, ordered list of changes — the first one being the one that, on its own, buys back most of the headroom you have lost.
A backend does not need to be elegant. It needs the queries you actually run to match the model you actually have. Most of the time, that is two indexes apart.
When this is the wrong call
This work is the wrong call when you already know the slow query, you already know the fix, and you simply have not had a free fortnight to ship it. You do not need anyone to trace your paths for you; you need the time, and you should take it.
It is also the wrong call when the bottleneck is genuinely the hardware — when the model fits, the queries are sound, the indexes are right, and the box is simply too small. That is a real situation, and the answer is a bigger box or a read replica, not a consultant reading your schema. We will recognise it quickly, tell you plainly, and not bill you for the rest of the fortnight.
Where we are worth the money is the case in the middle: the backend that is slow, and nobody on the team agrees on why, because everyone is looking at the place the pain surfaces rather than the path that produced it. Naming that path, with numbers, is the work.
Footnotes
-
This is why we always read against production-shaped data. A correctness fault that shows up in one row in a thousand is invisible in a test fixture of fifty rows, and certain to be in the data the moment the table is large enough to matter. ↑