Platform engineering

There are two kinds of platform. One speeds a team up: a developer opens a pull request, the checks come back in a few minutes, the deploy is so boring nobody watches it, and when something looks wrong the dashboards answer the question instead of raising new ones. The other kind taxes every change. The tax is quiet — nobody has a meeting about it — but it shows up as the half-day a small fix takes, the deploy everyone schedules for a Friday afternoon and then dreads, the staging environment that disagrees with production in ways that only surface after release.

Both platforms have CI, deploys, observability, and environments. The difference is not which tools are installed. It is how much each change costs to push through them, and whether anyone has measured that cost honestly.

The cost nobody puts a number on

Ask a team how long it takes to get a one-line change into production and you will get an answer with no evidence behind it. "Maybe an hour" usually means twenty minutes of work and forty minutes of waiting, retrying, and asking someone in a different channel to approve the thing. The waiting is the part that compounds, and it is the part nobody tracks.

So we track it. For the first week we instrument the path a change takes, from the commit to the moment it serves real traffic, and we record every step with a timestamp. Not the happy-path number from a slide — the real distribution, including the retries and the manual gates and the time a build sat in a queue.

Step	Stated	Measured (p50)	Measured (p90)
CI checks	~5 min	11 min	34 min
Review and approve	—	3 h 40 min	19 h
Deploy to staging	~10 min	22 min	1 h 5 min
Promote to production	~10 min	48 min	3 h 20 min

The numbers above are the shape we see, not anyone's actual data. The pattern repeats: the steps a team can name are roughly accurate, and the long pole is almost always a step nobody counted — a flaky test suite that gets re-run twice, a promotion that waits on a person who is in a meeting, an approval queue with no owner.
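
The measurement itself does not need much tooling. As a minimal sketch, assuming each step of a change is logged with a start and finish timestamp, the p50 and p90 per step and the long pole can be computed as below. The event format, the sample data, and the function names are all hypothetical, chosen for illustration only.

```python
import math
from collections import defaultdict
from datetime import datetime

# Hypothetical event log: (change id, step, started, finished).
# "started" is when the step was requested, so queue time and retries count.
EVENTS = [
    ("c1", "ci", "2024-01-08T09:00:00", "2024-01-08T09:10:00"),
    ("c1", "review", "2024-01-08T09:10:00", "2024-01-08T12:10:00"),
    ("c2", "ci", "2024-01-08T10:00:00", "2024-01-08T10:12:00"),
    ("c2", "review", "2024-01-08T10:12:00", "2024-01-08T11:12:00"),
    ("c3", "ci", "2024-01-08T11:00:00", "2024-01-08T11:40:00"),
    ("c3", "review", "2024-01-08T11:40:00", "2024-01-08T19:40:00"),
]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(events):
    """Return {step: (p50_minutes, p90_minutes)} from the raw event log."""
    minutes = defaultdict(list)
    for _change, step, started, finished in events:
        elapsed = datetime.fromisoformat(finished) - datetime.fromisoformat(started)
        minutes[step].append(elapsed.total_seconds() / 60)
    return {step: (percentile(v, 50), percentile(v, 90)) for step, v in minutes.items()}


def long_pole(summary):
    """The step with the largest median wall-clock time."""
    return max(summary, key=lambda step: summary[step][0])


summary = summarize(EVENTS)
for step, (p50, p90) in summary.items():
    print(f"{step}: p50={p50:.0f} min, p90={p90:.0f} min")
print("long pole:", long_pole(summary))
```

The sketch ranks steps by median because that is the cost a typical change pays; the p90 shows how bad the tail gets when a build queues or an approver is away.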

What we change first

Once the path is measured, the order of work picks itself. We do not start with the tool everyone complains about. We start with the step that adds the most measured time per change, because that is where an hour of our work returns the most.

In most engagements the first three moves are some version of these.

Make CI trustworthy before making it fast. A flaky suite is slower than a slow one, because every red run that turns green on retry trains the team to ignore red. We quarantine the flakes, fix or delete them, and get the suite to a state where red means stopped.
Cut the wait, not just the compute. Parallelise the checks that block a merge, cache what is safe to cache, and move the slow, non-blocking checks off the critical path so they run after merge instead of before it.
Make the deploy boring. One command, the same in every environment, reversible in under a minute, and observable while it happens. A deploy nobody is afraid of is a deploy people do small and often, which is the only way deploys stay safe.
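
Finding the flakes to quarantine can start from CI history alone. The sketch below assumes one working definition of flaky: a test that has both failed and passed on the same commit, which is what a red run turning green on retry looks like in the data. The record format and test names are hypothetical.

```python
from collections import defaultdict

# Hypothetical CI history: (commit, test name, passed?), one row per run.
RUNS = [
    ("abc123", "test_login", False),
    ("abc123", "test_login", True),      # same commit, green on retry
    ("abc123", "test_checkout", False),
    ("abc123", "test_checkout", False),  # fails every time: broken, not flaky
    ("def456", "test_search", True),
]


def find_flaky(runs):
    """Return tests that both passed and failed on at least one commit."""
    outcomes = defaultdict(set)
    for commit, test, passed in runs:
        outcomes[(commit, test)].add(passed)
    flaky = {test for (_commit, test), seen in outcomes.items() if seen == {True, False}}
    return sorted(flaky)


print(find_flaky(RUNS))
```

A test that fails consistently is deliberately left out: it is a real red, and it should keep stopping the merge.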

Here is the kind of change that removes a wait without removing a check — the slow integration suite still runs, it just stops blocking the merge:

 jobs:
   unit:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
       - run: make test-unit        # fast, blocks merge

-  integration:
-    needs: unit
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - run: make test-integration  # 14 min, blocks every merge
+  integration:
+    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - run: make test-integration  # runs post-merge, pages on failure

The change is small and the effect is not: every pull request stops paying for a fourteen-minute suite that catches a class of bug a post-merge run catches just as well, with a page if it ever goes red. [1]

A pipeline you can read

The other half of the work is making the platform answer questions. A deploy timeline should be readable at a glance — what changed, when it went out, and whether the graphs moved after it did. When that picture exists, an incident starts with a fact instead of a guess.

Fig 1. Before and after: the same change reaching production. The wall-clock cost moves from hours dominated by waiting to minutes dominated by actual work.

Observability is part of this, and it is the part teams most often get backwards. The goal is not more dashboards. The goal is that the dashboards you already have answer the questions you actually ask during an incident — is it us or upstream, is it everyone or one customer, did it start when we deployed. If a graph does not help answer one of those, it is noise, and we take it down.
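
The question "did it start when we deployed" is the easiest of the three to answer mechanically once deploys are recorded with a timestamp. As a rough sketch, assuming a list of deploy records and a known incident onset time, the code below returns the most recent deploy inside a lookback window. The record shape, the service names, and the thirty-minute default are assumptions for illustration, not a recommendation.

```python
from datetime import datetime, timedelta

# Hypothetical deploy timeline: what went out, and when.
DEPLOYS = [
    {"service": "api", "at": datetime(2024, 1, 8, 9, 0)},
    {"service": "checkout", "at": datetime(2024, 1, 8, 14, 5)},
    {"service": "search", "at": datetime(2024, 1, 8, 15, 0)},
]


def suspect_deploy(deploys, onset, window=timedelta(minutes=30)):
    """Return the latest deploy within `window` before `onset`, or None."""
    candidates = [d for d in deploys if timedelta(0) <= onset - d["at"] <= window]
    return max(candidates, key=lambda d: d["at"], default=None)


onset = datetime(2024, 1, 8, 14, 20)
hit = suspect_deploy(DEPLOYS, onset)
print(hit["service"] if hit else "no recent deploy")
```

A match is a place to start looking, not a verdict; a miss is just as useful, because it points the investigation upstream sooner.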

A platform earns its keep when a deploy stops being an event. The day nobody can remember the last scary release is the day the work paid off.

— Kernwise · Engineering notes 021

You run it, not us

The last week is the one that matters most for whether any of this lasts. We do not want to leave behind a platform only we understand — that is just a new dependency, dressed as an improvement.

So we pair. The changes we make, your engineers make with us, and the runbook is written by the person who will be on call, not written by us and handed over. By the time we leave, someone on your team has cut a release the new way, has rolled one back on purpose to prove it works, and has the measurements to know whether the next change is getting faster or slower over time.

If at the end of it the path to production is shorter, the deploys are boring, and your team owns the thing without us, the engagement worked. If the only way to keep it working is to keep paying us, it did not, and we would rather you knew that than not.

When it is not the answer

Platform work is the wrong call when the path to production is already short and the real problem is upstream of it — when changes are slow because the requirements keep moving, or because two teams disagree about who owns a service, or because the code is hard to change for reasons no pipeline can fix. A faster deploy does not help a team that is not sure what to deploy.

It is also the wrong call early. A team of four shipping ten times a week does not have a platform problem; it has a platform it has not needed yet, and building one before the pain is real is just adding the tax we came to remove. We will tell you when that is the case, and the honest answer is sometimes that there is nothing here worth doing for another year.

Footnotes

[1] Moving a check from pre-merge to post-merge only works when the post-merge failure is loud and the rollback is fast. If a bad change can sit in production unnoticed, the check belongs before the merge; the speed is not worth a silent regression.
