From 41 minutes to 6, and nobody is afraid anymore
The ask was to make CI faster. The real problem was that deploys were so big and so rare that every one carried a week of risk. We made them small, frequent, and dull.
- Client: [example]
- Team: Two from Kernwise, three from the platform team
- Outcome: Deploy 41 min → 6 min, batch 1 week → 1 change
- Engagement: Three weeks · fixed scope
The pipeline worked. It built, it tested, it deployed, and the site stayed up. But a single deploy took 41 minutes, and that number had a behavioural cost the team had stopped noticing. Because shipping was slow and tense, people batched their changes. A week of work went out together, on Friday afternoon, in one release. That made every deploy larger, which made it scarier, which made people batch even more. The team asked us to make CI faster. The faster pipeline was the easy half. The point was to break that loop.
We did not buy a faster CI runner. The runner was not the problem.
The problem as we found it
We asked for one thing first: the full log of a single recent deploy, with timestamps on every stage. Not the average duration, not the dashboard — one real run, start to finish. The team had the data; they had only ever looked at the green-or-red summary, never at where the 41 minutes actually went.
Three stages dominated. A container rebuild took 19 minutes because every run rebuilt the image from scratch with no layer caching — npm install and a full dependency compile on every commit, whether or not a dependency had changed. A serial test suite took 14 minutes running 2,200 tests one file at a time on a single worker. And a manual approval gate sat in the middle: a human had to click Approve before deploy, which in practice meant the release waited a median of 6 minutes for someone to notice the Slack message and rubber-stamp it. Those three accounted for 39 of the 41 minutes.
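Turning one timestamped run into per-stage durations is mechanical. The following is a minimal sketch, assuming a hypothetical log format with one `<ISO timestamp> <stage>` line per stage start and a final `done` marker. The source does not give the real log format, the stage names are invented, and the timestamps are illustrative values chosen to match the durations quoted above.

```python
from datetime import datetime

# Hypothetical log format: "<ISO timestamp> <stage that starts here>".
# Timestamps are illustrative, chosen to match the durations quoted above.
SAMPLE_LOG = """\
2024-01-05T15:00:00 image-build
2024-01-05T15:19:00 test
2024-01-05T15:33:00 approval-wait
2024-01-05T15:39:00 everything-else
2024-01-05T15:41:00 done
"""


def stage_minutes(log: str) -> dict[str, float]:
    """Return minutes spent in each stage, from consecutive start timestamps."""
    entries = []
    for line in log.splitlines():
        if not line.strip():
            continue
        stamp, stage = line.split(maxsplit=1)
        entries.append((datetime.fromisoformat(stamp), stage))
    durations = {}
    # Each stage runs until the next entry begins; the last entry is the end marker.
    for (start, stage), (end, _) in zip(entries, entries[1:]):
        durations[stage] = (end - start).total_seconds() / 60
    return durations


def slowest_first(durations: dict[str, float]) -> list[tuple[str, float]]:
    """Sort stages by wall-clock cost, most expensive first."""
    return sorted(durations.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for stage, minutes in slowest_first(stage_minutes(SAMPLE_LOG)):
        print(f"{stage:16s} {minutes:5.1f} min")
```

Sorting by cost rather than by pipeline order puts the stage most worth attacking at the top of the list.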
The constraint we worked under
This was a fixed-scope engagement: three weeks, no new services, no new vendor. The team ran their own CI on infrastructure they understood, and they needed to keep owning it after we left — so nothing that only we could operate, and no managed product that added a bill and a dashboard nobody would check. Every change had to be a diff in a config file they already had in the repo.
The approach
The log made the order of work plain: attack the three slow stages by how much wall-clock each cost, and measure after each change instead of shipping all three at once and guessing which one mattered.
The changes that shipped
1 — Cache the image layers that never change
The Dockerfile copied the whole source tree in before installing dependencies, so any source edit, which is to say every commit, invalidated the dependency layer and forced a full reinstall and compile. Copying package.json and the lockfile first lets the builder reuse the dependency layer whenever dependencies are unchanged, which is almost always. Swapping npm install for npm ci makes that layer install exactly what the lockfile pins.
FROM node:20-slim
WORKDIR /app
-COPY . .
-RUN npm install
+COPY package.json package-lock.json ./
+RUN npm ci
+COPY . .
RUN npm run build

We also turned on the registry-backed layer cache so the builder pulls those layers across machines, not just on a warm local one. That alone took the image build from 19 minutes to a little over 2 on a dependency-unchanged commit.[1]
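The source does not say which builder or flags were used for the registry-backed cache. As one possible shape, assuming Docker Buildx and its `type=registry` cache backend, the build invocation could be assembled like this. The image and cache references are hypothetical, and the command is wrapped in a small Python helper only so it can be inspected before it is run.

```python
import shlex


def buildx_command(image: str, cache_ref: str, context: str = ".") -> list[str]:
    """Build the argv for a registry-cached `docker buildx build`."""
    return [
        "docker", "buildx", "build",
        # Reuse layers that an earlier build exported to the registry.
        "--cache-from", f"type=registry,ref={cache_ref}",
        # Export this build's layers; mode=max also exports intermediate layers.
        "--cache-to", f"type=registry,ref={cache_ref},mode=max",
        "--tag", image,
        "--push",
        context,
    ]


if __name__ == "__main__":
    # Hypothetical image and cache references.
    print(shlex.join(buildx_command(
        "registry.example.com/app:latest",
        "registry.example.com/app:buildcache",
    )))
```

Passing the resulting list to `subprocess.run(..., check=True)` would execute the build. The cache reference is kept separate from the image tag so that cache exports never overwrite a deployable image.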
2 — Run the test suite in parallel
The 2,200 tests ran on one worker, file by file, because that was the default and nobody had revisited it. The suite had no cross-test shared state worth protecting, so it split cleanly. We sharded across four workers and let them run at once.
 test:
   runs-on: ubuntu-latest
+  strategy:
+    matrix:
+      shard: [1, 2, 3, 4]
   steps:
     - uses: actions/checkout@v4
-    - run: npm test
+    - run: npm test -- --shard=${{ matrix.shard }}/4

Fourteen minutes became just under 4. The shards are not perfectly even, so it is not a clean quarter, but it is close.
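The gap between "a clean quarter" and "just under 4" follows from how parallel stages finish: the stage ends when the slowest shard does. Here is a small sketch of that arithmetic. The per-shard durations are hypothetical and were chosen only so that they sum to the original 14 minutes; the source does not report the real ones.

```python
# Hypothetical per-shard durations in seconds; they sum to 840 s (14 min).
SHARD_SECONDS = [234, 210, 204, 192]


def parallel_wall_clock(shard_seconds: list[int]) -> int:
    """Shards run at once, so the stage ends when the slowest shard ends."""
    return max(shard_seconds)


def ideal_wall_clock(shard_seconds: list[int]) -> float:
    """Wall-clock time if the same work were split perfectly evenly."""
    return sum(shard_seconds) / len(shard_seconds)


if __name__ == "__main__":
    actual = parallel_wall_clock(SHARD_SECONDS)
    ideal = ideal_wall_clock(SHARD_SECONDS)
    print(f"ideal {ideal / 60:.1f} min, actual {actual / 60:.1f} min")
```

With these illustrative numbers the even split would be 3.5 minutes and the slowest shard takes 3.9, which is the shape of the result described above.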
3 — Replace the approval click with a canary smoke check
The manual gate added a median 6 minutes of a human noticing a message, and caught nothing. We deployed to a single canary instance, ran a 30-second smoke check against it, and promoted to the fleet only if the check passed — failing the deploy automatically otherwise. A rollback command pinned the previous image and took rollback from a 9-minute manual scramble to 40 seconds.
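The gate logic can be sketched in a few lines. This is an illustration under stated assumptions, not the team's actual script: `deploy_canary`, `probe`, and `promote` are hypothetical callables standing in for whatever the real pipeline uses, and the 5-second probe interval is an assumption. Only the 30-second window comes from the text.

```python
import time
from typing import Callable


def smoke_check(probe: Callable[[], bool], duration_s: float = 30.0,
                interval_s: float = 5.0,
                sleep: Callable[[float], None] = time.sleep) -> bool:
    """Probe the canary repeatedly for duration_s; fail on the first bad probe."""
    elapsed = 0.0
    while elapsed < duration_s:
        if not probe():
            return False
        sleep(interval_s)
        elapsed += interval_s
    return True


def gated_deploy(image: str, previous_image: str,
                 deploy_canary: Callable[[str], None],
                 probe: Callable[[], bool],
                 promote: Callable[[str], None],
                 **smoke_kwargs) -> bool:
    """Deploy to one canary, then promote to the fleet or roll the canary back."""
    deploy_canary(image)
    if smoke_check(probe, **smoke_kwargs):
        promote(image)
        return True
    # Pin the canary back to the last known-good image and fail the deploy.
    deploy_canary(previous_image)
    return False
```

A CI step would call `gated_deploy` and exit non-zero when it returns `False`, so a failed smoke check fails the deploy without a human in the loop.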
The safest deploy is a small one you have done a hundred times. We did not make deploys safer by adding a gate. We made them safer by making them boring.
The outcome
Each change went in on its own and was measured before the next, so the attribution holds. The wall-clock number is the one the team felt, but the batch-size number is the one that changed how they worked.
| Metric | Before | After | Δ |
|---|---|---|---|
| Deploy wall-clock | 41 min | 6 min | −85% |
| Image build stage | 19 min | 2 min | −89% |
| Test stage | 14 min | 4 min | −71% |
| Deploys per week | 1 | 11 | 11× |
| Median batch size | 1 week | 1 change | — |
| Rollback time | 9 min | 40 s | −93% |
| Net new services | — | 0 | — |
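The percentage column can be reproduced from the before and after values. This short sketch recomputes the four timing rows, rounding to whole percents and converting rollback time to seconds so both sides share a unit.

```python
def pct_change(before: float, after: float) -> int:
    """Percentage change from before to after, rounded to a whole percent."""
    return round((after - before) / before * 100)


# Before/after pairs taken from the table above.
ROWS = {
    "Deploy wall-clock (min)": (41, 6),
    "Image build stage (min)": (19, 2),
    "Test stage (min)": (14, 4),
    "Rollback time (s)": (9 * 60, 40),
}

if __name__ == "__main__":
    for name, (before, after) in ROWS.items():
        print(f"{name}: {pct_change(before, after):+d}%")
```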
What we took from it
The lesson was not about caching, and it was not really about CI.[2] It was that the 41-minute number had quietly reshaped how the team shipped (into big, rare, frightening releases) and that the fix for fear was not courage but frequency. When a deploy is six minutes and one change, there is nothing to be brave about.
- Read one timestamped run before you touch anything. The summary says it is slow; the log says which stage to delete.
- A gate that has never blocked anything is not safety. It is latency that feels like safety.
- Small and frequent beats large and careful. The risk in a release is roughly the size of the change inside it.
- Make rollback boring too. A deploy is only safe to do often if undoing it is faster than debating it.
Footnotes
1. On a commit that does change a dependency, the build still pays the full reinstall, roughly 11 minutes. That is rare enough, a few times a week, that it does not move the median.
2. The same shape (a slow, dreaded step driving teams to batch, which makes the step matter even more) has turned up in deploys, code review, and database migrations across our work. The slowness and the batching feed each other.