AI integration

Most teams ask for help integrating a model after they have already decided to. The demo was good. Someone showed a prototype answering questions over the company's documents, everyone agreed it felt like the future, and now the job is to make it real. That is the moment the work usually goes wrong, because the demo proved the model can do the task once, on a question someone already knew the answer to, and production is the other thing — the task ten thousand times, on questions no one has seen, where being wrong is expensive and being slow is worse.

We do not start by picking a model. We start by writing down what good means for this specific job, in numbers, and then measuring how far the current approach is from it. Often that measurement is the whole engagement. A model that wins a demo and loses on a measured baseline is not a smaller version of a working system — it is a system you have not built yet.

What good has to mean here

A model does not have a quality. It has a quality on a task, against a definition, measured on a set of examples. Skip any of those three and you are guessing.

So the first thing we do is make the definition concrete. Not "the answers should be helpful", which cannot be measured and therefore cannot be improved. Instead: for this set of two hundred real questions, drawn from your actual logs, the answer must cite a source that contains the claim, must not invent a policy that does not exist, and must come back in under two seconds at the 95th percentile. Each of those is a number or a yes-or-no, which means each can be counted, which means we can tell whether a change helped.

The two hundred examples matter as much as the definition. They have to look like production, which means they include the malformed questions, the ones with no good answer, and the ones that are really three questions wearing one sentence. A test set made only of clean questions measures a product that does not exist.
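To make the bar concrete, here is a minimal sketch of how those three criteria could be counted. The `Result` fields, the `score` function, and the nearest-rank percentile are our illustrative assumptions, not a fixed implementation; how each yes-or-no gets decided (by a reviewer, a rule, or a second model) is left out on purpose.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One scored answer. Field names are illustrative assumptions."""
    cites_supporting_source: bool  # the cited source actually contains the claim
    invents_policy: bool           # the answer states a policy that does not exist
    latency_s: float               # wall-clock seconds for this answer


def p95(values):
    """95th percentile by the nearest-rank method (integer arithmetic)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def score(results, latency_budget_s=2.0):
    """Return the counts and yes/no checks that make up the bar."""
    passed = sum(
        r.cites_supporting_source and not r.invents_policy for r in results
    )
    latency = p95([r.latency_s for r in results])
    return {
        "pass_rate": passed / len(results),
        "invented_policies": sum(r.invents_policy for r in results),
        "p95_latency_s": latency,
        "latency_ok": latency < latency_budget_s,
    }
```

Every value in the returned dictionary is a count, a ratio, or a boolean, so two runs can be compared directly.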

The retrieval and evaluation loop

Most product AI integrations are retrieval first and model second. The model is only as good as what you put in front of it, and most failures we see are retrieval failures wearing a model's clothes — the answer was wrong because the right passage was never fetched, not because the model could not reason over it.

So we build the loop before we tune anything in it. Retrieve, generate, score against the bar, look at the worst failures, fix the stage that caused them, run it again. The harness is the deliverable people underestimate: once it exists, every later decision — a different embedding model, a smaller generator, a reranking step — becomes a measurement instead of an argument.

Fig 1. The loop we build first: every change is scored against the same fixed example set, so improvement is observed rather than asserted.
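The loop can be sketched in a few lines. In this sketch `retrieve` and `generate` are stand-ins for whatever your stack provides, the `Example` fields are hypothetical, and the substring check is a deliberately crude scorer. The point is the attribution step, which separates retrieval failures from generation failures.

```python
from collections import Counter
from typing import Callable, NamedTuple


class Example(NamedTuple):
    """One fixed test case. Field names are illustrative assumptions."""
    question: str
    gold_passage_id: str  # passage known to contain the answer
    must_contain: str     # phrase a correct answer has to include


def run_loop(examples, retrieve: Callable, generate: Callable):
    """Score every example and attribute each failure to a stage."""
    failures = []
    for ex in examples:
        passages = retrieve(ex.question)  # list of (passage_id, text) pairs
        fetched_ids = [pid for pid, _ in passages]
        answer = generate(ex.question, passages)
        if ex.must_contain.lower() in answer.lower():
            continue  # passed the (crude) check
        # If the right passage was never fetched, blame retrieval, not the model.
        stage = "generation" if ex.gold_passage_id in fetched_ids else "retrieval"
        failures.append((stage, ex.question))
    pass_rate = 1 - len(failures) / len(examples)
    by_stage = Counter(stage for stage, _ in failures)
    return pass_rate, by_stage, failures
```

Sorting the returned failures by stage shows which part of the pipeline to fix before the next run.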

The discipline is fixing the example set and never changing it while you are tuning. The moment you edit the test to make a number go up, the number stops meaning anything. New examples get added in batches, deliberately, with a note about why — never quietly, in the middle of chasing a metric.
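One way to enforce that discipline mechanically is to fingerprint the example set and refuse to score when it has changed. This is a sketch under our own assumptions: the examples are JSON-serialisable, and `fingerprint` and `check_frozen` are hypothetical helper names.

```python
import hashlib
import json


def fingerprint(examples):
    """Stable SHA-256 over the example set, independent of dict key order."""
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_frozen(examples, expected):
    """Refuse to score if the example set differs from the recorded one."""
    actual = fingerprint(examples)
    if actual != expected:
        raise ValueError(
            f"example set changed: {actual[:12]} != {expected[:12]}"
        )
    return actual
```

Recording the fingerprint next to every reported number makes a deliberate batch addition visible as a new fingerprint, with its note, rather than as a silent edit.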

Cost, latency, and the smallest model that clears the bar

A model that passes on quality can still fail in production, because production has two more axes the demo ignored: what each answer costs and how long it takes. Both belong in the bar from the start, not as a tuning step afterwards.

The instinct is to reach for the largest model, because it is the safest answer to "will this be good enough?" It is also usually the wrong one. The largest model costs more per call and answers more slowly, and once you have a harness you can measure exactly how much quality you give up by going smaller. Frequently it is nothing worth paying for.

Approach	Pass rate on the bar	Cost per 1k answers	p95 latency
Large model, no retrieval	71%	$8.40	3.9 s
Large model, with retrieval	93%	$9.10	4.2 s
Small model, with retrieval	91%	$1.30	1.4 s

The numbers above are the shape we keep finding, not a promise about your system. Retrieval moved quality more than model size did — twenty-two points against two — and the small model with good retrieval came within two points of the large one at a seventh of the cost and a third of the latency. That last row is usually what we ship. The two points it gives up are real, and we name them, and then we let you decide with the trade-off in front of you instead of behind it.
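The selection rule behind that choice is simple enough to write down. The sketch below copies the three rows of the table and picks the cheapest approach that clears a bar; the `Approach` type, the function name, and the thresholds in the usage note are illustrative assumptions.

```python
from typing import NamedTuple, Optional


class Approach(NamedTuple):
    name: str
    pass_rate: float      # fraction of the example set passing the bar
    cost_per_1k: float    # dollars per 1,000 answers
    p95_latency_s: float  # seconds


# Figures copied from the table above.
APPROACHES = [
    Approach("Large model, no retrieval", 0.71, 8.40, 3.9),
    Approach("Large model, with retrieval", 0.93, 9.10, 4.2),
    Approach("Small model, with retrieval", 0.91, 1.30, 1.4),
]


def cheapest_clearing(approaches, min_pass_rate, max_p95_s) -> Optional[Approach]:
    """Cheapest approach meeting both the quality and latency thresholds."""
    ok = [
        a for a in approaches
        if a.pass_rate >= min_pass_rate and a.p95_latency_s <= max_p95_s
    ]
    return min(ok, key=lambda a: a.cost_per_1k) if ok else None
```

With a hypothetical bar of a 90% pass rate and two seconds at p95, this selects the small model with retrieval. Raise the pass rate to 92% and nothing clears, because the large model misses the latency budget. That is the trade-off made explicit.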

The largest model is the most expensive way to find out your retrieval is broken.

— Kernwise · Engineering notes 027

Guardrails and the failure you have not budgeted for

Every model integration has a failure distribution, and the work is not making it empty — that is not on offer — but making the failures cheap, visible, and contained. A wrong answer that says it is wrong is a different product from a wrong answer that says it is right, and the gap between them is guardrails.

So we build the boundaries with the same care as the feature. The model gets a way to abstain when retrieval returns nothing relevant, rather than improvising. Answers are checked against the sources they claim to cite before they reach a user. Inputs are bounded so a single request cannot run the cost or the context off a cliff. And every answer carries enough trace — which passages, which model, which version — that when one goes wrong you can find out why in minutes, not days.
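A minimal sketch of those four boundaries in one function follows. Everything named here is an assumption for illustration: the thresholds, the shape of the passages and the draft, the model and version labels, and the verbatim-quote citation check, which is a much cruder test than a real system would want.

```python
from dataclasses import dataclass, field

MAX_QUESTION_CHARS = 2000  # hypothetical input bound
MIN_SCORE = 0.5            # hypothetical relevance cut-off
ABSTAIN_TEXT = "I could not find a source that answers this."


@dataclass
class Answer:
    text: str
    abstained: bool
    trace: dict = field(default_factory=dict)


def guarded_answer(question, retrieve, generate,
                   model="small-model", version="v1"):
    # Bound the input so one request cannot run up cost or context.
    if len(question) > MAX_QUESTION_CHARS:
        raise ValueError("question exceeds input bound")

    # Keep only passages above the relevance cut-off.
    passages = [p for p in retrieve(question) if p["score"] >= MIN_SCORE]
    trace = {"model": model, "version": version,
             "passage_ids": [p["id"] for p in passages]}

    # Abstain rather than improvise when nothing relevant came back.
    if not passages:
        return Answer(ABSTAIN_TEXT, True,
                      {**trace, "reason": "no_relevant_passage"})

    # Assumed draft shape: {"text": ..., "cited_id": ..., "quote": ...}
    draft = generate(question, passages)
    by_id = {p["id"]: p["text"] for p in passages}
    cited_text = by_id.get(draft["cited_id"], "")

    # Crude citation check: the quoted span must appear in the cited passage.
    if not draft["quote"] or draft["quote"] not in cited_text:
        return Answer(ABSTAIN_TEXT, True,
                      {**trace, "reason": "citation_not_supported"})

    return Answer(draft["text"], False,
                  {**trace, "cited_id": draft["cited_id"]})
```

Every return path carries the trace, so an abstention is as easy to investigate as a wrong answer.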

None of this is exotic. It is the same engineering you would apply to any component that can fail in production, applied to one that fails more interestingly than most.

When the answer is to not use a model

The most useful thing we say on some engagements is that the model should not be there. This is not a rhetorical flourish. It is a real outcome, and on roughly one engagement in five it is the right one.

A model earns its place when the task is genuinely ambiguous, the inputs are open-ended, and a wrong answer is recoverable. It does not earn its place when a lookup, a rule, or a regular expression would do the same job deterministically, for a fraction of the cost, with no failure distribution to manage. We have watched teams put a language model in front of a problem that was really a database query, and inherit latency, cost, and a class of bug that did not exist before, to do worse what a WHERE clause already did.

When we find that, we say so plainly, and we show the deterministic version working beside the model so the comparison is concrete rather than theoretical. Telling you not to build the thing you came to build is uncomfortable for a week and correct for years, and it is the part of this work we are most sure about.
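As an illustration of what the deterministic version can look like, here is a sketch for a hypothetical order-status question. The ID format, the `ORDERS` dictionary standing in for a database table, and the function name are all invented for the example.

```python
import re

# Hypothetical order ID format: "ORD-" followed by six digits.
ORDER_ID = re.compile(r"\b(ORD-\d{6})\b", re.IGNORECASE)

# Stand-in for a database table; in practice this is the WHERE clause.
ORDERS = {"ORD-000123": "shipped", "ORD-000456": "processing"}


def answer_order_status(question, orders=ORDERS):
    """Answer deterministically, or return None if this is another kind of question."""
    match = ORDER_ID.search(question)
    if match is None:
        return None  # caller decides what handles everything else
    order_id = match.group(1).upper()
    status = orders.get(order_id)
    if status is None:
        return f"No order found with ID {order_id}."
    return f"Order {order_id} is {status}."
```

The same input always produces the same output, there is no per-call model cost, and the only failure mode is a question the pattern does not match, which is returned to the caller rather than guessed at.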


Footnotes

We treat the abstention path as a feature with its own acceptance bar (how often the model should decline, and on which questions), not as an error case to be suppressed once the demo looks clean.