Cutting unsupported answers from 9% to 0.4%
The team wanted a bigger model. The transcripts said the model never saw the answer. Almost every wrong reply was a retrieval miss wearing a fluent sentence.
- Client: [example]
- Team: Two from Kernwise, three from the team
- Outcome: Unsupported answers 9% → 0.4%
- Engagement: Four weeks · fixed scope
The assistant sounded right. That was the problem. A customer would ask whether a refund window covered a damaged item shipped to a second address, and the assistant would answer in two clean sentences with a specific number of days and a specific exception. The tone was certain. The number was invented. The policy table that held the real answer existed, was indexed, and was sitting in the corpus the whole time — the model simply never saw the row it needed and wrote a plausible one instead.
The team had decided the model was the weak link and wanted to swap to a larger one. A larger model would have written the wrong answer more convincingly.
The problem as we found it
We started where the team had not: in the transcripts. They had a quality score that said the assistant was wrong about nine percent of the time, but no one had read a stratified sample of the failures end to end, with the retrieved context attached. So we pulled two hundred flagged conversations and, for each, looked at the chunks that were actually handed to the model alongside the answer it produced.
The pattern was immediate and boring. In the large majority of wrong answers, the correct fact was not in the retrieved context at all. The model was not reasoning badly over good evidence; it was filling a gap. And the gap had a shape. The questions that failed were overwhelmingly the ones whose answers lived in tables — refund windows by region, fee schedules, tier thresholds — the structured policy that customers ask about most.
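That reading can be made systematic by tagging each flagged conversation according to whether the correct fact appears anywhere in the retrieved context. The sketch below is illustrative only: the `Flagged` record, its field names, and the plain substring match are assumptions rather than the team's tooling, and the sample values are invented.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Flagged:
    """One flagged conversation (hypothetical record shape)."""
    question: str
    expected_fact: str           # text a correct answer would have to rest on
    retrieved_chunks: list[str]  # what was actually handed to the model


def classify(conv: Flagged) -> str:
    """Label a failure by whether the evidence ever reached the model."""
    present = any(conv.expected_fact in chunk for chunk in conv.retrieved_chunks)
    return "reasoning_failure" if present else "retrieval_miss"


def tally(convs: list[Flagged]) -> Counter:
    """Count failures per label across a sample of flagged conversations."""
    return Counter(classify(c) for c in convs)


# Invented sample: one miss, one case where the row was present.
sample = [
    Flagged("EU refund window?", "| EU | 30 |",
            ["Refunds are available in every region."]),
    Flagged("US refund window?", "| US | 14 |",
            ["| region | days |\n| US | 14 |"]),
]
counts = tally(sample)
print(counts)
```

A substring match is the crudest possible test; in practice a human read of the sample is still what decides the label.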
Why the rows went missing
The corpus was chunked by a fixed character window with a small overlap. On prose that is fine. On a markdown policy table it is a disaster. The window would land in the middle of a table, cut it on a character boundary, and produce two chunks: one ending mid-row with a dangling pipe, the next beginning mid-row with no header. Neither chunk was a coherent fact. The embedding for "the EU refund window is 30 days" was smeared across a fragment that read, in part, `| EU | 30 |` with no column names and the row above it severed.
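The effect is easy to reproduce on a toy document. The sketch below runs a fixed character-window chunker of the kind described above over a small table; the document text, the table values other than the EU row, and the deliberately small window are assumptions chosen so the cut lands inside the table.

```python
def chunk(doc: str, window: int = 800, overlap: int = 80) -> list[str]:
    """Fixed character window, blind to table structure."""
    out = []
    for start in range(0, len(doc), window - overlap):
        out.append(doc[start : start + window])
    return out


# Toy document; values other than the EU row are invented for illustration.
doc = (
    "Refunds are available in every region we ship to.\n"
    "| region | days |\n"
    "|---|---|\n"
    "| EU | 30 |\n"
    "| UK | 28 |\n"
)

# A small window so the boundary falls in the middle of the table.
pieces = chunk(doc, window=80, overlap=8)
for i, piece in enumerate(pieces):
    print(i, repr(piece))
```

On this input the first piece ends partway into the EU row, and the piece that does hold the whole EU row starts after the header, so nothing in it says what the columns mean.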
So when a customer asked about the EU refund window, the query embedding had nothing clean to match. The retriever returned the nearest prose paragraph — usually the introductory sentence above the table — which mentioned refunds in general and gave no number. The model got a paragraph about refunds existing, no actual figure, and a question demanding one. It answered anyway.
The model was not hallucinating from nothing. It was hallucinating from a table that had been shredded before it ever arrived.
The constraint we worked under
This was a fixed-scope engagement: four weeks, no model swap, no new services in the stack. The team ran one retrieval service and one vector store and intended to keep running exactly that after we left. Two constraints mattered more than any clever idea we might have had.
First, evaluation came before any change. We would not ship a single adjustment to retrieval without a harness that could tell us, on a held-out set, whether it helped or hurt. Second, the team had to own that harness. An eval suite only we could run would have rotted the week we left.
The approach
The eval harness made the work measurable and the order obvious. Retrieval precision was the upstream number; answer quality was downstream of it. We fixed retrieval and let answer quality follow, measuring the upstream metric after every change so we never confused a real gain with a lucky-sounding sentence.
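As a rough illustration of the upstream number, the sketch below scores a retriever by how often the chunk holding the right row appears in its top three results. The `EvalCase` shape, the `retrieve` callable, and the stub index are hypothetical; the real harness belongs to the team and is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    query: str
    gold_chunk_id: str  # hypothetical: id of the chunk holding the right row


def right_row_in_top_k(
    cases: list[EvalCase],
    retrieve: Callable[[str], list[str]],  # returns ranked chunk ids
    k: int = 3,
) -> float:
    """Fraction of held-out queries whose gold chunk is in the top k results."""
    if not cases:
        raise ValueError("empty evaluation set")
    hits = sum(1 for c in cases if c.gold_chunk_id in retrieve(c.query)[:k])
    return hits / len(cases)


# Stub retriever standing in for the real retrieval service.
fake_index = {
    "EU refund window": ["intro", "eu_row", "fees"],
    "gold tier threshold": ["intro", "fees", "faq", "tiers"],
}
cases = [
    EvalCase("EU refund window", "eu_row"),
    EvalCase("gold tier threshold", "tiers"),
]
score = right_row_in_top_k(cases, lambda q: fake_index[q])
print(score)
```

Re-running a function like this after each retrieval change is what keeps a real gain separate from an answer that merely sounds better.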
The changes that shipped
1 — Chunk on structure, not on character count
The fix that mattered was a table-aware splitter. Instead of cutting at a fixed character window, the chunker now parses the document and keeps a table as one unit when it fits. When a table is too large to fit, it splits on row boundaries and repeats the header on every piece. A chunk is now always a coherent fact: a whole table, or a header plus a contiguous block of rows, never a row torn in half.
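A minimal, runnable approximation of that behavior is sketched here. It borrows the names `parse_blocks`, `split_table_by_rows`, and `split_prose` from the change itself, but the bodies are our guess at a simplest-possible version: documents are plain strings, a table is any run of lines starting with `|`, the first two table lines are assumed to be the header and separator, and tokens are approximated by whitespace-separated words.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str         # "table" or "prose"
    lines: list[str]


def n_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: whitespace-separated words."""
    return len(text.split())


def parse_blocks(doc: str) -> list[Block]:
    """Group consecutive lines into table blocks and prose blocks."""
    blocks: list[Block] = []
    start_new = True
    for line in doc.splitlines():
        if not line.strip():
            start_new = True  # a blank line always closes the current block
            continue
        kind = "table" if line.lstrip().startswith("|") else "prose"
        if start_new or blocks[-1].kind != kind:
            blocks.append(Block(kind, [line]))
        else:
            blocks[-1].lines.append(line)
        start_new = False
    return blocks


def split_table_by_rows(block: Block, max_tokens: int) -> list[str]:
    """Split on row boundaries only, repeating the header on every piece."""
    header, rows = block.lines[:2], block.lines[2:]  # header row + separator
    head = "\n".join(header)
    pieces: list[str] = []
    current: list[str] = []
    for row in rows:
        candidate = "\n".join([head, *current, row])
        if current and n_tokens(candidate) > max_tokens:
            pieces.append("\n".join([head, *current]))
            current = []
        current.append(row)  # an oversized single row is still kept whole
    if current or not pieces:
        pieces.append("\n".join([head, *current]))
    return pieces


def split_prose(block: Block, max_tokens: int) -> list[str]:
    """Pack prose words into windows of at most max_tokens words."""
    words = " ".join(block.lines).split()
    return [" ".join(words[i : i + max_tokens])
            for i in range(0, len(words), max_tokens)]


def chunk(doc: str, max_tokens: int = 320) -> list[str]:
    out: list[str] = []
    for block in parse_blocks(doc):
        split = split_table_by_rows if block.kind == "table" else split_prose
        out.extend(split(block, max_tokens))
    return out


# Toy document; the UK and US rows are invented for illustration.
doc = """Refunds are available in every region we ship to.

| region | days |
|---|---|
| EU | 30 |
| UK | 28 |
| US | 14 |
"""
pieces = chunk(doc, max_tokens=16)
for piece in pieces:
    print(piece, end="\n---\n")
```

With the small budget used here the table comes out as two pieces, each starting with the header and separator lines, and no row is ever cut.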
- def chunk(doc: str, window: int = 800, overlap: int = 80) -> list[str]:
-     # fixed character window: blind to table structure, splits rows mid-cell
-     out = []
-     for start in range(0, len(doc), window - overlap):
-         out.append(doc[start : start + window])
-     return out
+ def chunk(doc: Document, max_tokens: int = 320) -> list[Chunk]:
+     # structure-aware: never split a table row; carry the header into every piece
+     out: list[Chunk] = []
+     for block in parse_blocks(doc):
+         if block.kind == "table":
+             out.extend(split_table_by_rows(block, header=block.header, max_tokens=max_tokens))
+         else:
+             out.extend(split_prose(block, max_tokens=max_tokens))
+     return out
2 — Embed the row with its header for context
A row reads `| EU | 30 | yes |` and means nothing without `| region | days | damaged ok |` above it. The table splitter now prepends the column header to every row-block before embedding, so the vector for a refund-window row actually encodes what the columns are. This is what let a question phrased in plain language match a row phrased in pipes.
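The intuition can be shown without an embedding model at all. The sketch below uses shared word tokens as a toy stand-in for embedding similarity, which is a much cruder signal than a real vector comparison; the `with_header` helper and the query wording are assumptions made for illustration.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(query: str, chunk: str) -> int:
    """Toy relevance signal: number of word tokens the two texts share."""
    return len(tokens(query) & tokens(chunk))


def with_header(header: str, rows: list[str]) -> str:
    """Text sent to the embedder: header line first, then the row block."""
    return "\n".join([header, *rows])


header = "| region | days | damaged ok |"
row = "| EU | 30 | yes |"
query = "How many days is the refund window in the EU region?"

bare = overlap(query, row)
carried = overlap(query, with_header(header, [row]))
print(bare, carried)
```

The bare row shares only the region code with the question, while the header-carrying text also shares the column names, which is the extra signal the embedding gets to work with.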
3 — Add a grounding check before the model answers
Retrieval was now returning the right row most of the time, but we wanted the assistant to refuse rather than guess on the rest. We added a cheap check: if the retrieved context does not contain a candidate answer above a similarity floor, the assistant says it cannot find the policy and routes to a human instead of composing a confident sentence from nothing. Saying nothing is a valid answer, and a far better one than a wrong number.
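One possible shape for such a check is sketched below. It assumes the retriever already returns a similarity score per chunk; the `Scored` record, the `0.75` floor, the fallback wording, and the `generate` callable are all hypothetical and would need tuning against the eval set.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scored:
    text: str
    similarity: float  # assumed to come from the existing retriever, in [0, 1]


FALLBACK = "I can't find that policy. Routing you to a human agent."


def answer_or_refuse(
    question: str,
    retrieved: list[Scored],
    generate: Callable[[str, list[str]], str],  # the unchanged model call
    floor: float = 0.75,                        # hypothetical threshold
) -> str:
    """Answer only from chunks above the floor; otherwise refuse and route."""
    grounded = [c for c in retrieved if c.similarity >= floor]
    if not grounded:
        return FALLBACK
    return generate(question, [c.text for c in grounded])


def fake_generate(question: str, context: list[str]) -> str:
    """Stub model used only for this demonstration."""
    return "ANSWER from " + context[0]


weak = [Scored("Refunds are available in every region.", 0.41)]
strong = [Scored("| region | days |\n| EU | 30 |", 0.88)]

print(answer_or_refuse("EU refund window?", weak, fake_generate))
print(answer_or_refuse("EU refund window?", strong, fake_generate))
```

The check sits in front of the model rather than inside it, so it adds no new service and the model itself stays untouched.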
The outcome
Every number below is from the eval harness the team now owns, run on the held-out set before and after the changes. The model did not change. Net new model spend was zero: the bigger model the team had been about to buy was never needed.[1]
| Metric | Before | After | Δ |
|---|---|---|---|
| Unsupported-answer rate | 9.0% | 0.4% | −96% |
| Retrieval precision (right row in top 3) | 61% | 97% | +36 pts |
| p95 retrieval latency | 180 ms | 190 ms | +10 ms |
| Net new model spend | — | 0 | — |
| Net new services | — | 0 | — |
The unsupported-answer rate fell because the right row started reaching the model, and on the rare miss the grounding check stopped the guess. Latency moved ten milliseconds — the structure-aware splitter does a little more work at ingest, almost none at query time. The team had been ready to spend on a larger model; the spend was zero.
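The Δ column mixes two kinds of change: the unsupported-answer rate is reported as a relative change, while precision and latency are absolute differences. The short check below recomputes each from the before and after values in the table; the helper names are ours.

```python
def relative_change(before: float, after: float) -> float:
    """Change as a fraction of the starting value."""
    return (after - before) / before


def absolute_change(before: float, after: float) -> float:
    """Plain difference, in the metric's own units."""
    return after - before


unsupported = relative_change(9.0, 0.4)  # rate in percent, before and after
precision = absolute_change(61, 97)      # percentage points
latency = absolute_change(180, 190)      # milliseconds

print(f"{unsupported:.1%}", precision, latency)
```

The relative figure rounds to the −96% shown in the table, and the other two match the +36 points and +10 ms.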
What we took from it
The lesson was not about this model, and it was not about refunds.[2] It was that a fluent wrong answer is almost always a retrieval failure that looks like a reasoning failure, and that the two are only distinguishable by reading the context the model was actually handed.
- Read the failures with the retrieved context attached. The answer alone tells you it was wrong; the context tells you why.
- A bigger model does not fix a missing row. It writes the wrong answer more convincingly.
- Chunk on the structure of the document, not on a character count. A table row torn in half is not a fact.
- Build the eval harness first, and make the team own it. A retrieval change you cannot measure is a guess, and a harness you cannot run is gone the day you leave.
- Letting the assistant say it does not know is a feature, not a gap. Silence beats a confident invention.
Footnotes
[1] The team had a six-figure annual estimate queued for the larger model. None of it was spent. The wrong-answer problem was upstream of the model entirely.
[2] The same shape (a structured document shredded by a structure-blind chunker, then blamed on the model) has shown up in most of the retrieval engagements we have run. It is close to a default failure mode.