A user reported that our AI-generated email drafts kept adopting the wrong voice. Someone would get an email from their insurance agent, and the auto-drafted reply would be written as if the user were the insurance agent. The user’s correction in the feedback log said everything:
“The sender is the provider, I’m the customer, please revise.”
I spent a morning debugging this with Claude as my pair-programmer. The story of how we eventually found the fix is, I think, a case study in how engineers should actually work with AI — not as an oracle that hands you answers, but as a fast collaborator whose premises you must constantly check.
The AI’s first move: an impressive reproducer test
I asked Claude to investigate the draft-generation pipeline and build a test that reproduced the bug. It did so quickly and well. Within a few turns I had (see the sketch after this list):
- A nicely structured integration test loading the captured feedback fixture
- An LLM-as-judge wrapper scoring the generated draft for role correctness (customer vs. provider voice, with structured JSON output)
- A monkeypatch to bypass database lookups so the test could run without infrastructure
- A parallelized 5-sample loop to handle the stochasticity of temperature 0.7
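Put together, the pieces looked roughly like the sketch below. To be clear, this is a hedged reconstruction, not our actual test: `drafts.pipeline.generate_draft`, `load_user_profile`, the `llm_client` import, and the fixture path are hypothetical stand-ins; only the structure (pytest fixture, monkeypatched lookup, LLM-as-judge with structured JSON, five parallel samples) mirrors what Claude built.

```python
# Hypothetical sketch of the reproducer test: module paths, function names,
# the fixture path, and the LLM client are stand-ins, not our real code.
import json
from concurrent.futures import ThreadPoolExecutor

import pytest

from drafts.pipeline import generate_draft  # hypothetical draft-generation entry point
from llm_client import complete             # hypothetical LLM client call

SAMPLES = 5  # temperature 0.7 is stochastic, so score several generations


def judge_role_correctness(email: dict, draft: str) -> dict:
    """LLM-as-judge: classify whose voice the drafted reply is written in."""
    prompt = (
        "You are grading an auto-drafted reply to an email.\n"
        f"Original email from {email['sender']}:\n{email['body_preview']}\n\n"
        f"Drafted reply:\n{draft}\n\n"
        'Respond with JSON only: {"voice": "customer" or "provider", "reason": "..."}'
    )
    return json.loads(complete(prompt, temperature=0.0))


@pytest.fixture
def feedback_email():
    # The captured feedback fixture from the user's bug report.
    with open("tests/fixtures/role_confusion_feedback.json") as f:
        return json.load(f)


def test_draft_is_in_customer_voice(feedback_email, monkeypatch):
    # Bypass the database lookup so the test runs without infrastructure.
    monkeypatch.setattr(
        "drafts.pipeline.load_user_profile",
        lambda user_id: feedback_email["user_profile"],
    )

    def one_sample(_):
        draft = generate_draft(feedback_email, temperature=0.7)
        return judge_role_correctness(feedback_email, draft)["voice"]

    with ThreadPoolExecutor(max_workers=SAMPLES) as pool:
        voices = list(pool.map(one_sample, range(SAMPLES)))

    assert voices.count("customer") == SAMPLES, f"role confusion in samples: {voices}"
```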
I ran it. All five samples came back as clean customer-voice drafts. Zero role confusions.
The test passed.
The AI’s theories for why the bug “vanished”
When I asked Claude why a bug the user had clearly experienced wasn’t reproducing, it generated a plausible list:
- Temperature non-determinism — at temp 0.7, the same inputs produce different outputs every time
- Model drift — the fixture might have been captured with a different model version
- Implicit hints in the email — the subject starting with “RE:” and the sender’s email domain (obviously an insurance provider) gave the LLM enough clues to figure out the roles
- Unlucky sampling at capture time — we might have caught a rare misbehaving run
Each of those theories was plausible. Some were probably even partially true. Claude proposed running 20 samples, lowering temperature, pinning to a specific model version, crafting a more ambiguous test input.
I didn’t like any of it. It all felt like patching around the fact that we weren’t actually seeing the bug.
The pivot
I asked one question:
“Can you double-check the test you wrote? Is it reproducing what we have in production?”
This is the kind of question that is easy for a human and apparently not natural for AI. Claude audited its own test and produced a table that, in retrospect, should have been the first thing it built:
| Field | Production pipeline | The test | Match? |
|---|---|---|---|
| Sender | agent’s display name + email | same | yes |
| Subject | "RE: About <redacted policy type> insurance" | same | yes |
| Model / temperature | open-weight model @ 0.7 | same | yes |
| Body content | body.content from the email provider’s JSON (raw HTML, 2000 then 1000 chars) | fixture’s body_preview (254 chars of clean plain text) | no |
There it was. The production pipeline was handing the LLM up to 1000 characters of Microsoft Word / Outlook HTML — `<html><head><meta><style>@font-face{...}</style>` — and never reaching the actual email body. The test was handing the LLM a short, clean plain-text snippet — a one-line “thanks, see attached certificate” style acknowledgment. Of course the LLM could infer roles correctly from that. It had actual text to reason about.
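In hindsight, a thirty-second comparison of the two inputs would have exposed this immediately. A throwaway sketch of that check (the fixture path is a stand-in, and the printed previews in the comments are illustrative, not the real email):

```python
# Compare what the reproducer test feeds the model vs. what production's slice sends.
# The fixture path is a stand-in; body_preview and full_text are the fixture fields above.
import json

with open("tests/fixtures/role_confusion_feedback.json") as f:
    fixture = json.load(f)

test_input = fixture["body_preview"]      # what the test was using (clean plain text)
prod_input = fixture["full_text"][:1000]  # what the production slice actually sends

print("TEST:", repr(test_input[:120]))
print("PROD:", repr(prod_input[:120]))
# Illustrative output:
# TEST: 'Thanks, please see the attached certificate ...'
# PROD: '<html><head><meta ...><style>@font-face{ ...'
```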
I pointed Claude at the fixture’s full_text field (the real HTML the user’s inbox had stored) and told it to feed that to the draft generator.
5/5 samples immediately confused roles. Every draft was written in the voice of an insurance agent offering coverage:
“I’ve reviewed the details you sent regarding worker compensation coverage, and I’d be happy to help clarify the policy provisions…”
“We can help you assess coverage needs based on your business size, industry, and the states you operate in…”
The bug was now deterministic.
The fix was embarrassingly simple
Once we could see it, the root cause was obvious. The production code was doing this:
```python
full_body = email_data['body']['content']  # raw HTML from Graph API
item = {
    'body_preview': full_body[:2000],  # 2000 chars of HTML
    ...
}

# Then later, in the prompt builder:
content = f"Content:\n{body_preview[:1000]}"  # another slice, still HTML
```
The first 1000 characters of any Outlook-authored email are CSS boilerplate. The LLM never saw the actual message. When the “body” is just `<style>@font-face{font-family:Calibri}...`, and the subject is something generic about a policy type, the model does what it’s designed to do: invent a plausible-sounding reply. Since its training data is full of “insurance agent offering coverage” content, that’s what it produced. Not role confusion — hallucination with a consistent direction.
The fix was three lines: strip HTML before slicing, bump the cap up (now that we were showing real content, not CSS), and remove the redundant second slice in the prompt builder. The test dropped to 0/5 confusions. A second fixture we hadn’t tested yet — a different bug report, about the draft “not understanding the context” — also dropped to 0/5 once we ran it through the same fix. Same root cause, different symptom.
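For the record, the fix applied to the snippet above had roughly this shape. A sketch under assumptions: BeautifulSoup stands in for whatever HTML stripper the codebase uses, and the 4000-character cap is a guess; the point is that stripping happens before any slicing, and the prompt builder stops re-slicing.

```python
# Sketch of the fix; BeautifulSoup and the 4000-char cap are assumptions,
# not the exact values in the real commit.
from bs4 import BeautifulSoup


def to_plain_text(html: str) -> str:
    """Drop tags plus <style>/<script>/<head> blocks and collapse whitespace,
    so the slice starts at the actual message instead of Outlook's CSS."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["style", "script", "head"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())


full_body = to_plain_text(email_data['body']['content'])  # strip HTML *before* slicing
item = {
    'body_preview': full_body[:4000],  # bigger cap: it's real content now, not CSS
    # ... other fields unchanged
}

# In the prompt builder, use the preview as-is; no redundant second slice.
content = f"Content:\n{item['body_preview']}"
```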
The lesson I keep relearning
AI pair-programmers are fast, thorough, and surprisingly good at following instructions. What they are not good at, at least not yet, is questioning the premises of a task. Claude happily built a beautiful reproducer test, generated multiple theories for why it didn’t reproduce the bug, and proposed increasingly elaborate mitigations, all built on top of a test whose relationship to production it had never examined.
The question “does this match production?” took me five seconds to ask. It was the highest-leverage thing I did all morning.
There is a tempting narrative that AI tools will replace the junior engineer and let senior engineers do “more important work.” The reality I keep running into is different. AI tools amplify whichever direction you point them. If you point them at a plausible-but-wrong premise, they will build an impressive edifice on top of it. Senior engineering judgment — the instinct to ask “wait, is this even the right question?” — doesn’t get less important. It gets more, because the cost of building on the wrong premise goes from “a few hours” to “a full afternoon of beautifully-structured work that solved the wrong problem.”
The practical takeaway for anyone working this way:
- When AI tells you a bug doesn’t reproduce, don’t accept the first explanation. Ask whether the reproduction path matches production first.
- When AI hands you a “maybe it’s this, maybe it’s that” list of theories, that’s often a signal that the premise is wrong, not that the bug is genuinely mysterious.
- The questions that move things forward are usually simple. You don’t need to out-think the AI. You need to out-frame it.
The final commit on this branch is 36 lines of production-code changes — strip HTML, extract TO/CC, add an identity block to the prompt. It took me a morning to write because the first few hours were spent on the wrong reproducer. That’s the shape of this work now. The AI is fast; the expensive part is making sure we’re asking it the right question.