<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>~/notes</title><description>Notes from the build process — software, music, the occasional photo.</description><link>https://xweng.dev/</link><language>en-us</language><item><title>The question that cracked the bug</title><link>https://xweng.dev/posts/the-question-that-cracked-the-bug/</link><guid isPermaLink="true">https://xweng.dev/posts/the-question-that-cracked-the-bug/</guid><description>A morning of debugging with an AI pair-programmer, and the simple question — does this match production? — that turned a passing test into a deterministic reproducer.</description><pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A user reported that our AI-generated email drafts kept adopting the wrong voice. Someone would get an email from their insurance agent, and the auto-drafted reply would be written &lt;em&gt;as if the user were the insurance agent&lt;/em&gt;. The user’s correction in the feedback log said everything:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“The sender is the provider, I’m the customer, please revise.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I spent a morning debugging this with Claude as my pair-programmer. The story of how we eventually found the fix is, I think, a case study in how engineers should actually work with AI — not as an oracle that hands you answers, but as a fast collaborator whose premises you must constantly check.&lt;/p&gt;
&lt;h2&gt;The AI’s first move: an impressive reproducer test&lt;/h2&gt;
&lt;p&gt;I asked Claude to investigate the draft-generation pipeline and build a test that reproduced the bug. It did so quickly and well. Within a few turns I had:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A nicely structured integration test loading the captured feedback fixture&lt;/li&gt;
&lt;li&gt;An LLM-as-judge wrapper scoring the generated draft for role correctness (customer vs. provider voice, with structured JSON output)&lt;/li&gt;
&lt;li&gt;A monkeypatch to bypass database lookups so the test could run without infrastructure&lt;/li&gt;
&lt;li&gt;A parallelized 5-sample loop to handle the stochasticity of temperature 0.7 (a rough sketch of the harness follows this list)&lt;/li&gt;
&lt;/ul&gt;
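&lt;p&gt;The test itself isn’t reproduced in this post, but the shape is easy to sketch. This is a reconstruction, not Claude’s actual code: &lt;code&gt;generate_draft&lt;/code&gt; and &lt;code&gt;call_llm&lt;/code&gt; are hypothetical stand-ins for our draft pipeline and LLM client.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import json
from concurrent.futures import ThreadPoolExecutor

JUDGE_PROMPT = (
    &apos;Grade this auto-drafted reply. Whose voice is it written in? &apos;
    &apos;Answer with JSON: {&quot;voice&quot;: &quot;customer&quot;} or {&quot;voice&quot;: &quot;provider&quot;}.\n\nDraft:\n&apos;
)

def judge_voice(draft):
    &quot;&quot;&quot;LLM-as-judge: classify which role the draft was written as.&quot;&quot;&quot;
    raw = call_llm(JUDGE_PROMPT + draft, temperature=0.0)  # call_llm: hypothetical LLM client
    return json.loads(raw)[&apos;voice&apos;]

def count_role_confusions(fixture, samples=5):
    &quot;&quot;&quot;Generate N drafts in parallel (temperature 0.7 is stochastic) and judge each.&quot;&quot;&quot;
    with ThreadPoolExecutor(max_workers=samples) as pool:
        # generate_draft: stand-in for the production draft generator under test
        drafts = list(pool.map(lambda _: generate_draft(fixture), range(samples)))
    return sum(judge_voice(d) != &apos;customer&apos; for d in drafts)
&lt;/code&gt;&lt;/pre&gt;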
&lt;p&gt;I ran it. All five samples came back as clean customer-voice drafts. Zero role confusions.&lt;/p&gt;
&lt;p&gt;The test passed.&lt;/p&gt;
&lt;h2&gt;The AI’s theories for why the bug “vanished”&lt;/h2&gt;
&lt;p&gt;When I asked Claude why a bug the user had clearly experienced wasn’t reproducing, it generated a plausible list:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Temperature non-determinism&lt;/strong&gt; — at temp 0.7, the same inputs produce different outputs every time&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model drift&lt;/strong&gt; — the fixture might have been captured with a different model version&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Implicit hints in the email&lt;/strong&gt; — the subject starting with “RE:” and the sender’s email domain (obviously an insurance provider) gave the LLM enough clues to figure out the roles&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unlucky sampling at capture time&lt;/strong&gt; — we might have caught a rare misbehaving run&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each of those theories was &lt;em&gt;plausible&lt;/em&gt;. Some were probably even partially true. Claude proposed running 20 samples, lowering temperature, pinning to a specific model version, crafting a more ambiguous test input.&lt;/p&gt;
&lt;p&gt;I didn’t like any of it. It all felt like patching around the fact that we weren’t actually seeing the bug.&lt;/p&gt;
&lt;h2&gt;The pivot&lt;/h2&gt;
&lt;p&gt;I asked one question:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“Can you double-check the test you wrote? Is it reproducing what we have in production?”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is the kind of question that is easy for a human and apparently not natural for AI. Claude audited its own test and produced a table that, in retrospect, should have been the first thing it built:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Production pipeline&lt;/th&gt;
&lt;th&gt;The test&lt;/th&gt;
&lt;th&gt;Match?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sender&lt;/td&gt;
&lt;td&gt;agent’s display name + email&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subject&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&quot;RE: About &amp;lt;redacted policy type&amp;gt; insurance&quot;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model / temperature&lt;/td&gt;
&lt;td&gt;open-weight model @ 0.7&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Body content&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;body.content&lt;/code&gt; from the email provider’s JSON (&lt;strong&gt;raw HTML&lt;/strong&gt;, 2000 then 1000 chars)&lt;/td&gt;
&lt;td&gt;fixture’s &lt;code&gt;body_preview&lt;/code&gt; (&lt;strong&gt;254 chars of clean plain text&lt;/strong&gt;)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;no&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There it was. The production pipeline was handing the LLM up to 1000 characters of Microsoft Word / Outlook HTML — &lt;code&gt;&amp;lt;html&amp;gt;&amp;lt;head&amp;gt;&amp;lt;meta&amp;gt;&amp;lt;style&amp;gt;@font-face{...}&amp;lt;/style&amp;gt;&lt;/code&gt; — and never reaching the actual email body. The test was handing the LLM a short, clean plain-text snippet — a one-line “thanks, see attached certificate” style acknowledgment. Of course the LLM could infer roles correctly from that. It had actual text to reason about.&lt;/p&gt;
&lt;p&gt;I pointed Claude at the fixture’s &lt;code&gt;full_text&lt;/code&gt; field (the real HTML the user’s inbox had stored) and told it to feed &lt;em&gt;that&lt;/em&gt; to the draft generator.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5/5 samples immediately confused roles.&lt;/strong&gt; Every draft was written in the voice of an insurance agent offering coverage:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“I’ve reviewed the details you sent regarding worker compensation coverage, and I’d be happy to help clarify the policy provisions…”&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;“We can help you assess coverage needs based on your business size, industry, and the states you operate in…”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The bug was now deterministic.&lt;/p&gt;
&lt;h2&gt;The fix was embarrassingly simple&lt;/h2&gt;
&lt;p&gt;Once we could see it, the root cause was obvious. The production code was doing this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;full_body = email_data[&apos;body&apos;][&apos;content&apos;]    # raw HTML from Graph API
item = {
    &apos;body_preview&apos;: full_body[:2000],        # 2000 chars of HTML
    ...
}
# Then later, in the prompt builder:
content = f&quot;Content:\n{body_preview[:1000]}&quot;  # another slice, still HTML
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first 1000 characters of any Outlook-authored email are CSS boilerplate. The LLM never saw the actual message. When the “body” is just &lt;code&gt;&amp;lt;style&amp;gt;@font-face{font-family:Calibri}...&lt;/code&gt;, and the subject is something generic about a policy type, the model does what it’s designed to do: invent a plausible-sounding reply. Since its training data is full of “insurance agent offering coverage” content, that’s what it produced. Not role confusion — hallucination with a consistent direction.&lt;/p&gt;
&lt;p&gt;The fix was three lines: strip HTML &lt;em&gt;before&lt;/em&gt; slicing, bump the cap up (now that we were showing real content, not CSS), and remove the redundant second slice in the prompt builder. The test dropped to 0/5 confusions. A second fixture we hadn’t tested yet — a &lt;em&gt;different&lt;/em&gt; bug report, about the draft “not understanding the context” — also dropped to 0/5 once we ran it through the same fix. Same root cause, different symptom.&lt;/p&gt;
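&lt;p&gt;The actual diff isn’t reproduced here, but the shape of the “strip before slicing” change is simple. A minimal sketch (the helper and the new cap value are illustrative, not the literal commit):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import re
from html import unescape

def strip_html(html):
    &quot;&quot;&quot;Crude HTML-to-text: drop style/script blocks, drop tags, collapse whitespace.&quot;&quot;&quot;
    text = re.sub(r&apos;(?is)&lt;(style|script)[^&gt;]*&gt;.*?&lt;/\1&gt;&apos;, &apos; &apos;, html)
    text = re.sub(r&apos;(?s)&lt;[^&gt;]+&gt;&apos;, &apos; &apos;, text)
    return re.sub(r&apos;\s+&apos;, &apos; &apos;, unescape(text)).strip()

# In the pipeline: convert to text first, then slice, so the slice is the message
# rather than the Outlook CSS preamble. The 4000 cap is a placeholder, not our value.
# body_preview = strip_html(email_data[&apos;body&apos;][&apos;content&apos;])[:4000]
&lt;/code&gt;&lt;/pre&gt;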
&lt;h2&gt;The lesson I keep relearning&lt;/h2&gt;
&lt;p&gt;AI pair-programmers are fast, thorough, and surprisingly good at following instructions. What they are &lt;em&gt;not&lt;/em&gt; good at, at least not yet, is questioning the premises of a task. Claude happily built a beautiful reproducer test, generated multiple theories for why it didn’t reproduce the bug, and proposed increasingly elaborate mitigations — all inside a test whose relationship to production it had never examined.&lt;/p&gt;
&lt;p&gt;The question “does this match production?” took me five seconds to ask. It was the highest-leverage thing I did all morning.&lt;/p&gt;
&lt;p&gt;There is a tempting narrative that AI tools will replace the junior engineer and let senior engineers do “more important work.” The reality I keep running into is different. AI tools &lt;em&gt;amplify&lt;/em&gt; whichever direction you point them. If you point them at a plausible-but-wrong premise, they will build an impressive edifice on top of it. Senior engineering judgment — the instinct to ask “wait, is this even the right question?” — doesn’t get less important. It gets &lt;em&gt;more&lt;/em&gt;, because the cost of building on the wrong premise goes from “a few hours” to “a full afternoon of beautifully-structured work that solved the wrong problem.”&lt;/p&gt;
&lt;p&gt;The practical takeaway for anyone working this way:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When AI tells you a bug doesn’t reproduce, don’t accept the first explanation. Ask whether the reproduction path matches production first.&lt;/li&gt;
&lt;li&gt;When AI hands you a “maybe it’s this, maybe it’s that” list of theories, that’s often a signal that the premise is wrong, not that the bug is genuinely mysterious.&lt;/li&gt;
&lt;li&gt;The questions that move things forward are usually simple. You don’t need to out-think the AI. You need to out-frame it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The final commit on this branch is 36 lines of production-code changes — strip HTML, extract TO/CC, add an identity block to the prompt. It took me a morning to write because the first few hours were spent on the wrong reproducer. That’s the shape of this work now. The AI is fast; the expensive part is making sure we’re asking it the right question.&lt;/p&gt;
</content:encoded><category>ai</category><category>debugging</category></item><item><title>Stop over-engineering your AI agent</title><link>https://xweng.dev/posts/stop-over-engineering-your-ai-agent/</link><guid isPermaLink="true">https://xweng.dev/posts/stop-over-engineering-your-ai-agent/</guid><description>We built an AI agent that searches across emails, Slack, Jira. Most of our &apos;improvements&apos; were making things worse. The fix had two halves: subtract intelligence that was noise, then encode the knowledge models can&apos;t derive.</description><pubDate>Wed, 22 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We built an AI agent that searches across a user’s emails, Slack, Jira, and other channels to answer questions like “find the Acme project update” or “list all release versions.” The search was broken in ways we didn’t expect — and the fixes were counterintuitive.&lt;/p&gt;
&lt;p&gt;Every instinct said &lt;strong&gt;add more intelligence&lt;/strong&gt;: smarter classification, better query rewriting, self-grading loops, bigger models. Every one of those instincts was wrong.&lt;/p&gt;
&lt;p&gt;The real path had two parts:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Get out of the AI’s way.&lt;/strong&gt; Most of our “improvements” were making things worse. Removing them fixed more than adding them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Teach the AI what it can’t derive.&lt;/strong&gt; Once we stripped the pipeline down, we found failures that no amount of model capability could fix. Those needed human knowledge encoded as part of the system.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This post is about both halves.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Part 1: Get out of the AI’s way&lt;/h2&gt;
&lt;h3&gt;Wall #1: dense-only search doesn’t find what users actually search for&lt;/h3&gt;
&lt;p&gt;Users typed a client name — zero results. They typed a short project code — zero results. Dense vector search is great at semantic similarity but terrible at exact entity matching. A 3-letter abbreviation doesn’t embed well. Proper nouns don’t cluster near related content. And the embedding of “Acme Corp” doesn’t land near the actual email about Acme Corp as reliably as you’d expect.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The instinct:&lt;/strong&gt; train better embeddings. Fine-tune the model. Add query expansion.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What actually worked:&lt;/strong&gt; add a boring keyword search alongside the semantic one. Fuse the results. This is hybrid search with &lt;a href=&quot;https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf&quot;&gt;Reciprocal Rank Fusion (RRF)&lt;/a&gt;, and every serious production system uses it. Dense retrieval and sparse (keyword) retrieval solve different problems. You need both.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;query → dense search + keyword search (parallel) → RRF fusion → results
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That’s it. No training, no fine-tuning, no fancy rewriters. Just two searches merged by rank.&lt;/p&gt;
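&lt;p&gt;The fusion step itself is a few lines. A minimal sketch (&lt;code&gt;k=60&lt;/code&gt; is the constant from the RRF paper; the two search functions are stand-ins for whatever retrievers you already have):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    &quot;&quot;&quot;Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank).&quot;&quot;&quot;
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# results = rrf_fuse([dense_search(query), keyword_search(query)])  # two ranked ID lists
&lt;/code&gt;&lt;/pre&gt;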
&lt;h3&gt;Wall #2: pre-search LLM calls were making results worse&lt;/h3&gt;
&lt;p&gt;Our original pipeline looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;query → LLM call #1: classify the query type (1-2s)
      → LLM call #2: rewrite into 2-3 sentences (1-2s)
      → search → results
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It felt intelligent. It was actually terrible.&lt;/p&gt;
&lt;p&gt;The classifier would guess the query type (“this is a notification query”) and that guess would control downstream filtering. When it guessed wrong — which happened constantly on short or ambiguous queries — the filter threw out valid results.&lt;/p&gt;
&lt;p&gt;The rewriter was worse. Give it a client name and it would helpfully expand the query into something like &lt;em&gt;“update on the project, including progress, milestones, and team members involved.”&lt;/em&gt; The original keyword now lives inside a padded sentence instead of being the focused term the user typed. The embedding drifts. Keyword matching weakens. Results get worse.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The instinct:&lt;/strong&gt; make the classifier smarter. Make the rewriter better. Add fallbacks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What actually worked:&lt;/strong&gt; delete both. Pass the user’s original query directly to hybrid search.&lt;/p&gt;
&lt;p&gt;Latency dropped from ~4-6s to ~2-3s. Zero-result failures went away. Quality went up.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The principle:&lt;/strong&gt; every LLM call before retrieval is a place where your pipeline can add noise. If it isn’t provably adding signal, it’s subtracting it.&lt;/p&gt;
&lt;h3&gt;Wall #3: your tool output is for the LLM, not for humans&lt;/h3&gt;
&lt;p&gt;Search results were formatted to be readable: full previews, metadata fields, relevance scores, conversation IDs, the works. Roughly 300 tokens per result. Ten results per search.&lt;/p&gt;
&lt;p&gt;The agent would make two searches and suddenly its context was full of ~6,000 tokens of formatted noise. By the third search call, the model would start narrating its intentions (“Let me search more specifically…”) instead of actually making tool calls. It had lost coherence.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The instinct:&lt;/strong&gt; give the LLM a bigger context window. Upgrade the model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What actually worked:&lt;/strong&gt; cut tool output from ~300 tokens per result to ~50. Stop sending conversation IDs and relevance percentages to the LLM. Keep just what’s needed to reason about the next step.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Before (~300 tokens per result)
[1] OUTLOOK: [repo] Release v0.4.0 (PR #332)
    Date: 2026-03-11 10:25
    Relevance Score: 98%
    Participants: github-actions[bot] &amp;lt;notifications@...&amp;gt;
    Content Preview:
    Summary: Bump project version from 0.3.2 to 0.4.0...
    Outlook Conversation ID: AAQkADE1YTA2Y2Q4...

# After (~50 tokens per result)
[src:332] (outlook) Release v0.4.0 (PR #332) | 2026-03-11 | From: github-actions
    Bump project version from 0.3.2 to 0.4.0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The 6x reduction is the difference between the agent finishing its job and running out of coherence mid-response. Design tool output for the reader that’s consuming it — which, in an agentic system, isn’t you.&lt;/p&gt;
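&lt;p&gt;The compact format is just a formatting function with opinions. A sketch of the shape (field names here are illustrative, not our actual schema):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def format_result_for_llm(hit):
    &quot;&quot;&quot;One compact entry per hit: enough to decide the next step, nothing more.&quot;&quot;&quot;
    header = (
        f&quot;[src:{hit[&apos;id&apos;]}] ({hit[&apos;channel&apos;]}) {hit[&apos;subject&apos;]} &quot;
        f&quot;| {hit[&apos;date&apos;]} | From: {hit[&apos;sender&apos;]}&quot;
    )
    # Conversation IDs, relevance percentages, and raw previews stay out of the context.
    return header + &quot;\n    &quot; + hit[&apos;summary&apos;][:120]
&lt;/code&gt;&lt;/pre&gt;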
&lt;hr /&gt;
&lt;p&gt;That’s the “get out of the way” part. Three changes. All subtractions. All improved the system.&lt;/p&gt;
&lt;p&gt;Now here’s where it gets interesting.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Part 2: teach the AI what it can’t derive&lt;/h2&gt;
&lt;p&gt;After stripping the pipeline down, we tested a harder query: &lt;strong&gt;“Find Alex’s phone number.”&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The number existed in the system. It lived in the email signature of a message from Alex. The agent searched “Alex phone number” — found emails mentioning phone numbers but not the actual digits. It searched “Alex Chen phone” — same thing. It tried six or seven variations. It eventually gave up.&lt;/p&gt;
&lt;p&gt;The number was right there. The agent couldn’t find it.&lt;/p&gt;
&lt;p&gt;Here’s why: &lt;strong&gt;the document that contained the phone number never used the word “phone.”&lt;/strong&gt; It was an email signature — just a name, a title, and some digits. The user’s query vocabulary and the document’s vocabulary had nothing in common. No amount of keyword search, embedding similarity, or query rewriting can bridge that gap. It’s the fundamental limit of lexical + semantic retrieval.&lt;/p&gt;
&lt;p&gt;This is where a human would instantly succeed. A person asked “find Alex’s phone number” would think: &lt;em&gt;“Phone numbers are in email signatures. Let me pull up an email from Alex and read the bottom.”&lt;/em&gt; That’s a two-step reasoning chain that has nothing to do with the words “phone number.”&lt;/p&gt;
&lt;p&gt;The AI doesn’t know this. Not because it’s dumb — because it can’t derive from first principles that phone numbers live in signatures. That’s human knowledge about how email works.&lt;/p&gt;
&lt;h3&gt;The instinct: wait for smarter models&lt;/h3&gt;
&lt;p&gt;We almost did this. “Once models get better at multi-hop reasoning, they’ll figure it out.”&lt;/p&gt;
&lt;p&gt;The problem: there will always be a next failure mode. Chasing the model capability curve is a losing race. And even if the next model reasoned through this specific case, it would fail on the next thing you didn’t anticipate.&lt;/p&gt;
&lt;h3&gt;The fix: encode the knowledge, keep the agent&lt;/h3&gt;
&lt;p&gt;We did two things. First, a simple tool — &lt;code&gt;get_communication_detail(id)&lt;/code&gt; — that fetches the full body of a specific communication. The compact format hides most details from the agent. This tool lets the agent drill down when the preview isn’t enough. Detail on demand.&lt;/p&gt;
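&lt;p&gt;A sketch of that tool’s surface (the storage call and the cap are illustrative; the point is a single drill-down call the agent can make when a preview isn’t enough):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def get_communication_detail(communication_id):
    &quot;&quot;&quot;Return the full body of one communication so the agent can read past the preview.&quot;&quot;&quot;
    record = message_store.fetch(communication_id)  # hypothetical data-access layer
    return record[&apos;full_text&apos;][:8000]              # bounded, so one call can&apos;t flood the context
&lt;/code&gt;&lt;/pre&gt;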
&lt;p&gt;Second — and more importantly — we defined the reasoning chain as a &lt;strong&gt;skill&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;---
name: find-contact-info
triggers:
  - phone number
  - contact info
  - how to reach
---

## Steps
1. Search for the person across all channels
2. Search for emails sent BY that person
3. Fetch full content of top result to read the signature
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A skill is a declarative definition of a proven workflow. The steps use the same tools the agent already has, but they’re sequenced in a way the agent couldn’t reliably derive on its own. When the user asks about contact info, the system can invoke this skill instead of hoping the agent reasons through it.&lt;/p&gt;
&lt;p&gt;The skill isn’t hardcoded logic — it’s encoded knowledge. The tools still do the searching. The LLM still fills in adaptive parameters at each step. But the &lt;em&gt;order of operations&lt;/em&gt; — the human insight that “to find contact info you should look at emails from the person” — lives in the skill definition.&lt;/p&gt;
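&lt;p&gt;Mechanically, “invoking a skill” can be very modest. A sketch of one way to do it (substring trigger matching; the interesting part is that the steps become instructions for the agent rather than code that executes):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;SKILLS = [
    {
        &apos;name&apos;: &apos;find-contact-info&apos;,
        &apos;triggers&apos;: [&apos;phone number&apos;, &apos;contact info&apos;, &apos;how to reach&apos;],
        &apos;steps&apos;: [
            &apos;Search for the person across all channels&apos;,
            &apos;Search for emails sent BY that person&apos;,
            &apos;Fetch full content of the top result and read the signature&apos;,
        ],
    },
]

def match_skill(query):
    &quot;&quot;&quot;Return the first skill whose trigger phrase appears in the query, if any.&quot;&quot;&quot;
    q = query.lower()
    return next((s for s in SKILLS if any(t in q for t in s[&apos;triggers&apos;])), None)

# If a skill matches, its steps are appended to the agent&apos;s instructions. The LLM
# still chooses tool arguments at each step; the order of operations is given to it.
&lt;/code&gt;&lt;/pre&gt;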
&lt;h3&gt;Why this matters more than it sounds&lt;/h3&gt;
&lt;p&gt;Skills are a bet on a different model of AI development. Instead of “AI that figures everything out,” it’s &lt;strong&gt;“AI that accumulates learned workflows.”&lt;/strong&gt; When the system notices a user manually performing a pattern — search a person, fetch an email, extract contact info — it can propose turning that pattern into a skill. The user approves. The next time anyone hits that problem, the skill handles it.&lt;/p&gt;
&lt;p&gt;This is the loop that makes the agent get better over time without needing a better model underneath. The model’s job is to handle the long tail — open-ended reasoning, novel situations, conversations. The skills handle the repeatable patterns that humans have already solved.&lt;/p&gt;
&lt;p&gt;This is also why we resisted implementing the contact lookup as a hardcoded Python tool. A tool is opaque and one-off. A skill is portable, discoverable, and composable. Future you can read the skill definition and know exactly what it does. Future users can propose new skills for patterns the system missed.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The two-part rule&lt;/h2&gt;
&lt;p&gt;Looking back, every improvement fit into one of two categories:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Subtractions from the pipeline:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dense-only search → hybrid search (delete the assumption that semantic is enough)&lt;/li&gt;
&lt;li&gt;Pre-search LLM calls → direct query (delete intelligence that was noise)&lt;/li&gt;
&lt;li&gt;Verbose tool output → compact format (delete metadata the LLM can’t use)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Additions of human knowledge:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Detail-on-demand tool (humans know “read the full thing” is sometimes the answer)&lt;/li&gt;
&lt;li&gt;Contact lookup skill (humans know signatures contain contact info)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The instinct at every turn was to do the opposite — add intelligence to the pipeline, hope the model figures out the chains. That instinct was wrong.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; first, delete everything in your pipeline that isn’t provably helping. Then, for the failures that remain, ask whether the AI is &lt;em&gt;missing knowledge a human has&lt;/em&gt;. If yes, encode it — as a tool, a skill, or structured data. Don’t wait for a smarter model.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;A practical checklist&lt;/h2&gt;
&lt;p&gt;If you’re building an agentic search or RAG system, check these in order:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Subtractions:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Do you have both keyword and semantic retrieval fused together? (If no, add it.)&lt;/li&gt;
&lt;li&gt;Are there LLM calls before retrieval that classify or rewrite the query? (Consider removing.)&lt;/li&gt;
&lt;li&gt;Is your tool output designed for human readability or LLM consumption? (Cut everything the LLM doesn’t need.)&lt;/li&gt;
&lt;li&gt;Does your agent have enough context budget to iterate 3-5 times without losing coherence? (If no, reduce per-result tokens.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Additions:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When your agent fails on a task, ask: is this a vocabulary mismatch or missing world knowledge? (Those need encoded knowledge, not smarter prompts.)&lt;/li&gt;
&lt;li&gt;Does your agent have a way to drill down from summaries to full content when needed? (Add a detail-fetch tool.)&lt;/li&gt;
&lt;li&gt;Can you extract multi-step patterns from user interactions and encode them as skills? (This is where the compounding returns are.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The instinct is always to make the AI smarter. The practice is to make the system around it smaller and more specific.&lt;/p&gt;
</content:encoded><category>ai</category><category>search</category><category>agents</category></item></channel></rss>