AI Systems · RAG · Production

I Deployed a RAG Chatbot to 20,000 Users. Here's What Actually Broke.

The retrieval worked fine. The chunking strategy, embedding quality, and edge cases — that's where most teams fail. A production postmortem from 11 months in the field.

15 March 2026 · 12 min read · By Kush Kaveh

Most RAG tutorials end at "it works in the demo." This is what happens after.

For the past 11 months, I've been running a RAG-powered financial assistant for a live fintech platform with 20,000+ active users. Here's what actually broke — and what nobody writing RAG tutorials has to deal with because they've never shipped anything.


What We Built

The system: a retrieval-augmented generation assistant giving traders real-time access to market data, platform documentation, and compliance information. Production environment, live financial context, real regulatory stakes.

The tech: embeddings, vector database, retrieval pipeline, LLM inference, UX layer. Standard RAG architecture. Deployed on a system people use to make financial decisions.


What Actually Broke

1. Chunking strategy was the first real failure

Every tutorial chunks documents at 512 tokens and calls it done. In financial content, this destroys numerical context. A table that spans 600 tokens gets split in ways that make the numbers meaningless in isolation.

What we did: Custom chunking logic that treats tables as atomic units, preserves section hierarchy metadata, and chunks by semantic boundary rather than token count. More work. Dramatically better retrieval quality.
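The approach above can be sketched in miniature. This is an illustrative reconstruction, not the production code: it walks a markdown document, treats any run of `|`-prefixed table rows as one atomic chunk, tracks the heading hierarchy as metadata, and only splits prose when it exceeds a rough token budget. The word-count token estimate is a deliberate simplification.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    section_path: list  # preserved heading hierarchy, e.g. ["Fees", "Margin rates"]

def chunk_document(markdown_text: str, max_tokens: int = 512) -> list:
    """Split on semantic boundaries (headings), keeping tables atomic."""
    chunks, section_path, buffer = [], [], []

    def flush():
        if buffer:
            chunks.append(Chunk("\n".join(buffer), list(section_path)))
            buffer.clear()

    lines = markdown_text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        heading = re.match(r"^(#+)\s+(.*)", line)
        if heading:
            flush()
            depth = len(heading.group(1))
            section_path[:] = section_path[:depth - 1] + [heading.group(2)]
        elif line.lstrip().startswith("|"):
            # Table row: absorb the whole table as one atomic unit,
            # even if it exceeds max_tokens.
            flush()
            table = []
            while i < len(lines) and lines[i].lstrip().startswith("|"):
                table.append(lines[i])
                i += 1
            chunks.append(Chunk("\n".join(table), list(section_path)))
            continue
        else:
            buffer.append(line)
            # crude token estimate: whitespace-split word count
            if sum(len(b.split()) for b in buffer) > max_tokens:
                flush()
        i += 1
    flush()
    return chunks
```

A table spanning 600 tokens survives as a single chunk here, which is the whole point: the numbers stay next to their headers.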

2. Embedding drift

Embeddings are not a set-and-forget choice. The financial domain has specialized vocabulary that generic embedding models handle poorly: "margin" means something different in trading than in typography.

What we did: Evaluated domain-specific embeddings, implemented query expansion for financial terminology, added metadata filtering to constrain retrieval to relevant document types before semantic search.
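Two of those mitigations are easy to show in outline. The synonym table and intent-to-document-type mapping below are hypothetical stand-ins for a curated domain glossary; the shape of the logic is what matters: expand the query with domain phrasings, and narrow the candidate set by metadata before semantic search runs.

```python
# Hypothetical glossary; a real system maintains a curated financial term map.
FINANCIAL_SYNONYMS = {
    "margin": ["margin requirement", "maintenance margin"],
    "spread": ["bid-ask spread"],
}

# Hypothetical mapping from query intent to allowed document types.
DOC_TYPES_BY_INTENT = {
    "compliance": {"regulatory", "policy"},
    "pricing": {"fee_schedule", "market_data"},
}

def expand_query(query: str) -> list:
    """Return the original query plus domain-specific expansions."""
    expansions = [query]
    for term, synonyms in FINANCIAL_SYNONYMS.items():
        if term in query.lower():
            expansions.extend(f"{query} ({s})" for s in synonyms)
    return expansions

def metadata_filter(intent: str):
    """Build a predicate constraining retrieval to relevant document types."""
    allowed = DOC_TYPES_BY_INTENT.get(intent)
    return (lambda doc: doc.get("type") in allowed) if allowed else (lambda doc: True)
```

Running the filter before vector search is cheap and prunes whole categories of bad matches that no amount of embedding quality would have excluded.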

3. Hallucination patterns you can't test for in development

In development, you test with questions you thought of. In production, users ask questions you never imagined. The failure modes that emerged:

  • Confident wrong answers on edge cases at the boundary of the training data
  • Temporal confusion — the model treating historical data as current
  • Aggregation errors — users asking "what's my total?" over data the system couldn't actually aggregate

What we did: Source citation on every response ("here's where this came from"), explicit uncertainty signaling ("I found this from March 2025 — verify current rates"), and hard rules that escalate to human support for specific query types.
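A minimal sketch of that response policy, with made-up escalation phrases and a 90-day staleness threshold standing in for the real rules: every answer carries its citations, stale sources trigger an explicit caveat, and certain query patterns bypass the model entirely.

```python
from datetime import date

# Hypothetical hard rules; real escalation triggers are domain-specific.
ESCALATION_PATTERNS = ["close my account", "dispute", "tax advice"]
STALE_AFTER_DAYS = 90  # assumed freshness window

def build_response(answer, sources, query, today=None):
    """Attach citations, flag stale sources, escalate restricted queries."""
    today = today or date.today()
    if any(p in query.lower() for p in ESCALATION_PATTERNS):
        return {"escalate": True, "message": "Connecting you to human support."}
    response = {
        "escalate": False,
        "answer": answer,
        "citations": [s["title"] for s in sources],  # "here's where this came from"
    }
    stale = [s for s in sources if (today - s["date"]).days > STALE_AFTER_DAYS]
    if stale:
        oldest = min(s["date"] for s in stale)
        response["caveat"] = f"I found this from {oldest:%B %Y}. Verify current rates."
    return response
```

The key property: the caveat and the escalation are computed deterministically outside the model, so they cannot be hallucinated away.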

4. The latency problem nobody talks about

Users don't tolerate 4-second response times. In development, you accept it. In production, latency becomes a product problem.

What we did: Streaming responses (show the first token immediately), retrieval caching for common queries, parallel retrieval across document shards, and aggressive query preprocessing to eliminate retrievals that clearly won't find relevant content.
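The caching piece is the simplest to illustrate. A sketch of an LRU retrieval cache keyed on a normalized query, so "What is margin?" and "  what is   MARGIN? " hit the same entry; sizes and normalization here are assumptions, not the production values.

```python
import hashlib
from collections import OrderedDict

class RetrievalCache:
    """LRU cache for retrieval results, keyed on a normalized query string."""

    def __init__(self, max_entries=1000):
        self._store = OrderedDict()
        self.max_entries = max_entries

    @staticmethod
    def _key(query):
        # Collapse whitespace and case so trivial variants share one entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query):
        key = self._key(query)
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        return None

    def put(self, query, results):
        key = self._key(query)
        self._store[key] = results
        self._store.move_to_end(key)
        while len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
```

For the common-query head of the traffic distribution, a hit here skips the vector search entirely, which is where most of the latency budget goes.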

5. The trust design problem

This is the one no ML engineer on your team will think about.

A technically correct answer that users don't trust is worthless. Financial users, in particular, are trained to be skeptical. Every design decision about how the AI presents information is a trust decision.

What we did: Source citations. Response confidence indicators. "I don't know" as a first-class response (not a failure state, but a deliberate, trustworthy answer). Human escalation as a visible, always-accessible option.
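One way to make "I don't know" a first-class response is to gate presentation on retrieval confidence. The threshold values below are illustrative placeholders, not the tuned production numbers:

```python
CONFIDENCE_FLOOR = 0.55  # hypothetical threshold, tuned per domain in practice

def present(answer, retrieval_score, sources):
    """Treat 'I don't know' as a deliberate answer, not a failure state."""
    if retrieval_score < CONFIDENCE_FLOOR or not sources:
        return {
            "kind": "dont_know",
            "message": "I don't have a reliable answer for that.",
            "offer_human": True,  # escalation stays visible, not buried
        }
    return {
        "kind": "answer",
        "message": answer,
        "confidence": "high" if retrieval_score > 0.8 else "moderate",
        "sources": sources,
        "offer_human": True,
    }
```

Note that `offer_human` is set on both branches: the escalation path is always accessible, not a consolation prize for failure.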


The Numbers After 11 Months

  • Production uptime: 99.2%
  • User adoption: 20,000+ active users
  • Escalation rate to human support: <8% (target was <12%)
  • User trust signal (follow-up query rate): users average 4.2 queries per session, a strong indicator that first answers earn a second question

What I'd Tell Teams Starting RAG Today

  1. Invest in chunking before embeddings. Better chunks → better retrieval. No amount of embedding optimization fixes bad chunks.

  2. Design for failure before you design for success. Your "I don't know" response UX matters more than your answer UX, because users hit the failure state more often than you expect.

  3. Latency is a product decision, not an engineering problem. Decide what latency is acceptable to your users before you design the system.

  4. Source citation is not optional. In any domain where accuracy matters, users need to know where the answer came from. Design this in from day one.

  5. Your production failure modes are different from your test failure modes. Ship small, observe, iterate. The interesting problems don't appear until real users ask real questions.


This post is part of an ongoing series on production AI implementation. Questions or war stories — reach out.
