AI Systems · RAG · Production

I Deployed a RAG Chatbot to 20,000 Users. Here's What Actually Broke.

The retrieval worked fine. The chunking strategy, embedding quality, and edge cases — that's where most teams fail. A production postmortem from 11 months in the field.

15 March 2026 · 12 min read · By Kush Kaveh

Most RAG tutorials end at "it works in the demo." This is what happens after.

For the past 11 months, I've been running a RAG-powered financial assistant for a live fintech platform with 20,000+ active users. Here's what actually broke — and what nobody writing RAG tutorials has to deal with because they've never shipped anything.


What We Built

The system: a retrieval-augmented generation assistant giving traders real-time access to market data, platform documentation, and compliance information. Production environment, live financial context, real regulatory stakes.

The tech: embeddings, vector database, retrieval pipeline, LLM inference, UX layer. Standard RAG architecture. Deployed on a system people use to make financial decisions.


What Actually Broke

1. Chunking strategy was the first real failure

Every tutorial chunks documents at 512 tokens and calls it done. In financial content, this destroys numerical context. A table that spans 600 tokens gets split in ways that make the numbers meaningless in isolation.

What we did: Custom chunking logic that treats tables as atomic units, preserves section hierarchy metadata, and chunks by semantic boundary rather than token count. More work. Dramatically better retrieval quality.
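The approach above can be sketched in miniature. This is an illustrative reconstruction, not the production code: it walks a markdown document, treats any run of `|`-prefixed table rows as one atomic chunk, tracks the heading hierarchy as metadata, and only splits prose when it exceeds a rough token budget. The word-count token estimate is a deliberate simplification.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    section_path: list  # preserved heading hierarchy, e.g. ["Fees", "Margin rates"]

def chunk_document(markdown_text: str, max_tokens: int = 512) -> list:
    """Split on semantic boundaries (headings), keeping tables atomic."""
    chunks, section_path, buffer = [], [], []

    def flush():
        if buffer:
            chunks.append(Chunk("\n".join(buffer), list(section_path)))
            buffer.clear()

    lines = markdown_text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        heading = re.match(r"^(#+)\s+(.*)", line)
        if heading:
            flush()
            depth = len(heading.group(1))
            section_path[:] = section_path[:depth - 1] + [heading.group(2)]
        elif line.lstrip().startswith("|"):
            # Table row: absorb the whole table as one atomic unit,
            # even if it exceeds max_tokens.
            flush()
            table = []
            while i < len(lines) and lines[i].lstrip().startswith("|"):
                table.append(lines[i])
                i += 1
            chunks.append(Chunk("\n".join(table), list(section_path)))
            continue
        else:
            buffer.append(line)
            # crude token estimate: whitespace-split word count
            if sum(len(b.split()) for b in buffer) > max_tokens:
                flush()
        i += 1
    flush()
    return chunks
```

A table spanning 600 tokens survives as a single chunk here, which is the whole point: the numbers stay next to their headers.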

2. Embedding drift

Embeddings are not a set-and-forget choice. The financial domain has specialized vocabulary that generic embedding models handle poorly: "margin" means something different in trading than in typography.

What we did: Evaluated domain-specific embeddings, implemented query expansion for financial terminology, added metadata filtering to constrain retrieval to relevant document types before semantic search.
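Two of those mitigations are easy to show in outline. The synonym table and intent-to-document-type mapping below are hypothetical stand-ins for a curated domain glossary; the shape of the logic is what matters: expand the query with domain phrasings, and narrow the candidate set by metadata before semantic search runs.

```python
# Hypothetical glossary; a real system maintains a curated financial term map.
FINANCIAL_SYNONYMS = {
    "margin": ["margin requirement", "maintenance margin"],
    "spread": ["bid-ask spread"],
}

# Hypothetical mapping from query intent to allowed document types.
DOC_TYPES_BY_INTENT = {
    "compliance": {"regulatory", "policy"},
    "pricing": {"fee_schedule", "market_data"},
}

def expand_query(query: str) -> list:
    """Return the original query plus domain-specific expansions."""
    expansions = [query]
    for term, synonyms in FINANCIAL_SYNONYMS.items():
        if term in query.lower():
            expansions.extend(f"{query} ({s})" for s in synonyms)
    return expansions

def metadata_filter(intent: str):
    """Build a predicate constraining retrieval to relevant document types."""
    allowed = DOC_TYPES_BY_INTENT.get(intent)
    return (lambda doc: doc.get("type") in allowed) if allowed else (lambda doc: True)
```

Running the filter before vector search is cheap and prunes whole categories of bad matches that no amount of embedding quality would have excluded.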

3. Hallucination patterns you can't test for in development

In development, you test with questions you thought of. In production, users ask questions you never imagined. The failure modes that emerged:

  • Confident wrong answers on edge cases at the boundary of the training data
  • Temporal confusion — the model treating historical data as current
  • Aggregation errors — users asking "what's my total?" over data the system couldn't actually aggregate

What we did: Source citation on every response ("here's where this came from"), explicit uncertainty signaling ("I found this from March 2025 — verify current rates"), and hard rules that escalate to human support for specific query types.
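A minimal sketch of that response policy, with made-up escalation phrases and a 90-day staleness threshold standing in for the real rules: every answer carries its citations, stale sources trigger an explicit caveat, and certain query patterns bypass the model entirely.

```python
from datetime import date

# Hypothetical hard rules; real escalation triggers are domain-specific.
ESCALATION_PATTERNS = ["close my account", "dispute", "tax advice"]
STALE_AFTER_DAYS = 90  # assumed freshness window

def build_response(answer, sources, query, today=None):
    """Attach citations, flag stale sources, escalate restricted queries."""
    today = today or date.today()
    if any(p in query.lower() for p in ESCALATION_PATTERNS):
        return {"escalate": True, "message": "Connecting you to human support."}
    response = {
        "escalate": False,
        "answer": answer,
        "citations": [s["title"] for s in sources],  # "here's where this came from"
    }
    stale = [s for s in sources if (today - s["date"]).days > STALE_AFTER_DAYS]
    if stale:
        oldest = min(s["date"] for s in stale)
        response["caveat"] = f"I found this from {oldest:%B %Y}. Verify current rates."
    return response
```

The key property: the caveat and the escalation are computed deterministically outside the model, so they cannot be hallucinated away.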

4. The latency problem nobody talks about

Users don't tolerate 4-second response times. In development, you accept it. In production, latency becomes a product problem.

What we did: Streaming responses (show the first token immediately), retrieval caching for common queries, parallel retrieval across document shards, and aggressive query preprocessing to eliminate retrievals that clearly won't find relevant content.
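The caching piece is the simplest to illustrate. A sketch of an LRU retrieval cache keyed on a normalized query, so "What is margin?" and "  what is   MARGIN? " hit the same entry; sizes and normalization here are assumptions, not the production values.

```python
import hashlib
from collections import OrderedDict

class RetrievalCache:
    """LRU cache for retrieval results, keyed on a normalized query string."""

    def __init__(self, max_entries=1000):
        self._store = OrderedDict()
        self.max_entries = max_entries

    @staticmethod
    def _key(query):
        # Collapse whitespace and case so trivial variants share one entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query):
        key = self._key(query)
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        return None

    def put(self, query, results):
        key = self._key(query)
        self._store[key] = results
        self._store.move_to_end(key)
        while len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
```

For the common-query head of the traffic distribution, a hit here skips the vector search entirely, which is where most of the latency budget goes.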

5. The trust design problem

This is the one no ML engineer on your team will think about.

A technically correct answer that users don't trust is worthless. Financial users, in particular, are trained to be skeptical. Every design decision about how the AI presents information is a trust decision.

What we did: Source citations. Response confidence indicators. "I don't know" as a first-class response (not a failure state, but a deliberate, trustworthy answer). Human escalation as a visible, always-accessible option.
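One way to make "I don't know" a first-class response is to gate presentation on retrieval confidence. The threshold values below are illustrative placeholders, not the tuned production numbers:

```python
CONFIDENCE_FLOOR = 0.55  # hypothetical threshold, tuned per domain in practice

def present(answer, retrieval_score, sources):
    """Treat 'I don't know' as a deliberate answer, not a failure state."""
    if retrieval_score < CONFIDENCE_FLOOR or not sources:
        return {
            "kind": "dont_know",
            "message": "I don't have a reliable answer for that.",
            "offer_human": True,  # escalation stays visible, not buried
        }
    return {
        "kind": "answer",
        "message": answer,
        "confidence": "high" if retrieval_score > 0.8 else "moderate",
        "sources": sources,
        "offer_human": True,
    }
```

Note that `offer_human` is set on both branches: the escalation path is always accessible, not a consolation prize for failure.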


The Numbers After 11 Months

  • Production uptime: 99.2%
  • User adoption: 20,000+ active users
  • Escalation rate to human support: <8% (target was <12%)
  • User trust signal (follow-up query rate): users average 4.2 queries per session, a strong indicator that first answers earn a second question

What I'd Tell Teams Starting RAG Today

  1. Invest in chunking before embeddings. Better chunks → better retrieval. No amount of embedding optimization fixes bad chunks.

  2. Design for failure before you design for success. Your "I don't know" response UX matters more than your answer UX, because users hit the failure state more often than you expect.

  3. Latency is a product decision, not an engineering problem. Decide what latency is acceptable to your users before you design the system.

  4. Source citation is not optional. In any domain where accuracy matters, users need to know where the answer came from. Design this in from day one.

  5. Your production failure modes are different from your test failure modes. Ship small, observe, iterate. The interesting problems don't appear until real users ask real questions.


This post is part of an ongoing series on production AI implementation. Questions or war stories — reach out.
