2025 experiment

RAG Failure Maps

A retrieval-first exploration of where RAG systems break, from weak chunk boundaries and noisy evidence to grounding failures and misleading answer confidence.

  • RAG
  • Retrieval
  • Grounding
  • Failure Analysis

RAG Failure Maps started from a practical observation: many retrieval systems look strong when the query is easy and the evidence is obvious, then become unreliable as soon as the documents get noisy or the information is spread across multiple passages.

The goal of this work was to make those failures visible instead of treating them as vague model weakness. I focused on where the system was actually breaking: poor chunk boundaries, irrelevant top results, partial evidence that looked plausible but was not sufficient, and answers that sounded grounded without being well supported.
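The failure modes above can be made concrete with a small triage routine. This is a minimal sketch, not the project's actual tooling: the `FailureMode` labels and the `diagnose` helper are hypothetical names, and simple substring containment stands in for the real entailment or support checks a production system would need.

```python
from enum import Enum, auto

# Hypothetical labels for the failure modes described above.
class FailureMode(Enum):
    IRRELEVANT_TOP = auto()     # top-ranked chunks don't contain any required fact
    PARTIAL_EVIDENCE = auto()   # some, but not all, required facts were retrieved
    UNGROUNDED_ANSWER = auto()  # evidence retrieved, but answer not supported by it
    NONE = auto()               # no failure detected by these heuristics

def diagnose(required_facts: list[str], chunks: list[str], answer: str) -> FailureMode:
    """Cheap heuristic triage: substring containment approximates the
    'is this claim supported by this chunk?' check."""
    found = [f for f in required_facts
             if any(f.lower() in c.lower() for c in chunks)]
    if not found:
        return FailureMode.IRRELEVANT_TOP
    if len(found) < len(required_facts):
        return FailureMode.PARTIAL_EVIDENCE
    # All facts retrieved; is the answer itself grounded in any chunk?
    if not any(answer.lower() in c.lower() for c in chunks):
        return FailureMode.UNGROUNDED_ANSWER
    return FailureMode.NONE
```

Even this crude version separates "the evidence never arrived" from "the evidence arrived but the answer ignored it", which are very different problems to fix.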

A useful RAG workflow needs more than retrieval plus generation. It needs a way to inspect the path from query to evidence to answer. That means separating parsing problems from chunking problems, chunking problems from ranking problems, and ranking problems from generation behavior.
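One way to make that path inspectable is to record each stage's output and attribute a miss to the earliest stage where the gold evidence disappears. The sketch below is an illustration under assumptions: `RagTrace` and `localize_failure` are hypothetical names, and substring matching stands in for a proper evidence-matching check.

```python
from dataclasses import dataclass, field

# Hypothetical per-query trace: one snapshot of each pipeline stage,
# so a failure can be attributed to parsing, chunking, ranking, or generation.
@dataclass
class RagTrace:
    query: str
    parsed_text: str = ""                                # document text after parsing
    chunks: list[str] = field(default_factory=list)      # output of chunking
    ranked_ids: list[int] = field(default_factory=list)  # indices of retrieved chunks
    answer: str = ""                                     # generated answer

    def retrieved(self) -> list[str]:
        return [self.chunks[i] for i in self.ranked_ids]

def localize_failure(trace: RagTrace, gold_span: str) -> str:
    """Blame the earliest stage where the gold evidence span is lost."""
    g = gold_span.lower()
    if g not in trace.parsed_text.lower():
        return "parsing"      # evidence lost before chunking ever ran
    if not any(g in c.lower() for c in trace.chunks):
        return "chunking"     # evidence split across chunk boundaries
    if not any(g in c.lower() for c in trace.retrieved()):
        return "ranking"      # a chunk holds it, but it wasn't retrieved
    if g not in trace.answer.lower():
        return "generation"   # evidence retrieved but not used in the answer
    return "ok"
```

The ordering matters: a chunking bug will also look like a ranking bug and a generation bug downstream, so attribution has to start at the earliest stage and stop at the first loss.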

This project helped clarify a principle I use repeatedly in AI systems: if failure modes are not explicit, improvement becomes guesswork. A good retrieval system is not the one that produces the most impressive demo. It is the one whose weaknesses can be identified, explained, and improved with discipline.