Eval-Driven RAG for Technical Documents

Built an evaluation-first RAG system for dense technical documents, using a labeled benchmark to separate retrieval quality from grounded answer quality before optimizing generation.

  • RAG
  • Evaluation
  • Retrieval
  • Grounding
  • AI Systems

Eval-Driven RAG for Technical Documents is a flagship build around one engineering question: when a RAG system's answers look good, can you show whether retrieval was actually strong enough to deserve the credit?

The standalone repository is public here: rag-eval-system

What It Is

This is an evaluation-first RAG system for dense technical documents. I used a small but adversarial corpus and a labeled benchmark to test retrieval quality before treating answer generation as the main problem.

That matters because technical documents create exactly the kinds of failure cases that make weak RAG systems look better than they are: repeated terminology, section-sensitive evidence, multi-span answers, and queries that should trigger abstention instead of confident synthesis.

Why It Matters

Most RAG demos optimize for output quality first. I wanted a system that could show, with evidence, where retrieval succeeds, where it fails, and when grounded generation is still the bottleneck.

That makes the project useful as a proof artifact, not just a demo.

System Design

The system is intentionally lightweight and inspectable.

  1. raw technical documents are normalized and chunked deterministically
  2. retrieval baselines run against the same chunk artifact
  3. grounded generation sits on top of the strongest retrieval path
  4. evaluation scores retrieval and answer quality separately

That separation is the point. If the answer is weak, I want to know whether the problem is evidence retrieval, evidence coverage, or answer selection.
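That separation can be made concrete with two independent metrics: one scored against labeled gold evidence chunks, one scored against the gold answer text. The sketch below assumes a benchmark item carries gold chunk IDs and a gold answer string; that shape is an assumption for illustration, not the repo's actual schema.

```python
# Sketch: score retrieval and answer quality with separate metrics.
# The gold_ids / gold answer shapes are assumed, not the repo's schema.

def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int = 5) -> float:
    """Fraction of gold evidence chunks found in the top-k retrieved chunks."""
    if not gold_ids:
        return 0.0
    hits = gold_ids & set(retrieved_ids[:k])
    return len(hits) / len(gold_ids)

def token_f1(predicted: str, gold: str) -> float:
    """Token-overlap F1 between a generated answer and the gold answer."""
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    common = 0
    gold_pool = list(gold_tokens)  # consume matches so repeats count once
    for tok in pred_tokens:
        if tok in gold_pool:
            gold_pool.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

With both numbers per query, a weak answer can be attributed: low recall@k points at retrieval, high recall@k with low answer F1 points at answer selection.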

Retrieval Baselines

The current benchmark compares three retrieval methods:

  • keyword overlap as a simple deterministic floor
  • BM25 as the strongest sparse baseline
  • TF-IDF/cosine as a lightweight semantic baseline
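Two of these baselines are simple enough to sketch in pure Python. The version below is illustrative only: the repo's chunking, tokenization, and TF-IDF weighting details may differ.

```python
# Sketch of two baselines: keyword overlap (deterministic floor) and
# TF-IDF/cosine ranking. Tokenization here is naive whitespace splitting,
# which is an assumption; the real system may normalize more aggressively.
import math
from collections import Counter

def keyword_overlap(query: str, chunk: str) -> float:
    """Deterministic floor: fraction of query terms present in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def tfidf_cosine_rank(query: str, chunks: list[str]) -> list[int]:
    """Rank chunk indices by cosine similarity of TF-IDF vectors."""
    docs = [Counter(c.lower().split()) for c in chunks]
    n = len(chunks)
    df = Counter()
    for d in docs:
        df.update(d.keys())
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}

    def vec(counts: Counter) -> dict[str, float]:
        return {t: tf * idf.get(t, 0.0) for t, tf in counts.items()}

    def cosine(a: dict[str, float], b: dict[str, float]) -> float:
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    qv = vec(Counter(query.lower().split()))
    scores = [cosine(qv, vec(d)) for d in docs]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

The keyword-overlap floor is useful precisely because it is unfit for terminology-heavy corpora: any baseline that fails to beat it clearly is not pulling its weight.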

The benchmark story is already useful: BM25 currently performs best on this dense technical corpus, while the lightweight semantic baseline loses precision on terminology-heavy and section-sensitive cases.

Generation Findings

The first published generation benchmark uses a deterministic grounded generator. That generator only uses retrieved chunks, which makes answer behavior easy to inspect.

The key finding is honest and important:

  • deterministic generation is grounded
  • grounded does not mean correct
  • the current bottleneck is answer quality, not hallucination

An LLM-backed path also exists in the standalone repo, but it is not part of the current published benchmark results.

What This Proves

This project shows how I approach AI systems work when reliability matters:

  • evaluation is part of system design, not a reporting step
  • retrieval quality and generation quality should be separated
  • engineering tradeoffs should be measurable, not implied
  • public project artifacts should include failure analysis, not just polished outputs

That is the signal I wanted this flagship to carry.

Limitations

This is a strong first public release, not a finished platform.

  • the corpus is small and intentionally adversarial
  • chunking is simple and deterministic
  • the semantic baseline is lightweight rather than neural
  • the published generation result is the deterministic control, not an LLM benchmark

Those constraints are useful because they make the next engineering steps clear instead of hiding them behind premature complexity.