RAG + Eval — A Retrieval-Augmented Pipeline With an Evaluation Gate

Problem

RAG demos are everywhere, but most of them ship without any evaluation. Retrieval quality can regress without anyone noticing, and a plausible-sounding answer is never distinguished from a grounded one. In hiring and in practice alike, the biggest differentiator in 2026-era RAG/agent work is not the model but evaluation discipline.

Approach

I built a small pipeline that is measured end to end. It has zero third-party dependencies (pure Python stdlib) and $0 generation cost, because generation runs on a local LLM.

Hybrid retrieval. BM25 (sparse) and cosine-similarity (dense) rankings are fused via Reciprocal Rank Fusion (RRF), then re-ranked. The embedder is pluggable, with two modes: ollama (real semantic embeddings) or hash (deterministic, offline), so CI runs without a GPU or network access.
A 40-question golden set. 32 questions whose answers are in the corpus, plus 8 that have no answer in the corpus (EN+KO). The set also scores whether the system says “I don’t know” (a refusal) when there is no answer to give.
CI gate. Four metrics (hit@k, answer keyword recall, refusal precision/recall, and citation accuracy) each carry a threshold, and if any one falls short, the build fails (exit ≠ 0). Citation accuracy is measured by deterministic string matching, not an LLM judge, because the evaluation itself has to be reproducible.
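
As an illustration of the fusion step, here is a minimal sketch of Reciprocal Rank Fusion in pure Python. It is not taken from the repo: the function name, the chunk ids, and the constant k=60 (a commonly used default) are assumptions made for the example. Each chunk scores the sum of 1 / (k + rank) over every ranking it appears in.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into a single ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. k=60 is an assumed default, not the repo's value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical chunk ids, for illustration only.
bm25_ranking = ["c3", "c1", "c7"]   # sparse ranking
dense_ranking = ["c1", "c4", "c3"]  # dense (cosine) ranking

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['c1', 'c3', 'c4', 'c7']
```

Because RRF uses only ranks, the BM25 and cosine scores never need to be put on a common scale, which is one reason it suits a hybrid setup like this one.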

Results (measured from repo fixtures, reproducible by command)

Mode	Embedder	hit@3	answer recall	refusal P / R	citation acc
CI (offline)	hash	90.6%	—	80% / 100%	33.3%
Local (full)	nomic-embed-text	84.4%	75.0%	66.7% / 100%	48.4%

Both modes pass the CI gate thresholds. The imperfect numbers are intentional: the offline proxy is calibrated so that the hash embedder and the BM25 signal are scored only on what they can honestly measure.
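
The gate logic can be sketched in a few lines. The metric values below are the offline (hash) row of the table above. The threshold values, the metric key names, and the `gate` function are hypothetical placeholders for illustration, since this write-up does not state the repo's actual thresholds.

```python
# Hypothetical thresholds for illustration only; the repo's real values
# are not given in this write-up.
THRESHOLDS = {
    "hit_at_3": 0.80,
    "refusal_precision": 0.60,
    "refusal_recall": 0.95,
    "citation_accuracy": 0.30,
}


def gate(metrics, thresholds):
    """Return (name, value, minimum) for every metric that falls short.

    A metric that is missing from `metrics` counts as a failure.
    """
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            failures.append((name, value, minimum))
    return failures


# Offline (hash embedder) row of the results table.
offline_metrics = {
    "hit_at_3": 0.906,
    "refusal_precision": 0.80,
    "refusal_recall": 1.00,
    "citation_accuracy": 0.333,
}

failures = gate(offline_metrics, THRESHOLDS)
for name, value, minimum in failures:
    print(f"FAIL {name}: {value} < {minimum}")

# A CI entry point would pass this to sys.exit() so the build fails on non-zero.
exit_code = 1 if failures else 0
print("exit code:", exit_code)
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a gate that silently skips an unreported metric would let a broken evaluation run pass.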

Limitations and failure modes

This is a demo on a 10-chunk corpus. The repo does not demonstrate the indexing, caching, and cost problems that appear at tens of thousands of chunks.
Citation accuracy is a proxy that reads low by construction, because of the top-k structure (33.3% offline, 48.4% local); the README spells out exactly how it is computed and where its limits are.
Refusal precision (66.7% to 80%) has room to improve. I prioritized refusing everything that should be refused (100% recall), so some over-refusal remains.
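
To make the precision/recall trade-off concrete, here is a small sketch of how refusal precision and recall can be computed from per-question outcomes. The function name and the per-question outcomes are assumptions: the example uses one outcome consistent with the offline row (80% / 100%), in which all 8 unanswerable questions are refused and 2 answerable ones are refused by mistake. The actual per-question results are not given in this write-up.

```python
def refusal_precision_recall(refused, should_refuse):
    """Compute refusal precision and recall.

    refused / should_refuse are parallel lists of booleans, one per question.
    Precision: of the questions the system refused, how many should be refused.
    Recall: of the questions that should be refused, how many were refused.
    """
    pairs = list(zip(refused, should_refuse))
    tp = sum(1 for r, s in pairs if r and s)      # correct refusals
    fp = sum(1 for r, s in pairs if r and not s)  # over-refusals
    fn = sum(1 for r, s in pairs if not r and s)  # missed refusals
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Assumed outcome on a 40-question set (8 unanswerable, 32 answerable):
# every unanswerable question is refused, and 2 answerable ones are too.
should_refuse = [True] * 8 + [False] * 32
refused = [True] * 8 + [True] * 2 + [False] * 30

precision, recall = refusal_precision_recall(refused, should_refuse)
print(f"refusal P / R: {precision:.1%} / {recall:.1%}")
```

With recall pinned at 100%, every point of lost precision is an answerable question that was refused, so the over-refusal is directly visible in this one number.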

What I learned

“RAG without evaluation” is a feature that never shipped. The moment evaluation goes into CI, retrieval quality stops being a matter of feel and becomes code with regression tests. The next step is porting the same golden set to a pgvector-based live demo, keeping the same gate in a cloud environment.