
RAG + Eval — A Retrieval-Augmented Pipeline With an Evaluation Gate
Zero-dependency, pure-Python RAG. A 40-question golden-set evaluation harness is wired in as a CI gate, so retrieval quality can't silently regress.
- Python (stdlib only)
- BM25 + Dense (RRF)
- Ollama
- CI eval gate
Problem
RAG demos are everywhere. The problem is that most of them have no evaluation. Retrieval quality can regress and nobody notices; “a plausible-sounding answer” and “a grounded answer” go undistinguished. In hiring and in practice alike, the biggest differentiator in 2026-era RAG/agent work isn’t the model — it’s evaluation discipline.
Approach
I built a small pipeline that is measured end to end: zero third-party dependencies (pure Python stdlib), and $0 generation cost via a local LLM.
- Hybrid retrieval. BM25 (sparse) and cosine (dense) are fused via Reciprocal Rank Fusion, then re-ranked. The embedder is pluggable, with two modes, `ollama` (real semantic) or `hash` (deterministic, offline), so CI runs without a GPU or network access.
- A 40-question golden set. 32 questions whose answers are in the corpus, plus 8 that have none (EN+KO). It also scores whether the system says “I don’t know” (refusal) when there’s no answer to give.
- CI gate. Four metrics (hit@k, answer keyword recall, refusal precision/recall, and citation accuracy) each carry their own threshold, and if any one falls short, the build fails with a non-zero exit code. Citation accuracy is measured by deterministic string matching, not an LLM judge, because the evaluation itself has to be reproducible.
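The fusion step in the first bullet can be sketched in a few lines of stdlib Python. This is a minimal illustration, not the repo's code: the function name `rrf_fuse`, the chunk ids, and the constant `k=60` (the value conventionally used for Reciprocal Rank Fusion) are all assumptions, and the re-ranking stage is left out.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) to a chunk's score, with rank
    starting at 1. k=60 is the conventional default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical top-3 lists from the sparse and dense retrievers.
bm25_ranking = ["c3", "c1", "c7"]
dense_ranking = ["c1", "c4", "c3"]

fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)
```

Because RRF uses only ranks, the BM25 and cosine scores never need to be put on a common scale, which is one reason it suits a small hybrid setup like this one.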
Results (measured — from repo fixtures, reproducible by command)
| Mode | Embedder | hit@3 | answer recall | refusal P / R | citation acc |
|---|---|---|---|---|---|
| CI (offline) | hash | 90.6% | — | 80% / 100% | 33.3% |
| Local (full) | nomic-embed-text | 84.4% | 75.0% | 66.7% / 100% | 48.4% |
Both modes pass the CI gate thresholds. The imperfect numbers are intentional: the offline proxy is calibrated so that the hash embedder and BM25 signal claim only what they can honestly measure.
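A gate of this kind reduces to comparing each metric against a floor and returning a non-zero exit code on any miss. The sketch below assumes hypothetical threshold values and metric names (the repo's actual thresholds are not stated here) and feeds in the offline row of the table above. Answer recall is omitted because that row does not report it.

```python
import sys

# Hypothetical floors for illustration only; not the repo's real thresholds.
THRESHOLDS = {
    "hit_at_3": 0.80,
    "refusal_precision": 0.60,
    "refusal_recall": 0.90,
    "citation_accuracy": 0.30,
}


def failing_metrics(metrics, thresholds):
    """Return the names of metrics that are missing or below their floor."""
    return sorted(
        name
        for name, minimum in thresholds.items()
        if metrics.get(name, float("-inf")) < minimum
    )


def gate(metrics, thresholds=THRESHOLDS):
    """Report every failing metric and return a process exit code."""
    failed = failing_metrics(metrics, thresholds)
    for name in failed:
        print(f"FAIL {name}: {metrics.get(name)} < {thresholds[name]}", file=sys.stderr)
    return 1 if failed else 0


# Offline (hash embedder) row of the results table.
offline = {
    "hit_at_3": 0.906,
    "refusal_precision": 0.80,
    "refusal_recall": 1.0,
    "citation_accuracy": 0.333,
}

exit_code = gate(offline)
# In a CI job this value would be passed to sys.exit(exit_code).
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a harness that silently stops reporting a metric should break the build too.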
Limitations and failure modes
- This is a demo at a 10-chunk corpus scale. Indexing, caching, and cost problems at a tens-of-thousands-of-chunks scale are not demonstrated by this repo.
- Citation accuracy is a proxy that reads low by construction, a consequence of the top-k structure (33.3% offline); the README spells out exactly how it’s computed and where its limits are.
- Refusal precision (66.7–80%) has room to improve. I prioritized the ability to refuse what should be refused (100% recall), and some over-refusal remains.
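To make the refusal trade-off concrete, here is a minimal sketch of how precision and recall could be computed if "refuse" is treated as the positive class. The function name and the per-question labels are assumptions. The counts are chosen only because they reproduce the offline row's 80% / 100% under this definition on a 32 + 8 golden set; they are not taken from the repo.

```python
def refusal_precision_recall(should_refuse, did_refuse):
    """Precision and recall with 'refuse' as the positive class."""
    pairs = list(zip(should_refuse, did_refuse))
    tp = sum(1 for s, d in pairs if s and d)        # correct refusals
    fp = sum(1 for s, d in pairs if not s and d)    # over-refusals
    fn = sum(1 for s, d in pairs if s and not d)    # missed refusals
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative labels: 32 answerable questions followed by 8 unanswerable ones.
should_refuse = [False] * 32 + [True] * 8
# Assumed behaviour: 2 over-refusals, every unanswerable question refused.
did_refuse = [True] * 2 + [False] * 30 + [True] * 8

precision, recall = refusal_precision_recall(should_refuse, did_refuse)
print(f"refusal precision={precision:.1%} recall={recall:.1%}")
```

Under this definition, over-refusal shows up only in precision, which is why recall can sit at 100% while precision still has room to improve.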
What I learned
“RAG without evaluation” is a feature that never shipped. The moment evaluation goes into CI, retrieval quality stops being a matter of feel and becomes code with regression tests. The next step is porting the same golden set to a pgvector-based live demo, keeping the same gate in a cloud environment.