
RAG + Eval — A Retrieval-Augmented Pipeline With an Evaluation Gate

Zero-dependency, pure-Python RAG. A 40-question golden-set evaluation harness is wired in as a CI gate, so retrieval quality can't silently regress.

2026 · Design, implementation, and eval set, solo

  • Python (stdlib only)
  • BM25 + Dense (RRF)
  • Ollama
  • CI eval gate
90.6% (measured): hit@3 on the 32-question golden set (CI offline mode)
100% (measured): out-of-scope refusal recall
40 (measured): golden eval set size (32 in-scope + 8 out-of-scope, EN+KO)

Problem

RAG demos are everywhere. The problem is that most of them have no evaluation. Retrieval quality can regress and nobody notices; “a plausible-sounding answer” and “a grounded answer” go undistinguished. In hiring and in practice alike, the biggest differentiator in 2026-era RAG/agent work isn’t the model — it’s evaluation discipline.

Approach

I built a small pipeline that is measured end to end, with zero third-party dependencies (pure Python stdlib) and $0 generation cost via a local LLM.

  1. Hybrid retrieval. BM25 (sparse) and cosine (dense) are fused via Reciprocal Rank Fusion, then re-ranked. The embedder is pluggable — two modes, ollama (real semantic) or hash (deterministic, offline) — so CI runs without a GPU or network access.
  2. A 40-question golden set. 32 questions whose answers are in the corpus, plus 8 whose answers are not, in English and Korean (EN+KO). The harness also scores whether the system refuses (“I don’t know”) when the corpus has no answer to give.
  3. CI gate. Four metrics — hit@k, answer keyword recall, refusal precision/recall, and citation accuracy — each carry a threshold, and if any one falls short, the build fails (exit ≠ 0). Citation accuracy is measured by deterministic string matching, not an LLM judge — because the evaluation itself has to be reproducible.
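As a rough illustration of the retrieval step, the sketch below shows one way a deterministic hash embedder and Reciprocal Rank Fusion could be written with the standard library only. The function names (`hash_embed`, `cosine`, `rrf_fuse`), the 64-dimension vector size, the SHA-256 hashing scheme, and the constant `k=60` are all assumptions made for this example, not details taken from the repo.

```python
import hashlib
import math
from collections import defaultdict


def hash_embed(text, dim=64):
    """Deterministic bag-of-words vector via the hashing trick (assumed scheme)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # hashlib is stable across runs; the built-in hash() is salted per process.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a, b):
    """Dot product; equals cosine similarity because hash_embed returns unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of chunk ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is reproducible.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from a sparse (BM25) and a dense (cosine) retriever.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "a"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

Because RRF uses only ranks, the BM25 and cosine scores never need to be put on a common scale, which is one reason it suits a hybrid setup like this one.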

Results (measured — from repo fixtures, reproducible by command)

Mode         | Embedder         | hit@3 | Answer recall | Refusal P / R | Citation acc.
CI (offline) | hash             | 90.6% | (not given)   | 80% / 100%    | 33.3%
Local (full) | nomic-embed-text | 84.4% | 75.0%         | 66.7% / 100%  | 48.4%

Both modes pass the CI gate thresholds. The imperfect numbers are intentional: the offline proxy is calibrated so that the hash embedder and BM25 signal claim only what they can honestly measure.
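To make the gate concrete, here is a minimal sketch of how hit@k and a threshold check might be wired together. The threshold values are placeholders invented for this example (the real gate values are not stated here), and `hit_at_k`, `hit_rate`, `gate_failures`, and `gate_exit_code` are hypothetical names.

```python
def hit_at_k(ranked_ids, relevant_ids, k=3):
    """True if any relevant chunk id appears in the top-k retrieved ids."""
    return any(doc_id in relevant_ids for doc_id in ranked_ids[:k])


def hit_rate(results, k=3):
    """results: list of (ranked_ids, relevant_ids) pairs, one per in-scope question."""
    hits = sum(hit_at_k(ranked, relevant, k) for ranked, relevant in results)
    return hits / len(results)


# Placeholder thresholds, assumed for illustration only.
THRESHOLDS = {
    "hit_at_3": 0.80,
    "answer_recall": 0.70,
    "refusal_precision": 0.60,
    "refusal_recall": 0.95,
    "citation_accuracy": 0.30,
}


def gate_failures(metrics, thresholds=THRESHOLDS):
    """Return one message per metric that is missing or below its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return failures


def gate_exit_code(metrics, thresholds=THRESHOLDS):
    """0 if every metric clears its threshold, 1 otherwise."""
    return 1 if gate_failures(metrics, thresholds) else 0


# Local (full) mode numbers as reported in the results.
local_metrics = {
    "hit_at_3": 0.844,
    "answer_recall": 0.750,
    "refusal_precision": 0.667,
    "refusal_recall": 1.0,
    "citation_accuracy": 0.484,
}
```

A CI entry point could pass `gate_exit_code(local_metrics)` to `sys.exit`, so that a single metric below its threshold fails the build. Treating a missing metric as a failure is a design choice in this sketch: a harness that silently skips a metric would defeat the purpose of the gate.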

Limitations and failure modes

  • This is a demo on a 10-chunk corpus. It does not demonstrate the indexing, caching, and cost problems that appear at tens of thousands of chunks.
  • Citation accuracy is a proxy that reads low by construction, because of how the top-k structure works (33.3% offline, 48.4% local); the README spells out exactly how it’s computed and where its limits are.
  • Refusal precision (66.7% to 80%) has room to improve. I prioritized refusing everything that should be refused (100% recall), so some over-refusal remains.
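The trade-off between refusal precision and recall can be sketched as follows. The function name `refusal_metrics` and the per-question outcomes are hypothetical: the counts are chosen only because they are arithmetically consistent with the reported 80% / 100% and 66.7% / 100% figures on a 32 + 8 question set, and the actual per-question results are not given here.

```python
def refusal_metrics(refused, should_refuse):
    """refused / should_refuse: parallel lists of booleans, one entry per question."""
    pairs = list(zip(refused, should_refuse))
    tp = sum(r and s for r, s in pairs)          # correctly refused
    fp = sum(r and not s for r, s in pairs)      # refused an answerable question
    fn = sum((not r) and s for r, s in pairs)    # answered an unanswerable question
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# 32 in-scope questions followed by 8 out-of-scope questions.
should_refuse = [False] * 32 + [True] * 8

# Hypothetical run A: all 8 out-of-scope refused, plus 2 in-scope over-refusals.
run_a = [True] * 2 + [False] * 30 + [True] * 8
precision_a, recall_a = refusal_metrics(run_a, should_refuse)  # 8/10 and 8/8

# Hypothetical run B: same recall, but 4 in-scope over-refusals.
run_b = [True] * 4 + [False] * 28 + [True] * 8
precision_b, recall_b = refusal_metrics(run_b, should_refuse)  # 8/12 and 8/8
```

With only 8 out-of-scope questions, each extra over-refusal moves precision by several points, so the precision figure is sensitive to a handful of in-scope questions.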

What I learned

“RAG without evaluation” is a feature that never shipped. The moment evaluation goes into CI, retrieval quality stops being a matter of feel and becomes code with regression tests. The next step is porting the same golden set to a pgvector-based live demo, keeping the same gate in a cloud environment.