Most RAG tutorials hand you a vector store, a cosine-similarity call, and a prompt template, then declare victory. That pipeline falls over the first time someone asks a keyword-precise question — "BM25 vs TF-IDF ranking" returns generic results about "search relevance" because dense embeddings compress the exact-match signal away.
This is the pipeline I actually run in production: 69,638 chunks across 30 curated sources, retrieved with hybrid lexical scoring fused by weighted Reciprocal Rank Fusion, then passed through an answer verifier that strips fabricated quotes before anything reaches a reader. The measured numbers are 95.6% retrieval (20/20 test questions, Grade A) and 99/100 on the answer-quality gate — both shown live at https://blog.r-lopes.com/how-it-works. Every code block below is copy-pasteable from the running system.
The Core Fix
The single biggest lever is not better embeddings — it's fusing retrieval signals that fail differently. BM25 handles the what (exact terms, rare-token weighting); TF-IDF cosine handles the about (term-distribution similarity); Reciprocal Rank Fusion merges their rankings without needing to tune a single similarity threshold. Dense vectors get added as a third list, but they are the garnish, not the base — the lexical pair is what recovers the keyword-critical queries a vector-only system silently drops.
If you do exactly one thing to a vector-only RAG system, add BM25 and fuse with RRF. That's the move.
Architecture
query
  │
  ▼
smart-retrieval.js    intent detection + multi-angle expansion
  │
  ▼
search.js
  ├── synonym expansion (query-side only)
  ├── BM25 scoring ── list 1
  ├── TF-IDF cosine ── list 2
  ├── (optional) dense vector ── list 3
  ├── weighted RRF fusion (k=60, weights [1.2, 1.0])
  ├── per-source cap (no single source dominates)
  └── cross-encoder rerank
  │
  ▼
openai-proxy.js       build context + system prompt → LLM (Claude / local Ollama)
  │
  ▼
verify-answer.js      strip fabricated quotes + banned phrases
  │
  ▼
streamed answer
Retrieval: BM25 + TF-IDF + RRF
BM25 is the workhorse. The IDF term rewards rare query terms; the TF normalization saturates so a chunk doesn't win just by repeating a word, and it normalizes by chunk length relative to the average chunk length so long chunks don't dominate:
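// BM25 tuning constants. Their values are not shown in this post; 1.2 and 0.75
// are the commonly used defaults and are an assumption here.
const K1 = 1.2;
const B = 0.75;

function bm25Score(queryTokens, doc, df, totalDocs, avgDl) {
  let score = 0;
  for (const term of queryTokens) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue; // term never appears in the corpus
    // IDF with +1 inside the log, so the value never goes negative.
    const idf = Math.log((totalDocs - termDf + 0.5) / (termDf + 0.5) + 1);
    const termTf = doc.tf[term] || 0;
    // Saturating TF, normalized by chunk length relative to the average.
    const tfNorm = (termTf * (K1 + 1)) / (termTf + K1 * (1 - B + B * doc.docLength / avgDl));
    score += idf * tfNorm;
  }
  return score;
}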
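The scorer above assumes each chunk already carries a term-frequency map (`doc.tf`) and a length (`doc.docLength`), plus corpus-level `df`, `totalDocs`, and `avgDl`. A minimal sketch of how such an index could be built follows. `tokenize` and `buildIndex` are hypothetical names, the `id`, `source`, and `text` fields are an assumed chunk shape, and the simple alphanumeric tokenizer is an assumption rather than the tokenizer the running system uses.

```javascript
// Hypothetical tokenizer: lowercase, split on anything that is not a-z or 0-9.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Hypothetical index builder producing the shapes bm25Score expects.
// Assumed chunk shape: { id, source, text }.
function buildIndex(chunks) {
  // Prototype-free maps, so a token like "constructor" cannot collide
  // with a property inherited from Object.prototype.
  const df = Object.create(null); // term -> number of chunks containing it
  let totalLength = 0;

  const docs = chunks.map((chunk) => {
    const tokens = tokenize(chunk.text);
    const tf = Object.create(null); // term -> count within this chunk
    for (const t of tokens) tf[t] = (tf[t] || 0) + 1;
    for (const t of Object.keys(tf)) df[t] = (df[t] || 0) + 1;
    totalLength += tokens.length;
    return { id: chunk.id, source: chunk.source, tf, docLength: tokens.length };
  });

  const totalDocs = docs.length;
  const avgDl = totalDocs > 0 ? totalLength / totalDocs : 0;
  return { docs, df, totalDocs, avgDl };
}
```

Building `df` from the distinct terms of each chunk (not from raw token counts) is what makes it a document frequency, which is the quantity the IDF term needs.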
TF-IDF cosine is the second signal. It captures distributional similarity that BM25's term-at-a-time scoring misses:
function tfidfCosine(queryTokens, doc, df, totalDocs) {
  // Raw term counts for the query. Prototype-free, so a token such as
  // "constructor" cannot collide with an inherited Object property.
  const queryTf = Object.create(null);
  for (const t of queryTokens) queryTf[t] = (queryTf[t] || 0) + 1;

  let dotProduct = 0, queryMag = 0, docMag = 0;
  // Dot product and query magnitude over the distinct query terms.
  for (const term of new Set(queryTokens)) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue;
    const idf = Math.log(totalDocs / (termDf + 1));
    const qTfidf = (queryTf[term] || 0) * idf;
    const dTfidf = (doc.tf[term] || 0) * idf;
    dotProduct += qTfidf * dTfidf;
    queryMag += qTfidf * qTfidf;
  }
  // Document magnitude over every term in the chunk, not just the query terms.
  for (const term of Object.keys(doc.tf)) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue;
    const idf = Math.log(totalDocs / (termDf + 1));
    docMag += (doc.tf[term] * idf) ** 2;
  }
  queryMag = Math.sqrt(queryMag);
  docMag = Math.sqrt(docMag);
  if (queryMag === 0 || docMag === 0) return 0;
  return dotProduct / (queryMag * docMag);
}
The fusion is where most tutorials oversimplify. Standard RRF gives every list equal weight; in practice BM25 is the stronger signal for technical queries, so it gets a higher weight. The constant k=60 is the standard damping value — it stops rank-1 from utterly dominating rank-2:
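const RRF_K = 60;

function reciprocalRankFusion(rankedLists, k = RRF_K, weights = null) {
  const scores = new Map(); // doc id -> fused score
  for (let li = 0; li < rankedLists.length; li++) {
    const list = rankedLists[li];
    // Fall back to 1.0 when no weight is given for this list (for example a
    // third list passed alongside a two-element weights array); otherwise
    // `undefined / n` would turn the fused scores into NaN.
    const w = weights?.[li] ?? 1.0;
    for (let rank = 0; rank < list.length; rank++) {
      const id = list[rank].doc.id;
      // rank is 0-based, so the best hit in a list contributes w / (k + 1).
      const rrfScore = w / (k + rank + 1);
      scores.set(id, (scores.get(id) || 0) + rrfScore);
    }
  }
  return scores;
}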
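To make the weighting concrete, the score the function accumulates per document can be written out and worked by hand. The two documents and their ranks below are invented for illustration; r_i(d) is the 1-based rank of d in list i, which corresponds to `rank + 1` in the code.

```latex
\mathrm{RRF}(d) = \sum_{i} \frac{w_i}{k + r_i(d)}, \qquad k = 60,\quad w = (1.2,\ 1.0)

% d_1: rank 1 in the BM25 list, rank 3 in the TF-IDF list
\mathrm{RRF}(d_1) = \frac{1.2}{61} + \frac{1.0}{63} \approx 0.0355

% d_2: rank 2 in the BM25 list, rank 1 in the TF-IDF list
\mathrm{RRF}(d_2) = \frac{1.2}{62} + \frac{1.0}{61} \approx 0.0357
```

In this example d_2 narrowly outranks d_1 even though BM25 carries the higher weight. With k = 60 the gap between adjacent ranks is small, so a 1.2 weight tilts the fusion toward BM25 without letting it override a document the other list clearly prefers.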
Wiring it together — BM25 weighted 1.2, TF-IDF 1.0:
const bm25Ranked = docs
  .map(doc => ({ doc, score: bm25Score(expandedTokens, doc, index.df, totalDocs, avgDocLength) }))
  .sort((a, b) => b.score - a.score);
const tfidfRanked = docs
  .map(doc => ({ doc, score: tfidfCosine(expandedTokens, doc, index.df, totalDocs) }))
  .sort((a, b) => b.score - a.score);
const rrfScores = reciprocalRankFusion([bm25Ranked, tfidfRanked], RRF_K, [1.2, 1.0]);
Two details that earn their keep: synonym expansion is query-side only (expanding documents would blow up the index and dilute IDF), and a per-source cap runs after fusion so a single prolific source can't monopolize the top-k — diversity of evidence beats depth from one channel.
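Both details can be sketched in a few lines. This is an illustration under stated assumptions, not the code from `search.js`: `SYNONYMS`, `expandQuery`, and `capPerSource` are hypothetical names, the synonym entries are made up, each doc is assumed to carry a `source` field, and the defaults `topK = 10` and `maxPerSource = 3` are placeholder values.

```javascript
// Hypothetical synonym table; the real table is not shown in this post.
const SYNONYMS = new Map([
  ["k8s", ["kubernetes"]],
  ["js", ["javascript"]],
]);

// Query-side expansion: add synonyms to the query tokens only.
// Documents and the index are left untouched.
function expandQuery(tokens, synonyms = SYNONYMS) {
  const expanded = [...tokens];
  for (const t of tokens) {
    for (const syn of synonyms.get(t) || []) {
      if (!expanded.includes(syn)) expanded.push(syn);
    }
  }
  return expanded;
}

// Per-source cap, applied after fusion: walk the fused results best-first
// and skip a chunk once its source already holds maxPerSource slots.
// rrfScores: Map of doc id -> fused score. docsById: Map of doc id -> doc.
function capPerSource(rrfScores, docsById, topK = 10, maxPerSource = 3) {
  const ranked = [...rrfScores.entries()].sort((a, b) => b[1] - a[1]);
  const perSource = new Map(); // source -> slots used so far
  const results = [];
  for (const [id, score] of ranked) {
    const doc = docsById.get(id);
    if (!doc) continue;
    const used = perSource.get(doc.source) || 0;
    if (used >= maxPerSource) continue;
    perSource.set(doc.source, used + 1);
    results.push({ doc, score });
    if (results.length === topK) break;
  }
  return results;
}
```

Capping after fusion rather than before means each source competes on its fused score first, and only the overflow from a dominant source is dropped.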
The Quality Gate
Retrieval being right doesn't make the answer right. LLMs fabricate quotes, cite sources that weren't retrieved, and pad with cheerleading. So every generated answer passes a verifier before it ships, backed by 33 unit tests and a 4-case gold-standard gate with a hard floor of 90/100. The system currently scores 99/100.
The verifier's most important check is quote fidelity. Every Markdown blockquote (a line starting with >) is validated against the retrieved chunk text by fuzzy match at a 0.9 word-overlap ratio; quotes that aren't actually in the sources are replaced with a *[fabricated quote removed]* marker and logged. The full set of checks:
- Quote fidelity: blockquotes fuzzy-matched (0.9 word-overlap ratio) against retrieved chunks; fabrications stripped and logged.
- Invalid source refs: any `[Source N]` where `N` exceeds the retrieved count is removed.
- Banned phrases: `production-ready`, `blazing fast`, `world-class`, `best-in-class` and friends are flagged; cheerleading is a regression, not a flourish.
- Emoji headers and "Keep exploring" footers: auto-stripped.
- Structural compliance: deep answers must lead with one root cause before any diagram or table.
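The first three checks in the list can be sketched as below. This is a simplified illustration, not the contents of `verify-answer.js`: all function names are hypothetical, "word overlap" is assumed to mean the share of the quote's words that also occur in a chunk, each chunk is assumed to expose a `text` field, and blockquotes are handled one line at a time.

```javascript
// Assumed definition: share of the quote's words that also occur in the chunk.
function wordOverlapRatio(quote, chunkText) {
  const words = (s) => s.toLowerCase().match(/[a-z0-9']+/g) || [];
  const quoteWords = words(quote);
  if (quoteWords.length === 0) return 0;
  const chunkWords = new Set(words(chunkText));
  const hits = quoteWords.filter((w) => chunkWords.has(w)).length;
  return hits / quoteWords.length;
}

// Replace blockquote lines that no retrieved chunk supports, and
// return the removed quotes so the caller can log them.
function stripFabricatedQuotes(answer, chunks, threshold = 0.9) {
  const removed = [];
  const lines = answer.split("\n").map((line) => {
    const m = line.match(/^>\s*(.+)$/);
    if (!m) return line;
    const best = Math.max(0, ...chunks.map((c) => wordOverlapRatio(m[1], c.text)));
    if (best >= threshold) return line;
    removed.push(m[1]);
    return "*[fabricated quote removed]*";
  });
  return { text: lines.join("\n"), removed };
}

// Remove [Source N] references where N exceeds the number of retrieved chunks.
function stripInvalidSourceRefs(answer, retrievedCount) {
  return answer.replace(/\[Source (\d+)\]/g, (ref, n) =>
    Number(n) > retrievedCount ? "" : ref);
}

// Phrases named in this post; the "and friends" part of the list is not shown.
const BANNED_PHRASES = ["production-ready", "blazing fast", "world-class", "best-in-class"];

// Return the banned phrases that occur in the answer (case-insensitive).
function findBannedPhrases(answer) {
  const lower = answer.toLowerCase();
  return BANNED_PHRASES.filter((p) => lower.includes(p));
}
```

Measuring overlap from the quote's side, rather than symmetrically, keeps a short but accurate quote from being penalized for coming out of a long chunk.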
The gate runs automatically on proxy restart and as a git pre-push hook on guarded files. A change that drops the score below 90 does not ship.
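The pass/fail decision behind the gate could look something like the sketch below. The report shape and the scoring rule are assumptions: `evaluateGate` is a hypothetical name, the report is assumed to be `{ cases: [{ name, score }] }`, and the overall score is assumed to be the mean of the per-case scores. The actual layout of `quality_eval_report.json` is not shown in this post.

```javascript
// Hypothetical report shape: { cases: [{ name: string, score: number }] }.
// Assumption: the overall score is the mean of the per-case scores.
function evaluateGate(report, floor = 90) {
  const scores = report.cases.map((c) => c.score);
  const overall = scores.length
    ? scores.reduce((sum, s) => sum + s, 0) / scores.length
    : 0; // an empty report fails the gate
  // Names of individual cases below the floor, for the failure message.
  const failing = report.cases.filter((c) => c.score < floor).map((c) => c.name);
  return { overall, passed: overall >= floor, failing };
}
```

In a pre-push hook, a script would call something like this and exit with a non-zero status when `passed` is false, which causes git to abort the push.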
The Numbers
These are measured, not aspirational — generated from the live corpus and the latest eval reports:
| Metric | Value | Source |
|---|---|---|
| Chunks in corpus | 69,638 | live rag_chunks.json |
| Distinct sources | 30 | live rag_chunks.json |
| Retrieval | 20/20 (95.6%), Grade A | https://blog.r-lopes.com/how-it-works |
| Topic recall | perfect | rag_eval_report.json |
| Keyword recall | perfect | rag_eval_report.json |
| Source recall | trails — the weak spot | rag_eval_report.json |
| Answer quality gate | 99/100 (4/4 cases, floor 90) | quality_eval_report.json |
| Verifier unit tests | 33 | test-verifier.js |
What I'd Do Differently
Honesty section, because the failures are more useful than the wins:
- Source recall is the weak spot. Topic and keyword recall are both perfect, but source recall trails — the system finds the right answer but doesn't always surface every source that supports it. That's the next number to move.
- The gold-standard gate is only four cases. Four cases catch obvious regressions but won't catch a cross-domain one. Expanding to a Kafka query, a system-design query, and a web-performance query is the cheapest reliability upgrade left.
- Dense vectors are underused. They're wired in as a third RRF list but the lexical pair does most of the work. There's headroom in a proper cross-encoder rerank pass over a larger candidate set.
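For readers who want to track the same weak spot, source recall for a single eval question can be computed roughly as below. This is a sketch under assumptions: `recall` is a hypothetical helper, and each eval question is assumed to list the source names expected to support the answer. The real format of `rag_eval_report.json` is not shown here.

```javascript
// Recall: the fraction of expected items that appear among the retrieved items.
// expected: source names the eval question says should be surfaced (assumed field).
// retrieved: source names of the chunks actually returned for that question.
function recall(expected, retrieved) {
  if (expected.length === 0) return 1; // nothing expected, nothing missed
  const got = new Set(retrieved);
  const found = expected.filter((e) => got.has(e)).length;
  return found / expected.length;
}
```

A question whose answer is supported by two sources, of which only one is retrieved, scores 0.5 on this measure even when the answer itself is correct, which is the gap the first bullet describes.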
The pipeline isn't finished — no pipeline is. But "95.6% retrieval, 99/100 quality, fabrications stripped automatically" — all live at https://blog.r-lopes.com/how-it-works — is a real bar, measured on a real corpus, and the code above is exactly what produces it.