Case study — built & proven

2026-06-05 · 7 min read · Rafael Lopes

Building a RAG Pipeline From Scratch

AI rag retrieval bm25 tf-idf rrf llm search

Most RAG tutorials hand you a vector store, a cosine-similarity call, and a prompt template, then declare victory. That pipeline falls over the first time someone asks a keyword-precise question — "BM25 vs TF-IDF ranking" returns generic results about "search relevance" because dense embeddings compress the exact-match signal away.

This is the pipeline I actually run in production: 69,638 chunks across 30 curated sources, retrieved with hybrid lexical scoring fused by weighted Reciprocal Rank Fusion, then passed through an answer verifier that strips fabricated quotes before anything reaches a reader. The measured numbers are a 95.6% retrieval score (all 20 of 20 test questions passing, Grade A) and 99/100 on the answer-quality gate; both are shown live at https://blog.r-lopes.com/how-it-works. Every code block below is copy-pasteable from the running system.

The Core Fix

The single biggest lever is not better embeddings — it's fusing retrieval signals that fail differently. BM25 handles the what (exact terms, rare-token weighting); TF-IDF cosine handles the about (term-distribution similarity); Reciprocal Rank Fusion merges their rankings without needing to tune a single similarity threshold. Dense vectors get added as a third list, but they are the garnish, not the base — the lexical pair is what recovers the keyword-critical queries a vector-only system silently drops.

If you do exactly one thing to a vector-only RAG system, add BM25 and fuse with RRF. That's the move.

Architecture

query
  │
  ▼
smart-retrieval.js   intent detection + multi-angle expansion
  │
  ▼
search.js
  ├── synonym expansion (query-side only)
  ├── BM25 scoring           ── list 1
  ├── TF-IDF cosine          ── list 2
  ├── (optional) dense vector ── list 3
  ├── weighted RRF fusion (k=60, weights [1.2, 1.0])
  ├── per-source cap (no single source dominates)
  └── cross-encoder rerank
  │
  ▼
openai-proxy.js      build context + system prompt → LLM (Claude / local Ollama)
  │
  ▼
verify-answer.js     strip fabricated quotes + banned phrases
  │
  ▼
streamed answer
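
To make the data flow in the diagram concrete, here is a minimal sketch of how the four stages could be chained. The `answerQuery` function and the shape of the `stages` object are assumptions for illustration; only the file names in the comments come from the diagram above.

```javascript
// Hypothetical wiring: stage names mirror the diagram, but the function
// signatures are assumptions made for this sketch.
async function answerQuery(query, stages) {
  const queries = await stages.expand(query);           // smart-retrieval.js role
  const chunks  = await stages.search(queries);         // search.js role
  const draft   = await stages.generate(query, chunks); // openai-proxy.js role
  // The verifier gets the chunks as well as the draft, because it has to
  // check quotes and source references against what was actually retrieved.
  return stages.verify(draft, chunks);                  // verify-answer.js role
}
```

Passing the stages in as functions keeps the sketch runnable with stubs and makes each step testable in isolation.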

Retrieval: BM25 + TF-IDF + RRF

BM25 is the workhorse. The IDF term rewards rare query terms; the TF normalization saturates so a chunk doesn't win just by repeating a word, and it length-normalizes against the average document so long chunks don't dominate:

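// Assumed values: the post does not show K1 and B; these are the common BM25 defaults.
const K1 = 1.2;  // term-frequency saturation
const B = 0.75;  // strength of length normalization

function bm25Score(queryTokens, doc, df, totalDocs, avgDl) {
  let score = 0;
  for (const term of queryTokens) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue; // term never appears in the corpus
    // Smoothed IDF; the +1 inside the log keeps it non-negative.
    const idf = Math.log((totalDocs - termDf + 0.5) / (termDf + 0.5) + 1);
    const termTf = doc.tf[term] || 0;
    // Saturating TF, normalized by chunk length relative to the corpus average.
    const tfNorm = (termTf * (K1 + 1)) / (termTf + K1 * (1 - B + B * doc.docLength / avgDl));
    score += idf * tfNorm;
  }
  return score;
}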
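
The scorer above expects each doc to carry a `tf` map and a `docLength`, plus a corpus-wide `df` map and average length. A minimal sketch of how such an index might be built follows. The names `tokenize` and `buildIndex`, the tokenizer rule, and the `{ id, source, text }` chunk shape are all assumptions, not the running system's code.

```javascript
// Hypothetical helpers: tokenize() and buildIndex() are illustrative names.
function tokenize(text) {
  // Assumed rule: lowercase, split on anything that is not a letter, digit, or hyphen.
  return text.toLowerCase().split(/[^a-z0-9-]+/).filter(Boolean);
}

function buildIndex(chunks) {
  // Null-prototype objects so tokens like "constructor" cannot hit inherited properties.
  const df = Object.create(null);
  let totalLength = 0;
  const docs = chunks.map(chunk => {
    const tokens = tokenize(chunk.text);
    const tf = Object.create(null);
    for (const t of tokens) tf[t] = (tf[t] || 0) + 1;           // term frequency in this chunk
    for (const t of Object.keys(tf)) df[t] = (df[t] || 0) + 1;  // one document-frequency count per chunk
    totalLength += tokens.length;
    return { id: chunk.id, source: chunk.source, tf, docLength: tokens.length };
  });
  const totalDocs = docs.length;
  const avgDocLength = totalDocs > 0 ? totalLength / totalDocs : 0;
  return { docs, df, totalDocs, avgDocLength };
}
```

Whatever tokenizer is used, the query has to go through the same one as the chunks, otherwise the `df[term]` lookups miss.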

TF-IDF cosine is the second signal. It captures distributional similarity that BM25's term-at-a-time scoring misses:

function tfidfCosine(queryTokens, doc, df, totalDocs) {
  const queryTf = {};
  for (const t of queryTokens) queryTf[t] = (queryTf[t] || 0) + 1;
  let dotProduct = 0, queryMag = 0, docMag = 0;
  for (const term of new Set(queryTokens)) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue;
    const idf = Math.log(totalDocs / (termDf + 1));
    const qTfidf = (queryTf[term] || 0) * idf;
    const dTfidf = (doc.tf[term] || 0) * idf;
    dotProduct += qTfidf * dTfidf;
    queryMag += qTfidf * qTfidf;
  }
  for (const term of Object.keys(doc.tf)) {
    const termDf = df[term] || 0;
    if (termDf === 0) continue;
    const idf = Math.log(totalDocs / (termDf + 1));
    docMag += (doc.tf[term] * idf) ** 2;
  }
  queryMag = Math.sqrt(queryMag);
  docMag = Math.sqrt(docMag);
  if (queryMag === 0 || docMag === 0) return 0;
  return dotProduct / (queryMag * docMag);
}
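
One property of the `+ 1` smoothing in the IDF above is worth seeing with numbers. The snippet below isolates that expression; the helper name `tfidfIdf` and the corpus size of 1000 are made up for illustration.

```javascript
// Same smoothed IDF expression as in tfidfCosine, isolated to show its range.
const tfidfIdf = (totalDocs, termDf) => Math.log(totalDocs / (termDf + 1));

console.log(tfidfIdf(1000, 1));    // rare term: large positive weight
console.log(tfidfIdf(1000, 999));  // term in all but one chunk: weight is 0
console.log(tfidfIdf(1000, 1000)); // term in every chunk: slightly negative
```

Because the query weight and the document weight are both multiplied by the same IDF, the dot product and the magnitudes only ever see its square, so a slightly negative IDF for a ubiquitous term does not flip the sign of the similarity.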

The fusion is where most tutorials oversimplify. Standard RRF gives every list equal weight; in practice BM25 is the stronger signal for technical queries, so it gets a higher weight. The constant k=60 is the standard damping value — it stops rank-1 from utterly dominating rank-2:

const RRF_K = 60;

function reciprocalRankFusion(rankedLists, k = RRF_K, weights = null) {
  const scores = new Map();
  for (let li = 0; li < rankedLists.length; li++) {
    const list = rankedLists[li];
    const w = weights ? weights[li] : 1.0;
    for (let rank = 0; rank < list.length; rank++) {
      const id = list[rank].doc.id;
      const rrfScore = w / (k + rank + 1);
      scores.set(id, (scores.get(id) || 0) + rrfScore);
    }
  }
  return scores;
}
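
A small worked case shows what the weight actually buys. The sketch below restates the fusion compactly as `fuse` (a hypothetical name) so it runs on its own, and uses two made-up documents whose order the two lists disagree on.

```javascript
// Compact restatement of the fusion above so this snippet runs standalone.
function fuse(rankedLists, k, weights) {
  const scores = new Map();
  rankedLists.forEach((list, li) => {
    list.forEach((entry, rank) => {
      const id = entry.doc.id;
      scores.set(id, (scores.get(id) || 0) + weights[li] / (k + rank + 1));
    });
  });
  return scores;
}

// Toy rankings (made-up ids): the two lists disagree on the order of A and B.
const bm25List  = [{ doc: { id: "A" } }, { doc: { id: "B" } }];
const tfidfList = [{ doc: { id: "B" } }, { doc: { id: "A" } }];

const equal    = fuse([bm25List, tfidfList], 60, [1.0, 1.0]);
const weighted = fuse([bm25List, tfidfList], 60, [1.2, 1.0]);

// Equal weights: A = 1/61 + 1/62 and B = 1/62 + 1/61, so the two tie.
// Weighted: A = 1.2/61 + 1/62 and B = 1.2/62 + 1/61, so BM25's top pick edges ahead.
console.log(equal.get("A") - equal.get("B"));
console.log(weighted.get("A") > weighted.get("B")); // true
```

With k=60 the gap between adjacent ranks is small, so a modest weight like 1.2 acts mostly as a tie-breaker when the lists disagree rather than letting one list override the other.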

Wiring it together — BM25 weighted 1.2, TF-IDF 1.0:

const bm25Ranked  = docs.map(doc => ({ doc, score: bm25Score(expandedTokens, doc, index.df, totalDocs, avgDocLength) }))
                        .sort((a, b) => b.score - a.score);
const tfidfRanked = docs.map(doc => ({ doc, score: tfidfCosine(expandedTokens, doc, index.df, totalDocs) }))
                        .sort((a, b) => b.score - a.score);

const rrfScores = reciprocalRankFusion([bm25Ranked, tfidfRanked], RRF_K, [1.2, 1.0]);
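
The fusion returns a Map of id to score, so one more step is needed to get a ranked candidate list. A minimal sketch, assuming a `docsById` Map built elsewhere; `rankFused` and the `topK` default are hypothetical, not part of the code shown above.

```javascript
// Hypothetical helper: turns the Map returned by fusion into a ranked array.
function rankFused(rrfScores, docsById, topK = 10) {
  return [...rrfScores.entries()]
    .sort((a, b) => b[1] - a[1])                 // highest fused score first
    .slice(0, topK)
    .map(([id, score]) => ({ doc: docsById.get(id), score }));
}
```

The output keeps the same `{ doc, score }` shape as the per-signal lists, so later stages such as a source cap or a reranker can consume it without special cases.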

Two details that earn their keep: synonym expansion is query-side only (expanding documents would blow up the index and dilute IDF), and a per-source cap runs after fusion so a single prolific source can't monopolize the top-k — diversity of evidence beats depth from one channel.
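
Both details can be sketched in a few lines. Everything here is illustrative: the synonym entries, the function names `expandQuery` and `capPerSource`, and the default cap of 3 per source are assumptions, since the post does not show the real values.

```javascript
// Hypothetical synonym table; the real system's entries are not shown in the post.
const SYNONYMS = new Map([
  ["k8s", ["kubernetes"]],
  ["js", ["javascript"]],
]);

// Query-side expansion: the index is untouched, only the query grows.
function expandQuery(tokens, synonyms = SYNONYMS) {
  const expanded = new Set(tokens);
  for (const t of tokens) {
    for (const s of synonyms.get(t) || []) expanded.add(s);
  }
  return [...expanded];
}

// Post-fusion cap: walk the ranked list and skip a source once it has hit its quota.
function capPerSource(ranked, maxPerSource = 3, topK = 10) {
  const counts = new Map();
  const kept = [];
  for (const item of ranked) {
    const n = counts.get(item.doc.source) || 0;
    if (n >= maxPerSource) continue;
    counts.set(item.doc.source, n + 1);
    kept.push(item);
    if (kept.length === topK) break;
  }
  return kept;
}
```

Applying the cap to the already-fused ranking means a skipped chunk is simply replaced by the next-best chunk from another source, so the top-k stays full.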

The Quality Gate

Retrieval being right doesn't make the answer right. LLMs fabricate quotes, cite sources that weren't retrieved, and pad with cheerleading. So every generated answer passes a verifier before it ships, backed by 33 unit tests and a 4-case gold-standard gate with a hard floor of 90/100. The system currently scores 99/100.

The verifier's most important check is quote fidelity. Any blockquote (a line starting with >) is validated against the retrieved chunk text by fuzzy match at a 0.9 word-overlap ratio. Quotes that aren't actually in the sources are silently removed and logged server-side, so the reader never sees a placeholder. The full set of checks:

Quote fidelity — blockquotes fuzzy-matched (0.9 word-overlap ratio) against retrieved chunks; fabrications stripped and logged.
Invalid source refs — [Source N] where N exceeds the retrieved count is removed.
Banned phrases — production-ready, blazing fast, world-class, best-in-class and friends are flagged; cheerleading is a regression, not a flourish.
Emoji headers and "Keep exploring" footers — auto-stripped.
Structural compliance — deep answers must lead with one root cause before any diagram or table.
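
The first two checks in the list can be sketched as follows. This is not the code in verify-answer.js: the function names, the set-based overlap rule (which ignores word order), and the `{ text }` chunk shape are assumptions chosen to match the description.

```javascript
// Illustrative only: names and the exact matching rule are assumptions.
const words = s => s.toLowerCase().match(/[a-z0-9]+/g) || [];

// Fraction of the quote's words that also occur somewhere in the chunk.
function wordOverlap(quote, chunkText) {
  const q = words(quote);
  if (q.length === 0) return 0;
  const chunkWords = new Set(words(chunkText));
  return q.filter(w => chunkWords.has(w)).length / q.length;
}

function verifyAnswer(answer, chunks, threshold = 0.9) {
  const removed = [];
  const kept = answer.split("\n").filter(line => {
    if (!line.startsWith(">")) return true;      // only blockquote lines are checked
    const quote = line.replace(/^>\s*/, "");
    const ok = chunks.some(c => wordOverlap(quote, c.text) >= threshold);
    if (!ok) removed.push(quote);                // the caller can log these server-side
    return ok;
  });
  // Drop [Source N] references that point past the retrieved chunks.
  const text = kept.join("\n").replace(/\[Source (\d+)\]/g, (m, n) =>
    Number(n) >= 1 && Number(n) <= chunks.length ? m : "");
  return { text, removed };
}
```

Returning the removed quotes alongside the cleaned text keeps the stripping invisible to the reader while still leaving something to log.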

The gate runs automatically on proxy restart and as a git pre-push hook on guarded files. A change that drops the score below 90 does not ship.
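
The floor itself reduces to a one-line predicate. A minimal sketch, assuming the report exposes a numeric `score` field; the actual schema of `quality_eval_report.json` is not shown in this post, and `gatePasses` is a hypothetical name.

```javascript
// Assumed report shape: { "score": <number 0-100> }. The real schema is not shown.
const QUALITY_FLOOR = 90;

function gatePasses(report, floor = QUALITY_FLOOR) {
  // A missing or non-numeric score fails closed.
  return typeof report.score === "number" && report.score >= floor;
}
```

In a pre-push hook, a script could parse the report, call `gatePasses`, and exit with a non-zero status on failure, which is what makes git abort the push.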

The Numbers

These are measured, not aspirational — generated from the live corpus and the latest eval reports:

Metric	Value	Source
Chunks in corpus	69,638	live `rag_chunks.json`
Distinct sources	30	live `rag_chunks.json`
Retrieval	20/20 (95.6%), Grade A	https://blog.r-lopes.com/how-it-works
Topic recall	perfect	`rag_eval_report.json`
Keyword recall	perfect	`rag_eval_report.json`
Source recall	trails — the weak spot	`rag_eval_report.json`
Answer quality gate	99/100 (4/4 cases, floor 90)	`quality_eval_report.json`
Verifier unit tests	33	`test-verifier.js`

What I'd Do Differently

Honesty section, because the failures are more useful than the wins:

Source recall is the weak spot. Topic and keyword recall are both perfect, but source recall trails — the system finds the right answer but doesn't always surface every source that supports it. That's the next number to move.
The gold-standard gate is only four cases. Four cases catch obvious regressions but won't catch a cross-domain one. Expanding to a Kafka query, a system-design query, and a web-performance query is the cheapest reliability upgrade left.
Dense vectors are underused. They're wired in as a third RRF list but the lexical pair does most of the work. There's headroom in a proper cross-encoder rerank pass over a larger candidate set.
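
For reference, one plausible way to compute the source-recall number discussed above is the share of expected sources that appear among the retrieved chunks. This is an assumed definition for illustration, with a hypothetical `sourceRecall` helper; the eval script's actual formula is not shown here.

```javascript
// Hypothetical eval helper: recall = expected sources found / expected sources.
function sourceRecall(expectedSources, retrievedChunks) {
  if (expectedSources.length === 0) return 1;   // nothing expected, nothing missed
  const got = new Set(retrievedChunks.map(c => c.source));
  return expectedSources.filter(s => got.has(s)).length / expectedSources.length;
}
```

Under this definition a per-source cap that is too tight and a top-k that is too small both show up as lower recall, which makes the metric a useful dial when tuning either.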

The pipeline isn't finished — no pipeline is. But "95.6% retrieval, 99/100 quality, fabrications stripped automatically" — all live at https://blog.r-lopes.com/how-it-works — is a real bar, measured on a real corpus, and the code above is exactly what produces it.

Built, then written

Tested on my own homelab before publishing: a four-architecture cluster (ARM · AMD ROCm · NVIDIA CUDA · Apple Silicon) running this blog, the RAG pipeline, and a sovereign research copilot. Built and tested before it's written, then refined as I learn.

Rafael Lopes

Production AI Engineer in Vancouver, BC. Brazilian. Builds and ships production AI on a self-hosted homelab — RAG pipelines, distributed LLM inference, web performance, and platform engineering.
