
Build a RAG pipeline

Retrieval-Augmented Generation (RAG) lets an LLM answer questions over your private data. This guide walks through the concepts first, then the canonical code, then how to evaluate it.

What is RAG, in one paragraph?

A plain LLM only knows what it saw during training. RAG fixes that by doing a quick library lookup before the model answers: you search your own documents for the few passages most relevant to the user's question, paste them into the prompt, and ask the model to answer using only that context. The model becomes a reasoning engine on top of your data — without retraining and without waiting for the next model release.

Why not just fine-tune?

Fine-tuning teaches a model new style or behavior, but it's a poor way to inject facts. Facts change, fine-tunes are expensive to redo, and the model still hallucinates confidently when it forgets. RAG keeps your knowledge in a database where you can add, update, or delete a single document in milliseconds — and every answer is grounded in passages you can show the user as citations.

Fresh data: Update by re-indexing

Citations: Show the source passage

Cost: No training run needed

Control: Delete a doc → it's gone

The mental model

Think of RAG as an open-book exam. The retriever is the student flipping through the textbook to find the right page. The generator (the LLM) reads that page and writes the answer in their own words. If the retriever hands over the wrong page, even a brilliant student will write a wrong answer — which is why retrieval quality matters more than model choice in most RAG systems.


┌──────────┐   ┌─────────┐   ┌──────────┐   ┌──────────┐   ┌─────┐
│ Question │──▶│ Embed Q │──▶│  Vector  │──▶│ Top-K +  │──▶│ LLM │──▶ Answer
└──────────┘   └─────────┘   │  Search  │   │ Rerank   │   └─────┘
                              └──────────┘   └──────────┘       ▲
                                   ▲                            │
                                   │                            │
                              ┌──────────┐                ┌──────────┐
                              │ Vector DB│◀───── embed ───│ Chunked  │
                              └──────────┘                │   Docs   │
                                                          └──────────┘

The four stages

Every RAG system, no matter how fancy, decomposes into the same four stages: chunk, embed, retrieve, generate. The first two happen offline when you ingest documents. The last two run on every user query. Each stage has its own quality knobs — and its own way to quietly ruin the final answer.

1. Chunk — slice documents into bite-sized pieces

Embedding models have a context limit (a few hundred to a few thousand tokens) and retrieval works best when each unit of text covers one idea. So you split long documents into smaller chunks. A common starting point is 500–1000 characters with ~15% overlap so a sentence that straddles two chunks still appears whole in one of them.

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,    // ~150-200 tokens of English
  chunkOverlap: 120, // keeps cross-boundary context intact
});

const chunks = await splitter.splitDocuments(docs);

Chunk size = retrieval precision vs. context

Small chunks = more precise matches but less surrounding context for the LLM. Large chunks = richer context but the relevant sentence gets diluted by noise. Start at 800 chars and tune from evals.
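To make the size/overlap trade-off concrete, here is a minimal sketch of a fixed-size character chunker. The chunkText helper is hypothetical and much simpler than RecursiveCharacterTextSplitter, which also tries to break on paragraph and sentence separators; the sketch only shows how the window advances by chunkSize minus chunkOverlap.

```typescript
// Hypothetical helper for illustration: fixed-size character windows with overlap.
// Unlike a recursive splitter, it ignores paragraph and sentence boundaries.
function chunkText(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be in [0, chunkSize)");
  }
  const step = chunkSize - chunkOverlap; // how far the window moves each iteration
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window already reached the end
  }
  return chunks;
}

const sample = "abcdefghijklmnopqrstuvwxyz"; // 26 characters
const pieces = chunkText(sample, 10, 3);
console.log(pieces);
// [ 'abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz' ]
```

Each chunk repeats the last three characters of the previous one, which is the same mechanism that keeps a boundary-straddling sentence whole at realistic sizes such as 800/120.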

2. Embed & upsert — turn text into searchable vectors

An embedding model maps each chunk to a vector of numbers (e.g. 1536 dimensions) such that semantically similar chunks end up near each other in that space. You store the vector alongside the original text and any useful metadata (source URL, author, date) in a vector database. This is the index your retriever will search later.

const vectors = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: chunks.map((c) => c.pageContent),
});

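await index.upsert(chunks.map((c, i) => ({
  // Split chunks have no stable id of their own, so derive one from source + position
  id: `${c.metadata.source}#${i}`,
  values: vectors.data[i].embedding,
  metadata: {
    source: c.metadata.source,
    text: c.pageContent, // store original text for retrieval
  },
})));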

Always store the source text in metadata

The vector alone is useless to the LLM — it can't read numbers. You need the original chunk text back at query time, so save it as metadata when you upsert.
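What "near each other" means in practice is usually cosine similarity. The sketch below is a plain implementation with toy 3-dimensional vectors; the vectors and their names are invented for illustration and do not come from a real embedding model.

```typescript
// Cosine similarity: 1 means same direction, values near 0 mean unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy "embeddings", made up for this example (real ones have ~1536 dimensions).
const refundPolicy = [0.9, 0.1, 0.0];
const returnsFaq = [0.8, 0.2, 0.1];
const officeParty = [0.0, 0.1, 0.9];

const related = cosineSimilarity(refundPolicy, returnsFaq);
const unrelated = cosineSimilarity(refundPolicy, officeParty);
console.log(related > unrelated); // true
```

The dimension check also catches one way of mixing embedding models by accident: vectors of different lengths fail loudly instead of returning a meaningless score.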

3. Retrieve — find the most relevant chunks

At query time you embed the user's question with the same model, ask the vector DB for the top-K nearest chunks (usually K = 10–50), then optionally pass them through a reranker: a slower but more accurate cross-encoder that scores each chunk jointly with the query using full attention. Because it only sees the K candidates, the extra cost stays bounded. Reranking is often the single highest-leverage upgrade in a RAG system.

// 1. Embed the question with the SAME model used for documents
const queryVector = (await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: query,
})).data[0].embedding;

// 2. Approximate-nearest-neighbor search for candidates
const { matches } = await index.query({
  vector: queryVector,
  topK: 20,
  includeMetadata: true,
});

// 3. Rerank the candidates with a cross-encoder.
//    Each result carries `index` (position in `documents`) and `relevanceScore`.
const reranked = await cohere.rerank({
  model: "rerank-v3.5",
  query,
  documents: matches.map((m) => m.metadata.text),
  topN: 5,
});
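Conceptually, step 2 above is just "score every stored vector against the query and keep the best K". The following sketch does that exhaustively in memory. The IndexedChunk and Match types and the toy data are assumptions for illustration; a real vector DB uses an approximate index instead of scanning everything. It assumes unit-length vectors, so the dot product equals cosine similarity.

```typescript
// Hypothetical in-memory shapes; a real vector DB defines its own.
interface IndexedChunk {
  id: string;
  values: number[]; // assumed to be normalized to length 1
  text: string;
}

interface Match {
  id: string;
  score: number;
  text: string;
}

function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Exact (brute-force) nearest-neighbor search: O(number of chunks) per query.
function topKExact(query: number[], chunks: IndexedChunk[], k: number): Match[] {
  return chunks
    .map((c) => ({ id: c.id, score: dot(query, c.values), text: c.text }))
    .sort((x, y) => y.score - x.score) // highest similarity first
    .slice(0, k);
}

const toyIndex: IndexedChunk[] = [
  { id: "a", values: [1, 0], text: "Refunds are issued within 14 days." },
  { id: "b", values: [0.6, 0.8], text: "Returns require a receipt." },
  { id: "c", values: [0, 1], text: "The office closes at 6pm." },
];

const hits = topKExact([0.8, 0.6], toyIndex, 2);
console.log(hits.map((h) => h.id)); // [ 'b', 'a' ]
```

An exact scan like this is also a handy baseline on a small sample: if the approximate index returns different top-K results, the gap is recall lost to approximation.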

4. Generate — answer using the retrieved context

Stitch the top chunks into the prompt, tell the model to answer only from that context, and ask it to cite which chunk supports each claim. The system prompt is doing real work here: a weak instruction lets the model fall back on its training data and hallucinate.

const context = reranked.results
  // each result's `index` points back into `matches`, which holds the chunk text
  .map((r, i) => `[${i + 1}] ${matches[r.index].metadata.text}`)
  .join("\n---\n");

const answer = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    {
      role: "system",
      content:
        "Answer the question using ONLY the numbered context below. " +
        "Cite sources inline like [1], [2]. If the context does not " +
        "contain the answer, say 'I don't know'.",
    },
    {
      role: "user",
      content: `Context:\n${context}\n\nQuestion: ${query}`,
    },
  ],
});
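Once the model answers with inline markers like [1], you still need to map them back to real sources before showing citations. Below is one possible sketch; SourceChunk, buildContext, and citedSources are hypothetical names, and the answer string is hand-written rather than real model output.

```typescript
// Hypothetical shape: the chunk text plus where it came from.
interface SourceChunk {
  source: string;
  text: string;
}

// Number the chunks the same way the prompt does, so [n] maps to chunks[n - 1].
function buildContext(chunks: SourceChunk[]): string {
  return chunks.map((c, i) => `[${i + 1}] ${c.text}`).join("\n---\n");
}

// Collect the distinct sources the answer actually cites, in order of first use.
function citedSources(answer: string, chunks: SourceChunk[]): string[] {
  const pattern = /\[(\d+)\]/g;
  const cited = new Set<string>();
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(answer)) !== null) {
    const chunk = chunks[Number(m[1]) - 1];
    if (chunk) cited.add(chunk.source); // skip numbers that match no chunk
  }
  return Array.from(cited);
}

const topChunks: SourceChunk[] = [
  { source: "refunds.md", text: "Refunds are issued within 14 days." },
  { source: "returns.md", text: "Returns require a receipt." },
];

const modelAnswer = "Refunds are issued within 14 days [1]. Shipping is free [7].";
console.log(citedSources(modelAnswer, topChunks)); // [ 'refunds.md' ]
```

A citation number with no matching chunk, like [7] here, is a useful signal in its own right: the model cited something it was never given.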

Where RAG quietly goes wrong

Most broken RAG systems fail in the same handful of ways. Knowing the failure modes makes them easier to spot:

Bad chunking: Answers cut in half

Wrong embedding model: Topical, not specific

No rerank: Right chunk ranked 14th

Weak system prompt: Model hallucinates anyway

Missing metadata filters: Stale or wrong-tenant data

Mixed embedding versions: Silent recall collapse

Evaluation — the only way to know it works

RAG quality is invisible without a fixed evaluation set. Hold out 50–200 question/answer pairs that reflect real user questions and measure two things separately:

Retrieval quality: for each question, is the gold chunk in the top-K results (recall@k)? If retrieval is broken, no amount of prompt engineering will save you.

Generation quality: given the retrieved context, is the answer faithful (every claim supported by context) and relevant (actually answers the question)? Tools like RAGAS score this with an LLM-as-judge.
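The retrieval half of that split needs no LLM at all. Here is a minimal sketch of recall@k, assuming each eval row records the id of its gold chunk and the ranked ids your retriever returned; the EvalCase shape and the ids are hypothetical.

```typescript
// Hypothetical eval row: one question, its known-correct chunk, and what retrieval returned.
interface EvalCase {
  question: string;
  goldChunkId: string;
  retrievedIds: string[]; // ranked, best first
}

// Fraction of questions whose gold chunk appears in the first k results.
function recallAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) =>
    c.retrievedIds.slice(0, k).includes(c.goldChunkId)
  ).length;
  return hits / cases.length;
}

const evalSet: EvalCase[] = [
  { question: "How long do refunds take?", goldChunkId: "c1", retrievedIds: ["c9", "c1", "c4"] },
  { question: "Do returns need a receipt?", goldChunkId: "c2", retrievedIds: ["c2", "c7", "c8"] },
  { question: "When does the office close?", goldChunkId: "c3", retrievedIds: ["c5", "c6", "c7"] },
  { question: "Is shipping free?", goldChunkId: "c4", retrievedIds: ["c8", "c9", "c4"] },
];

console.log(recallAtK(evalSet, 1)); // 0.25
console.log(recallAtK(evalSet, 3)); // 0.75
```

Tracking recall at both a small and a large k is informative: a big gap between them means the right chunk is being found but ranked low, which is the situation a reranker is meant to fix.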

Evaluate, don't vibe-check

"It looked good when I tried it" is how RAG regressions ship. A 50-row eval set is cheap to run on every change and catches most of the edits that would otherwise silently degrade quality.

When to reach for advanced patterns

The pipeline above is enough for the vast majority of use cases. Reach for these only when evals show the baseline isn't enough:

Hybrid search: Rare tokens / IDs / code

Query rewriting: Conversational follow-ups

HyDE: Sparse query vocabulary

Multi-hop / agentic: Answers span many docs

Late chunking: Long technical documents

Metadata pre-filter: Multi-tenant or time-bound
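As an illustration of the first pattern, hybrid search needs a way to merge a vector ranking with a keyword ranking. One common choice, assumed here rather than prescribed by this guide, is reciprocal rank fusion (RRF). The function name, the document ids, and the default constant of 60 are illustrative.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per document,
// with rank starting at 1. k = 60 is a commonly used default; treat it as tunable.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, position) => {
      const rank = position + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id);
}

// Hypothetical results from a vector search and a keyword (e.g. BM25) search.
const vectorHits = ["doc-a", "doc-b", "doc-c"];
const keywordHits = ["doc-c", "doc-a", "doc-d"];

console.log(reciprocalRankFusion([vectorHits, keywordHits]));
// [ 'doc-a', 'doc-c', 'doc-b', 'doc-d' ]
```

RRF only looks at ranks, not raw scores, so it sidesteps the problem that cosine similarities and keyword scores live on different scales.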