Build a RAG pipeline
Retrieval-Augmented Generation (RAG) lets an LLM answer questions over your private data. This guide walks through the concepts first, then the canonical code, then how to evaluate the result.
What is RAG, in one paragraph?
A plain LLM only knows what it saw during training. RAG fixes that by doing a quick library lookup before the model answers: you search your own documents for the few passages most relevant to the user's question, paste them into the prompt, and ask the model to answer using only that context. The model becomes a reasoning engine on top of your data — without retraining and without waiting for the next model release.
Why not just fine-tune?
Fine-tuning teaches a model new style or behavior, but it's a poor way to inject facts. Facts change, fine-tunes are expensive to redo, and the model still hallucinates confidently when it forgets. RAG keeps your knowledge in a database where you can add, update, or delete a single document in milliseconds — and every answer is grounded in passages you can show the user as citations.
The mental model
Think of RAG as an open-book exam. The retriever is the student flipping through the textbook to find the right page. The generator (the LLM) reads that page and writes the answer in their own words. If the retriever hands over the wrong page, even a brilliant student will write a wrong answer — which is why retrieval quality matters more than model choice in most RAG systems.
┌──────────┐   ┌─────────┐   ┌──────────┐   ┌──────────┐   ┌─────┐
│ Question │──▶│ Embed Q │──▶│ Vector   │──▶│ Top-K +  │──▶│ LLM │──▶ Answer
└──────────┘   └─────────┘   │ Search   │   │ Rerank   │   └─────┘
                             └──────────┘   └──────────┘      ▲
                                  ▲                           │
                                  │                           │
                             ┌──────────┐                ┌──────────┐
                             │ Vector DB│◀───── embed ───│ Chunked  │
                             └──────────┘                │ Docs     │
                                                         └──────────┘
The four stages
Every RAG system, no matter how fancy, decomposes into the same four stages: chunk, embed, retrieve, generate. The first two happen offline when you ingest documents. The last two run on every user query. Each stage has its own quality knobs — and its own way to quietly ruin the final answer.
1. Chunk — slice documents into bite-sized pieces
Embedding models have a context limit (a few hundred to a few thousand tokens), and retrieval works best when each unit of text covers one idea, so you split long documents into smaller chunks. A common starting point is 500–1000 characters with ~15% overlap, so that a sentence straddling a chunk boundary is likely to appear whole in at least one of the two chunks.
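To make the overlap concrete, here is a minimal, dependency-free sketch of a fixed-size sliding window. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and sentences); `chunkText` and its parameters are hypothetical names used only for illustration.

```typescript
// Hypothetical helper, not a library API: fixed-size character windows with overlap.
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("need chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window already reached the end
  }
  return chunks;
}

const sample = "abcdefghijklmnopqrstuvwxyz";
const pieces = chunkText(sample, 10, 3);
console.log(pieces);
// → [ 'abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz' ]
```

Each chunk repeats the last three characters of the previous one, which is the overlap at work. In practice a separator-aware splitter gives cleaner boundaries than raw character windows.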
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800, // ~150-200 tokens of English
  chunkOverlap: 120, // keeps cross-boundary context intact
});
const chunks = await splitter.splitDocuments(docs);
2. Embed & upsert — turn text into searchable vectors
An embedding model maps each chunk to a vector of numbers (e.g. 1536 dimensions) such that semantically similar chunks end up near each other in that space. You store the vector alongside the original text and any useful metadata (source URL, author, date) in a vector database. This is the index your retriever will search later.
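"Near each other" is usually measured with cosine similarity. The sketch below is an illustration under assumed inputs: the three-dimensional vectors are made-up stand-ins for real embeddings, and `cosineSimilarity` is a hypothetical helper, not part of any SDK.

```typescript
// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // convention chosen here for zero vectors
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up toy "embeddings"; real ones have hundreds or thousands of dimensions.
const refundPolicy = [0.9, 0.1, 0.0];
const returnsFaq = [0.8, 0.2, 0.1];
const officeHours = [0.0, 0.1, 0.9];

const related = cosineSimilarity(refundPolicy, returnsFaq);
const unrelated = cosineSimilarity(refundPolicy, officeHours);
console.log(related > unrelated); // → true
```

A vector database does the same comparison at scale, typically with an approximate index so it does not have to score every stored vector.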
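// `openai` is an OpenAI client and `index` a vector DB index handle, both created elsewhere.
const vectors = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: chunks.map((c) => c.pageContent),
});
await index.upsert(chunks.map((c, i) => ({
  // Split chunks typically carry no id of their own, so derive a unique one
  id: `${c.metadata.source}#${i}`,
  values: vectors.data[i].embedding,
  metadata: {
    source: c.metadata.source,
    text: c.pageContent, // store original text for retrieval
  },
})));
3. Retrieve — find the most relevant chunks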
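At query time you embed the user's question with the same model used for the documents and ask the vector DB for the top-K nearest chunks (usually K = 10–50). Optionally, you then pass those candidates through a reranker: a cross-encoder that reads the query and each chunk together and scores the pair directly. It is slower per chunk than comparing precomputed vectors, which is why it runs only on the shortlist, but it is usually more accurate. Adding a reranker is often the single highest-leverage upgrade to a baseline RAG system.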
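The retrieve-then-rerank funnel can be sketched without any external service. In the assumed setup below, `cheapScore` stands in for vector similarity and `preciseScore` for a cross-encoder; both are toy word-overlap scorers invented for illustration, and only the two-stage shape carries over to a real system.

```typescript
type Scorer = (query: string, doc: string) => number;

// Lowercase and split on non-word characters.
const words = (text: string): string[] =>
  text.toLowerCase().split(/\W+/).filter((w) => w.length > 0);

// Toy stand-in for vector similarity: number of distinct query words found in the doc.
const cheapScore: Scorer = (query, doc) => {
  const docWords = words(doc);
  const queryWords = words(query).filter((w, i, all) => all.indexOf(w) === i);
  return queryWords.filter((w) => docWords.indexOf(w) !== -1).length;
};

// Toy stand-in for a cross-encoder: overlap divided by doc length, so focused passages win.
let preciseCalls = 0; // counts how often the "expensive" scorer runs
const preciseScore: Scorer = (query, doc) => {
  preciseCalls += 1;
  return cheapScore(query, doc) / Math.max(words(doc).length, 1);
};

// Keep the k highest-scoring docs, best first.
function topK(query: string, docs: string[], score: Scorer, k: number): string[] {
  return docs
    .map((doc) => ({ doc, s: score(query, doc) }))
    .sort((a, b) => b.s - a.s)
    .slice(0, k)
    .map((entry) => entry.doc);
}

// Stage 1 scans the whole corpus cheaply; stage 2 re-scores only the shortlist.
function retrieveThenRerank(query: string, corpus: string[], k: number, n: number): string[] {
  const candidates = topK(query, corpus, cheapScore, k);
  return topK(query, candidates, preciseScore, n);
}

const corpus = [
  "Password reset emails can take ten minutes to arrive after a reset request during peak hours",
  "Reset your password from the login page",
  "Office hours are nine to five",
  "Change your email address in settings",
];
const best = retrieveThenRerank("reset password", corpus, 2, 1);
console.log(best[0]); // → Reset your password from the login page
console.log(preciseCalls); // → 2
```

The expensive scorer ran on two documents rather than all four. That ratio is the point of the funnel: K bounds the reranking cost no matter how large the corpus grows.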
// 1. Embed the question with the SAME model used for documents
const queryVector = (await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: query,
})).data[0].embedding;
// 2. Approximate-nearest-neighbor search for candidates
const { matches } = await index.query({
  vector: queryVector,
  topK: 20,
  includeMetadata: true,
});
// 3. Rerank the candidates with a cross-encoder
const reranked = await cohere.rerank({
  model: "rerank-v3.5",
  query,
  documents: matches.map((m) => m.metadata.text),
  topN: 5,
});
4. Generate — answer using the retrieved context
Stitch the top chunks into the prompt, tell the model to answer only from that context, and ask it to cite which chunk supports each claim. The system prompt is doing real work here: a weak instruction lets the model fall back on its training data and hallucinate.
// Each rerank result carries the index of its input document, so look the text
// up in `matches` rather than relying on the response echoing the documents back.
const context = reranked.results
  .map((r, i) => `[${i + 1}] ${matches[r.index].metadata.text}`)
  .join("\n---\n");
const answer = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    {
      role: "system",
      content:
        "Answer the question using ONLY the numbered context below. " +
        "Cite sources inline like [1], [2]. If the context does not " +
        "contain the answer, say 'I don't know'.",
    },
    {
      role: "user",
      content: `Context:\n${context}\n\nQuestion: ${query}`,
    },
  ],
});
// The reply text is in answer.choices[0].message.content
Where RAG quietly goes wrong
Most broken RAG systems fail in the same handful of ways. Knowing the failure modes makes them easier to spot:
Evaluation — the only way to know it works
RAG quality is invisible without a fixed evaluation set. Hold out 50–200 question/answer pairs that reflect real user questions and measure two things separately:
Retrieval quality: for each question, is the gold chunk in the top-K results? (recall@k). If retrieval is broken, no amount of prompt engineering will save you.
Generation quality: given the retrieved context, is the answer faithful (every claim supported by context) and relevant (actually answers the question)? Tools like RAGAS score this with an LLM-as-judge.
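The retrieval metric reduces to a few lines. This sketch assumes one gold chunk per question (with a single gold chunk, recall@k is the same as hit rate); the `EvalCase` shape and the chunk ids are hypothetical.

```typescript
interface EvalCase {
  question: string;
  goldChunkId: string; // the chunk a human marked as containing the answer
  retrievedIds: string[]; // ids returned by the retriever, best first
}

// Fraction of questions whose gold chunk appears among the first k results.
function recallAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter(
    (c) => c.retrievedIds.slice(0, k).indexOf(c.goldChunkId) !== -1
  ).length;
  return hits / cases.length;
}

// Made-up evaluation data: gold chunk at rank 1, rank 3, rank 6, and missing.
const evalSet: EvalCase[] = [
  { question: "q1", goldChunkId: "c1", retrievedIds: ["c1", "c9", "c4"] },
  { question: "q2", goldChunkId: "c2", retrievedIds: ["c7", "c3", "c2"] },
  { question: "q3", goldChunkId: "c5", retrievedIds: ["c8", "c6", "c1", "c3", "c9", "c5"] },
  { question: "q4", goldChunkId: "c4", retrievedIds: ["c2", "c7"] },
];

console.log(recallAtK(evalSet, 1)); // → 0.25
console.log(recallAtK(evalSet, 3)); // → 0.5
console.log(recallAtK(evalSet, 10)); // → 0.75
```

Tracking the number at several values of k shows whether a miss is a ranking problem (the gold chunk turns up at a larger k) or a recall problem (it never turns up at all).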
When to reach for advanced patterns
The pipeline above is enough for the vast majority of use cases. Reach for these only when evals show the baseline isn't enough: