
Chunking strategies

Bad chunks ruin retrieval no matter how good the model is. Spend time here — it has the highest leverage of any single decision in a RAG pipeline.

Why chunking matters

Embedding models compress a passage into a single vector. The longer and more topically mixed that passage is, the more the vector becomes an average of unrelated ideas — and averages don't match well against specific questions. Chunking is how you keep each vector focused on one idea so similarity search can actually find it.
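The averaging effect can be illustrated with a toy model. This is a sketch under loud assumptions: real embedding models do not literally average topic vectors, and the 2-D vectors and the names `billingOnly`, `securityOnly`, and `mixed` are made up for illustration.

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 2-D "embeddings": axis 0 = billing topic, axis 1 = security topic.
const billingOnly = [1, 0];
const securityOnly = [0, 1];

// A topically mixed passage, modeled (as an assumption) as the mean of both.
const mixed = billingOnly.map((v, i) => (v + securityOnly[i]) / 2);

const billingQuery = [1, 0]; // a question purely about billing

const focusedScore = cosine(billingQuery, billingOnly); // 1
const mixedScore = cosine(billingQuery, mixed); // about 0.707
```

In this toy setup the focused chunk scores a perfect match while the mixed chunk loses roughly 30% of its similarity, even though it contains the same billing content.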

A chunk is also the unit the LLM eventually reads. Too small and the model loses the surrounding context it needs to reason. Too large and the relevant sentence drowns in noise and burns tokens. Good chunking is the art of picking the sweet spot for your data.

The 80/20 rule

Most RAG quality problems trace back to chunking, not the embedding model or the LLM. Fix chunking first, measure, then move on.

The knobs you actually tune

Chunk size: 500–1500 characters is typical
Overlap: 10–20% of chunk size
Split boundary: paragraph → sentence → character
Metadata: source, section, page, date

1. Fixed-size with overlap (the baseline)

Simple, fast, and surprisingly hard to beat. Walk the text in fixed windows of N characters, stepping forward by N − overlap each time. The overlap region means a sentence that straddles a boundary still appears whole in at least one chunk, provided the sentence is no longer than the overlap.

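function fixedChunks(text: string, size = 800, overlap = 120): string[] {
  // With overlap >= size the window would stall or move backwards (infinite loop).
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("need size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + size, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break; // last window reached the end of the text
    start = end - overlap; // step forward by size - overlap, keep overlap region
  }
  return chunks;
}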
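The window arithmetic also tells you how many chunks (and therefore how many embedding calls) a document will produce. A minimal sketch, assuming the same stepping rule as the function above; `chunkCount` is a hypothetical helper name, not part of any library.

```typescript
// Number of windows produced when stepping by (size - overlap) characters
// and stopping as soon as a window reaches the end of the text.
function chunkCount(length: number, size = 800, overlap = 120): number {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("need size > 0 and 0 <= overlap < size");
  }
  if (length === 0) return 0;
  if (length <= size) return 1;
  const step = size - overlap;
  // The first window covers `size` characters; each later window adds `step`.
  return 1 + Math.ceil((length - size) / step);
}

const chunksFor10k = chunkCount(10_000); // 15 with the defaults (step = 680)
```

Raising the overlap shrinks the step, so the chunk count and the index size grow accordingly.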

Why overlap matters

Without overlap, the answer "The capital of France is Paris." can be split into "The capital of France is" and "Paris." — neither chunk will match the query "What is the capital of France?" well.
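The effect can be reproduced on a three-sentence string. This is a sketch: the surrounding sentences and the window size are chosen purely so that the cut lands right before "Paris", and the overlap is set to the sentence length so the sentence is guaranteed to survive whole.

```typescript
// Fixed-size windows, stepping by (size - overlap) characters.
function fixedChunks(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("need size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    const end = Math.min(start + size, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
  }
  return chunks;
}

const sentence = "The capital of France is Paris.";
const sample = `Europe has many capitals. ${sentence} It sits on the Seine.`;

// Pick a window size whose first cut lands right before " Paris".
const windowSize = sample.indexOf(" Paris");

const containsWhole = (chunks: string[]): boolean =>
  chunks.some((c) => c.includes(sentence));

const withoutOverlap = containsWhole(fixedChunks(sample, windowSize, 0)); // false
const withOverlap = containsWhole(
  fixedChunks(sample, windowSize, sentence.length),
); // true
```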

2. Recursive character splitting (the smart baseline)

Instead of cutting blindly at character N, try splitting at the most meaningful boundary first: paragraph break, then sentence, then word, then character. This keeps semantic units intact whenever possible. It's LangChain's recommended general-purpose splitter for a reason, and LlamaIndex's default sentence splitter follows a similar boundary-first idea.

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,
  chunkOverlap: 120,
  // Tried in order — fall through to the next if the chunk is still too big
  separators: ["\n\n", "\n", ". ", " ", ""],
});

const chunks = await splitter.splitDocuments(docs);
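To make the fall-through behavior concrete, here is a simplified from-scratch sketch of the same idea. It is not LangChain's implementation: `recursiveSplit` is a hypothetical name, it applies no overlap, and it drops the separator at chunk boundaries.

```typescript
// Split on the coarsest separator first; only pieces that are still too long
// fall through to the next, finer separator. Small pieces are merged back
// together greedily up to maxLen.
function recursiveSplit(
  text: string,
  maxLen: number,
  separators: string[] = ["\n\n", "\n", ". ", " ", ""],
): string[] {
  if (maxLen <= 0) throw new RangeError("maxLen must be positive");
  if (text.length === 0) return [];
  if (text.length <= maxLen) return [text];

  const [sep, ...finer] = separators;
  if (sep === undefined || sep === "") {
    // Last resort: hard cut every maxLen characters.
    const cuts: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) {
      cuts.push(text.slice(i, i + maxLen));
    }
    return cuts;
  }

  const chunks: string[] = [];
  let buffer = "";
  const flush = () => {
    if (buffer.length > 0) chunks.push(buffer);
    buffer = "";
  };

  for (const piece of text.split(sep)) {
    if (piece.length === 0) continue;
    if (piece.length > maxLen) {
      flush();
      chunks.push(...recursiveSplit(piece, maxLen, finer));
      continue;
    }
    const candidate = buffer.length > 0 ? buffer + sep + piece : piece;
    if (candidate.length <= maxLen) {
      buffer = candidate;
    } else {
      flush();
      buffer = piece;
    }
  }
  flush();
  return chunks;
}

const demo = recursiveSplit(
  "Alpha beta gamma.\n\nDelta epsilon zeta eta theta.",
  20,
);
// ["Alpha beta gamma.", "Delta epsilon zeta", "eta theta."]
```

The first paragraph fits and is kept whole; only the second, oversized paragraph is broken further, and it breaks on spaces rather than mid-word.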

3. Structure-aware splitting

For Markdown, HTML, or source code, the document already tells you where the meaningful boundaries are: headings, sections, functions, classes. Split on those first, then fall back to character splits only inside each block. Respecting structure alone often buys a noticeable recall gain (5–15 points is a rough rule of thumb, not a guarantee), so confirm it on your own eval set.

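// MarkdownHeaderTextSplitter ships with LangChain's Python package; the JS
// package does not appear to export it, so this is a small hand-rolled stand-in.
// Caveat: it does not skip "#" lines that sit inside fenced code blocks.
type Section = { pageContent: string; metadata: Record<string, string> };

function splitByHeaders(markdown: string): Section[] {
  const sections: Section[] = [];
  const trail: Record<string, string> = {}; // current h1/h2/h3 headings
  let body: string[] = [];
  const flush = () => {
    const pageContent = body.join("\n").trim();
    if (pageContent) sections.push({ pageContent, metadata: { ...trail } });
    body = [];
  };
  for (const line of markdown.split("\n")) {
    const m = /^(#{1,3})\s+(.+)$/.exec(line);
    if (!m) { body.push(line); continue; }
    flush(); // close the section under the previous heading
    trail[`h${m[1].length}`] = m[2].trim();
    // A new heading invalidates any deeper headings recorded before it.
    for (let l = m[1].length + 1; l <= 3; l++) delete trail[`h${l}`];
  }
  flush();
  return sections;
}
const sections = splitByHeaders(markdown);

// Each section carries its heading trail as metadata so the LLM
// (and the user) can see where the chunk came from.
sections[0].metadata; // e.g. { h1: "Guide", h2: "Chunking", h3: "Overlap" }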

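For code, use a language-aware splitter whose separators follow declarations such as classes and functions:

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// LangChain JS lists "js" rather than "ts" among its supported languages
// (at the time of writing); the "js" separators work for TypeScript source too.
const codeSplitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
  chunkSize: 1000,
  chunkOverlap: 100,
});
// Tries class/function keyword boundaries before falling back to lines.
// This is separator matching, not a parser, so nested scopes can still be cut.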

4. Semantic chunking

A more recent approach: embed each sentence, then start a new chunk whenever the cosine distance between consecutive sentences spikes above a threshold. The result is chunks whose boundaries follow topic shifts in the text rather than arbitrary character counts. Great for prose, weaker for highly structured documents (you're throwing away the structure signal).

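// Assumes two helpers defined elsewhere:
//   embed(texts: string[]): Promise<number[][]>  (one vector per input)
//   cosine(a: number[], b: number[]): number     (cosine similarity)
async function semanticChunk(text: string, threshold = 0.25): Promise<string[]> {
  const sentences = text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
  if (sentences.length === 0) return []; // nothing to embed
  const embeds = await embed(sentences); // [n][dim]

  const chunks: string[][] = [[sentences[0]]];
  for (let i = 1; i < sentences.length; i++) {
    const dist = 1 - cosine(embeds[i], embeds[i - 1]);
    if (dist > threshold) chunks.push([]); // topic shift → new chunk
    chunks.at(-1)!.push(sentences[i]);
  }
  return chunks.map((c) => c.join(" "));
}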
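The grouping step can be tested without calling an embedding model by passing vectors in directly. A sketch under stated assumptions: `groupBySimilarity` is a hypothetical name, the 2-D vectors are hand-made stand-ins for real embeddings, and the 0.25 threshold is only meaningful for this toy data.

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Start a new chunk whenever the cosine distance between consecutive
// sentence vectors exceeds the threshold.
function groupBySimilarity(
  sentences: string[],
  embeds: number[][],
  threshold = 0.25,
): string[] {
  if (sentences.length === 0) return [];
  const groups: string[][] = [[sentences[0]]];
  for (let i = 1; i < sentences.length; i++) {
    const dist = 1 - cosine(embeds[i], embeds[i - 1]);
    if (dist > threshold) groups.push([]);
    groups[groups.length - 1].push(sentences[i]);
  }
  return groups.map((g) => g.join(" "));
}

const grouped = groupBySimilarity(
  ["Cats purr.", "Cats nap a lot.", "Invoices are due monthly."],
  [
    [1, 0], // cat topic
    [0.9, 0.1], // still mostly cat topic
    [0, 1], // billing topic
  ],
);
// ["Cats purr. Cats nap a lot.", "Invoices are due monthly."]
```

With a real model the distances are much less clean, so the threshold needs tuning per model and corpus.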

More cost, sometimes more quality

Semantic chunking calls the embedding model on every sentence at ingest time. Only adopt it if your eval set shows a real win over recursive splitting — it often doesn't justify the cost.

5. Late chunking (2024 technique)

Flip the usual order: embed the entire long document with a long-context embedding model first, then mean-pool the token embeddings over chunk-sized windows. Because each token's embedding already attended to the whole document, the resulting chunk vectors "know" their surrounding context. A notable quality win for legal, medical, and technical documents where references span sections.

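// Pseudocode: requires a long-context model that exposes token embeddings.
// longContextEmbed, makeCharWindows, and meanPool are placeholders, and
// offsets[i] is assumed to be the character offset where token i starts.
const { tokenEmbeddings, offsets } = await longContextEmbed(fullDoc);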

const windows = makeCharWindows(fullDoc, 800, 120);
const chunkVectors = windows.map((w) => {
  const tokenSlice = tokenEmbeddings.filter(
    (_, i) => offsets[i] >= w.start && offsets[i] < w.end,
  );
  return meanPool(tokenSlice); // context-aware chunk vector
});
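The pooling half of the technique is plain arithmetic and can be written out in full. This is a sketch with hand-made data: the token embeddings and offsets below stand in for the output of a long-context model, and `makeCharWindows`, `meanPool`, and `poolWindows` are illustrative helpers rather than library functions.

```typescript
type CharWindow = { start: number; end: number };

// Element-wise mean of a list of equal-length vectors.
function meanPool(vectors: number[][]): number[] {
  if (vectors.length === 0) return [];
  const dim = vectors[0].length;
  const sum = new Array<number>(dim).fill(0);
  for (const v of vectors) {
    for (let d = 0; d < dim; d++) sum[d] += v[d];
  }
  return sum.map((s) => s / vectors.length);
}

// Character windows of `size`, stepping by (size - overlap).
function makeCharWindows(text: string, size: number, overlap: number): CharWindow[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("need size > 0 and 0 <= overlap < size");
  }
  const windows: CharWindow[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    const end = Math.min(start + size, text.length);
    windows.push({ start, end });
    if (end === text.length) break;
  }
  return windows;
}

// One vector per window: the mean of the tokens whose start offset
// falls inside that window.
function poolWindows(
  tokenEmbeddings: number[][],
  offsets: number[],
  windows: CharWindow[],
): number[][] {
  return windows.map((w) =>
    meanPool(
      tokenEmbeddings.filter((_, i) => offsets[i] >= w.start && offsets[i] < w.end),
    ),
  );
}

// Toy data: four tokens starting at characters 0, 5, 10, and 15.
const toyTokens = [[1, 0], [3, 0], [0, 2], [0, 4]];
const toyOffsets = [0, 5, 10, 15];
const pooled = poolWindows(toyTokens, toyOffsets, makeCharWindows("x".repeat(20), 10, 0));
// [[2, 0], [0, 3]]
```

The context-awareness comes entirely from the model that produced the token embeddings; the pooling itself is the same mean you would apply to any token window.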

6. Parent–child (small-to-big) retrieval

Index small chunks for matching, but feed larger parent chunks to the LLM for reading. You get precise retrieval and rich context — the best of both sizes. This is one of the highest-ROI patterns in production RAG.

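// Index small chunks (250 chars) for precise similarity.
// chunkOverlap is set explicitly: the library default (200 at the time of
// writing) would be 80% of a 250-char child.
const childSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 250,
  chunkOverlap: 25,
});
// Keep larger parent chunks (1500 chars) for LLM context
const parentSplitter = new RecursiveCharacterTextSplitter({ chunkSize: 1500 });

const parents = await parentSplitter.splitDocuments(docs);
// splitText is async, so resolve every parent's children before flattening.
const children = (
  await Promise.all(
    parents.map(async (p, pid) =>
      (await childSplitter.splitText(p.pageContent)).map((text) => ({
        text,
        parentId: pid,
      })),
    ),
  )
).flat();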

// Embed and store children; keep parents in a doc store keyed by parentId.
// At query time: search children → dedupe parent IDs → return parent text.
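The query-time half (search children, dedupe parent IDs, return parent text) might look like the following. A sketch only: `ChildHit`, `parentsForHits`, and the `Map`-based parent store are hypothetical stand-ins for whatever your vector DB and doc store return.

```typescript
type ChildHit = { parentId: number; score: number };

// Walk child hits from best to worst score, keep each parent once,
// and stop after maxParents distinct parents.
function parentsForHits(
  hits: ChildHit[],
  parentStore: Map<number, string>,
  maxParents = 3,
): string[] {
  const seen = new Set<number>();
  const texts: string[] = [];
  const ranked = [...hits].sort((a, b) => b.score - a.score);
  for (const hit of ranked) {
    if (texts.length >= maxParents) break;
    if (seen.has(hit.parentId)) continue;
    seen.add(hit.parentId);
    const text = parentStore.get(hit.parentId);
    if (text !== undefined) texts.push(text);
  }
  return texts;
}

const parentStore = new Map<number, string>([
  [0, "Parent A"],
  [1, "Parent B"],
  [2, "Parent C"],
]);

const context = parentsForHits(
  [
    { parentId: 0, score: 0.9 },
    { parentId: 2, score: 0.8 },
    { parentId: 0, score: 0.7 }, // second child of the same parent
    { parentId: 1, score: 0.6 },
  ],
  parentStore,
  2,
);
// ["Parent A", "Parent C"]
```

Deduplicating before truncating matters: several children of one parent often rank near the top, and without the `seen` set the same parent text would be sent to the LLM more than once.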

Always attach metadata

Every chunk should carry the metadata you'll want to filter or cite on later. Adding it at ingest time is cheap; back-filling it later means re-indexing the whole corpus.

const enriched = chunks.map((c, i) => ({
  id: `${doc.id}#${i}`,
  text: c.pageContent,
  metadata: {
    source: doc.url,
    title: doc.title,
    section: c.metadata.h2 ?? null,
    page: c.metadata.page ?? null,
    createdAt: doc.createdAt,
    tenantId: doc.tenantId,        // for multi-tenant pre-filtering
    embeddingModel: "text-embedding-3-small",
    embeddingVersion: "v1",         // bump on model change
  },
}));

Pre-filter before vector search

Filtering by tenantId, date, or section at the database level is orders of magnitude cheaper than retrieving 1000 candidates and filtering in app code. Every serious vector DB supports it.
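Besides cost, filtering after the top-k cut can also return too few results. The sketch below uses an in-memory array as a stand-in for a vector DB; `Row`, `preFilterSearch`, and `postFilterSearch` are illustrative names, and a real database would apply the filter inside its index rather than with `Array.filter`.

```typescript
type Row = { id: string; tenantId: string; vector: number[] };

const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, v, i) => sum + v * b[i], 0);

// Rank rows by dot-product score against the query and keep the best k.
function topK(rows: Row[], query: number[], k: number): Row[] {
  return [...rows]
    .sort((a, b) => dot(b.vector, query) - dot(a.vector, query))
    .slice(0, k);
}

// Pre-filter: restrict to the tenant first, then rank.
function preFilterSearch(rows: Row[], query: number[], tenantId: string, k: number): string[] {
  return topK(rows.filter((r) => r.tenantId === tenantId), query, k).map((r) => r.id);
}

// Post-filter: rank everything, then drop other tenants' rows.
function postFilterSearch(rows: Row[], query: number[], tenantId: string, k: number): string[] {
  return topK(rows, query, k)
    .filter((r) => r.tenantId === tenantId)
    .map((r) => r.id);
}

const tenantRows: Row[] = [
  { id: "b1", tenantId: "B", vector: [0.99, 0] },
  { id: "b2", tenantId: "B", vector: [0.98, 0] },
  { id: "a1", tenantId: "A", vector: [0.97, 0] },
  { id: "a2", tenantId: "A", vector: [0.5, 0] },
];
const tenantQuery = [1, 0];

const pre = preFilterSearch(tenantRows, tenantQuery, "A", 2); // ["a1", "a2"]
const post = postFilterSearch(tenantRows, tenantQuery, "A", 2); // []
```

Here tenant B's rows occupy the whole top-2, so post-filtering leaves tenant A with nothing even though it has relevant rows.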

How to pick a chunk size (with evals)

Don't guess. Hold out 50–200 question/gold-passage pairs and sweep chunk size against recall@k. The curve is usually flat over a wide range, then falls off a cliff — pick a size comfortably inside the flat region.

const sizes = [400, 600, 800, 1000, 1200, 1600];
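// chunkAll, reindex, and evalRecallAtK are placeholders for your own pipeline.
const results: { size: number; recall: number }[] = [];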
for (const size of sizes) {
  const chunks = chunkAll(corpus, size, Math.round(size * 0.15));
  await reindex(chunks);
  const recall = await evalRecallAtK(evalSet, 5);
  results.push({ size, recall });
}
console.table(results);
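The metric itself is small enough to write by hand. A minimal sketch, assuming one gold passage per question (as in the eval set described above) and that each gold passage has already been mapped to a chunk ID for the current chunking; `recallAtK` is a hypothetical helper. When sweeping chunk sizes the IDs change, so in practice you would match on source offsets or text containment instead.

```typescript
// Fraction of questions whose gold chunk appears among the top-k retrieved IDs.
// rankedIds[i] is the retriever's ranked output for question i.
function recallAtK(rankedIds: string[][], goldIds: string[], k: number): number {
  if (rankedIds.length !== goldIds.length) {
    throw new RangeError("need one ranked list per gold ID");
  }
  if (goldIds.length === 0) return 0;
  let hits = 0;
  rankedIds.forEach((ranked, i) => {
    if (ranked.slice(0, k).includes(goldIds[i])) hits++;
  });
  return hits / goldIds.length;
}

const ranked = [
  ["a", "b", "c"], // gold "b" at rank 2
  ["x", "y", "z"], // gold "z" at rank 3
  ["p", "q", "r"], // gold never retrieved
];
const gold = ["b", "z", "missing"];

const recallAt2 = recallAtK(ranked, gold, 2); // 1/3
const recallAt3 = recallAtK(ranked, gold, 3); // 2/3
```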

Common pitfalls

Zero overlap: answers cut in half
Huge chunks: signal drowned in noise
Tiny chunks: LLM lacks context to reason
Ignoring structure: headings and tables shredded
No metadata: can't filter or cite
Mixed embedding versions: silent recall collapse
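The last pitfall is cheap to guard against if every chunk records which model embedded it. A sketch, assuming chunks carry `embeddingModel` and `embeddingVersion` in their metadata; `findMismatched` is a hypothetical helper you might run before serving queries or after a partial re-index.

```typescript
type EmbeddingTag = { embeddingModel: string; embeddingVersion: string };
type TaggedChunk = { id: string; metadata: EmbeddingTag };

// IDs of chunks that were embedded with a different model or version
// than the one the query side is about to use.
function findMismatched(chunks: TaggedChunk[], expected: EmbeddingTag): string[] {
  return chunks
    .filter(
      (c) =>
        c.metadata.embeddingModel !== expected.embeddingModel ||
        c.metadata.embeddingVersion !== expected.embeddingVersion,
    )
    .map((c) => c.id);
}

const stale = findMismatched(
  [
    { id: "doc1#0", metadata: { embeddingModel: "text-embedding-3-small", embeddingVersion: "v1" } },
    { id: "doc1#1", metadata: { embeddingModel: "text-embedding-3-small", embeddingVersion: "v2" } },
    { id: "doc2#0", metadata: { embeddingModel: "text-embedding-3-large", embeddingVersion: "v2" } },
  ],
  { embeddingModel: "text-embedding-3-small", embeddingVersion: "v2" },
);
// ["doc1#0", "doc2#0"]
```

Vectors from different models live in unrelated spaces, so any non-empty result here means those chunks need re-embedding before they can be searched reliably.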

A sensible default to start with

When in doubt: recursive character splitter, chunk size 800, overlap 120, with structure-aware splitting for Markdown/code and parent–child retrieval if your evals show retrieval is precise but answers lack context. Tune from there with real measurements.