FAISS

FAISS (Facebook AI Similarity Search) is a library — not a server — for efficient similarity search and clustering of dense vectors. It powers the indexing layer inside many modern vector databases.

At a glance

License: MIT

Maintainer: Meta AI Research

Language: C++ with Python bindings

Hardware: CPU + CUDA GPU

Scale: Billions of vectors on a single node

Form factor: Embedded library (no server)

What FAISS actually is

FAISS is a library, not a database. There's no server, no auth, no replication, no payload storage. You hand it a float32 NumPy array of vectors, it builds an index in memory, and you can then ask it for the nearest neighbors of any query vector. Persistence is opt-in: you call faiss.write_index and ship the resulting file yourself.

That minimalism is the point. Pinecone, Milvus, Weaviate, and others use FAISS (or ideas from it) under the hood and add the database surface on top — metadata filtering, sharding, RBAC, REST APIs. If you don't need those, FAISS is the fastest path to a working ANN index.

Install and build your first index

bash

# CPU-only build
pip install faiss-cpu

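# Or, with CUDA support. The FAISS project publishes its GPU builds through conda;
# the faiss-gpu wheel on PyPI is community-maintained and may lag behind.
conda install -c pytorch -c nvidia faiss-gpu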

python

import numpy as np
import faiss

d = 768                # embedding dimension
nb = 100_000           # database size
xb = np.random.random((nb, d)).astype("float32")
xq = np.random.random((5, d)).astype("float32")  # 5 queries

# Flat = brute force, exact, baseline for recall
index = faiss.IndexFlatL2(d)
index.add(xb)                          # no training needed for Flat
D, I = index.search(xq, k=10)          # distances + ids, shape (5, 10)
print(I[0])                            # 10 nearest neighbors of query 0
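
To make "brute force, exact" concrete, here is a minimal NumPy-only sketch of the same search. The helper name brute_force_l2 is made up for this illustration and is not part of FAISS. It also shows one detail worth knowing: IndexFlatL2 reports squared L2 distances, not their square roots.

```python
import numpy as np

def brute_force_l2(xb: np.ndarray, xq: np.ndarray, k: int):
    """Exact k-NN by squared L2 distance, the quantity IndexFlatL2 reports."""
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, evaluated for all pairs at once
    d2 = (xq ** 2).sum(1, keepdims=True) - 2.0 * xq @ xb.T + (xb ** 2).sum(1)
    ids = np.argsort(d2, axis=1)[:, :k]            # k smallest per query
    dists = np.take_along_axis(d2, ids, axis=1)
    return dists, ids

rng = np.random.default_rng(0)
xb = rng.random((1000, 64), dtype=np.float32)
xq = xb[:3].copy()                  # query with three stored vectors
D, I = brute_force_l2(xb, xq, k=5)
print(I[:, 0])                      # each query's nearest neighbor is itself
```

This costs O(N · d) work per query, which is why the approximate index types exist. A Flat index is still the right tool for computing ground truth when you measure recall.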

The index zoo

FAISS exposes a family of index types. You pick one based on a three-way tradeoff between recall, latency, and memory.

IndexFlatL2 / IndexFlatIP: Exact, slow, 100% recall

IndexIVFFlat: Inverted file, faster, recall tunable via nprobe

IndexIVFPQ: IVF + product quantization, tiny memory

IndexHNSWFlat: Graph-based, great latency, more RAM

IndexLSH: Binary hashing, very compact

IndexIVFScalarQuantizer: Cheap compression, decent recall

IVF: partition the space

IndexIVFFlat first runs k-means to find nlist centroids, then assigns each vector to its closest centroid's "cell". At query time, FAISS only scans the nprobe nearest cells instead of the full dataset. Bigger nprobe → higher recall but slower.

python

d, nlist = 768, 316        # ≈ sqrt(nb) for nb = 100_000; far more cells than that leaves too few training points per centroid
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)

index.train(xb)            # learns centroids — required for IVF
index.add(xb)
index.nprobe = 16          # tune: 1 = fast/low recall, 64 = slow/high recall
D, I = index.search(xq, 10)

Rule of thumb for nlist

A common heuristic is nlist ≈ sqrt(N) for datasets up to ~10M, and nlist ≈ 4 · sqrt(N) beyond that. Always sweep nprobe against a held-out recall@k target.

PQ: compress vectors aggressively

Product quantization splits each vector into m sub-vectors and replaces each with an 8-bit code from a learned codebook. A 768-d float32 vector (3 KB) collapses to ~m bytes — often a 32-64× memory reduction with surprisingly small recall loss. This is how FAISS scales to a billion vectors on a single machine.

python

d, nlist, m, nbits = 768, 316, 96, 8   # nlist ≈ sqrt(nb) for nb = 100_000
# m must divide d. 96 sub-vectors × 8 bits ⇒ 96 bytes per vector.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)            # learns centroids AND PQ codebooks
index.add(xb)
index.nprobe = 32
D, I = index.search(xq, 10)
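
The compression arithmetic is easy to check by hand. The sketch below uses two made-up helpers, pq_code_bytes and raw_bytes, and counts only the PQ codes themselves. An IVF index also stores a 64-bit id per vector plus the centroids and codebooks, so real memory use is somewhat higher.

```python
def pq_code_bytes(m: int, nbits: int = 8) -> int:
    """Bytes in one PQ code: m sub-quantizers x nbits each, rounded up."""
    return (m * nbits + 7) // 8

def raw_bytes(d: int) -> int:
    """Bytes in one uncompressed float32 vector."""
    return 4 * d

d = 768
for m in (48, 96):
    assert d % m == 0, "m must divide d"
    code = pq_code_bytes(m)
    ratio = raw_bytes(d) / code
    print(f"m={m}: {code} B/vector, {ratio:.0f}x smaller than float32")

# Codes only, for one billion vectors at m=96
total_gb = 1_000_000_000 * pq_code_bytes(96) / 1e9
print(f"{total_gb:.0f} GB of PQ codes")
```

With d = 768, m = 48 gives the 64× end of the range and m = 96 the 32× end. Fewer sub-vectors save more memory but quantize each one more coarsely, so recall usually drops as m shrinks.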

HNSW: graph-based, no training

IndexHNSWFlat builds a hierarchical small-world graph and walks it greedily at query time. It needs no training, gives excellent latency at moderate scale, and is what most "vector DB" products expose by default — but it keeps full float vectors in RAM, so it's memory-hungry.

python

index = faiss.IndexHNSWFlat(d, 32)     # 32 = M, neighbors per node
index.hnsw.efConstruction = 200        # build-time quality
index.add(xb)
index.hnsw.efSearch = 64               # query-time quality/latency knob
D, I = index.search(xq, 10)

GPU acceleration

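FAISS has first-class CUDA support. Supported index types (Flat, IVFFlat, IVFPQ, IVFScalarQuantizer) move to the GPU with one call; IndexHNSWFlat is CPU-only. On large query batches, search throughput can jump by roughly 5-20×, depending on hardware, index type, and batch size.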

python

res = faiss.StandardGpuResources()
cpu_index = faiss.IndexFlatL2(d)
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)  # device 0
gpu_index.add(xb)
D, I = gpu_index.search(xq, 10)

Mapping your own IDs

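By default FAISS returns sequential ids in insertion order (0..N-1). Wrap an index such as IndexFlatL2 with IndexIDMap to attach your own 64-bit integer ids, which is useful for joining results back to a primary database. IVF indexes accept add_with_ids directly and need no wrapper.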

python

base = faiss.IndexFlatL2(d)
index = faiss.IndexIDMap(base)
ids = np.arange(1_000_000, 1_000_000 + nb).astype("int64")
index.add_with_ids(xb, ids)

D, I = index.search(xq, 5)
print(I[0])    # actual ids like [1000042, 1000917, ...]

Persist and reload

python

faiss.write_index(index, "docs.faiss")
# ... later, in another process ...
index = faiss.read_index("docs.faiss")

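The file is a self-contained binary: copy it to S3, ship it with your model, or, for index types that support it, memory-map it at load time with faiss.read_index(path, faiss.IO_FLAG_MMAP) for fast cold starts. A GPU index has to be converted back with faiss.index_gpu_to_cpu before it can be written.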

When to use FAISS vs a hosted vector DB

Reach for FAISS when you own the data pipeline and want maximum control, or when you need GPU search or extreme compression (PQ).

Reach for a DB when you need filtering, multi-tenancy, or REST APIs, or when you need replication, HA, and live updates.

Live updates are FAISS's weakest spot

IVF/PQ indexes are trained on a sample of your data. As your distribution drifts, recall slowly degrades and you need to retrain and rebuild. If you do constant inserts and deletes, a managed DB handles the bookkeeping for you.

A practical recipe

For most RAG workloads with 100K-50M chunks, start with IndexHNSWFlat(d, 32) — no training, great recall, simple to reason about. Move to IndexIVFPQ when RAM becomes the bottleneck (typically past ~10M vectors at 768-1536 dims), and switch to GPU when query QPS matters more than memory cost.
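
The RAM crossover in that recipe can be estimated before building anything. The two functions below are rough, hypothetical estimators: the HNSW figure assumes float32 vectors plus about 2·M 32-bit links per vector, and the IVFPQ figure counts the PQ code plus a 64-bit id. Both ignore allocator overhead, centroids, and codebooks.

```python
def hnsw_flat_bytes(n: int, d: int, M: int = 32) -> int:
    """Approximate RAM for IndexHNSWFlat: float32 vectors + ~2*M int32 links."""
    return n * (4 * d + 2 * M * 4)

def ivfpq_bytes(n: int, m: int, nbits: int = 8, id_bytes: int = 8) -> int:
    """Approximate RAM for IndexIVFPQ: PQ code + a 64-bit id per vector."""
    return n * ((m * nbits + 7) // 8 + id_bytes)

GIB = 1024 ** 3
d = 768
for n in (1_000_000, 10_000_000, 50_000_000):
    hnsw = hnsw_flat_bytes(n, d) / GIB
    ivfpq = ivfpq_bytes(n, m=96) / GIB
    print(f"N={n:>11,}  HNSWFlat ~ {hnsw:6.1f} GiB   IVFPQ(m=96) ~ {ivfpq:5.1f} GiB")
```

Compare the HNSW column with the RAM you actually have, leaving headroom for the rest of the process. When it no longer fits, that is the point to trade some recall for PQ compression.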