FAISS
FAISS (Facebook AI Similarity Search) is a library — not a server — for efficient similarity search and clustering of dense vectors. It powers the indexing layer inside many modern vector databases.
At a glance
What FAISS actually is
FAISS is a library, not a database. There's no server, no auth, no replication, no payload storage. You hand it a NumPy array of vectors, it builds an index in memory, and you can then ask it for the nearest neighbors of any query vector. Persistence is opt-in: you call faiss.write_index and ship the resulting file yourself.
That minimalism is the point. Vector databases either embed FAISS directly (Milvus, for example, ships FAISS-based index types) or reimplement ideas from it, and then add the database surface on top: metadata filtering, sharding, RBAC, REST APIs. If you don't need those, FAISS is one of the quickest paths to a working ANN index.
Install and build your first index
# CPU-only build
pip install faiss-cpu
# Or, with CUDA support (must match your CUDA toolkit version).
# The FAISS project recommends conda for GPU builds; the pip package may lag behind.
pip install faiss-gpu

import numpy as np
import faiss
d = 768 # embedding dimension
nb = 100_000 # database size
xb = np.random.random((nb, d)).astype("float32")
xq = np.random.random((5, d)).astype("float32") # 5 queries
# Flat = brute force, exact, baseline for recall
index = faiss.IndexFlatL2(d)
index.add(xb) # no training needed for Flat
D, I = index.search(xq, k=10) # squared L2 distances + ids, each of shape (5, 10)
print(I[0]) # 10 nearest neighbors of query 0
The index zoo
FAISS exposes a family of index types. You pick one based on a three-way tradeoff between recall, latency, and memory.
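Recall in this tradeoff is usually measured as recall@k: the share of the exact top-k neighbors (as returned by a brute-force index such as IndexFlatL2) that the approximate index also returns. The helper below is a minimal sketch under that definition. The name recall_at_k and the toy id lists are illustrative and not part of FAISS.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Average fraction of the true top-k neighbors found by the approximate index.

    Both arguments are sequences of per-query id rows, e.g. the `I` arrays
    returned by index.search on an approximate index and on a Flat index.
    """
    hits = 0
    for approx_row, exact_row in zip(approx_ids, exact_ids):
        # Order inside the top-k does not matter, only membership.
        hits += len(set(approx_row[:k]) & set(exact_row[:k]))
    return hits / (len(exact_ids) * k)


# Toy example: two queries, k = 3. The first query misses one true neighbor.
exact = [[3, 7, 1], [4, 0, 9]]
approx = [[3, 1, 8], [4, 0, 9]]
print(recall_at_k(approx, exact, k=3))
```

In practice you would compute exact once with a Flat index on a held-out query set and reuse it while tuning each approximate index.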
IVF: partition the space
IndexIVFFlat first runs k-means to find nlist centroids, then assigns each vector to its closest centroid's "cell". At query time, FAISS only scans the nprobe nearest cells instead of the full dataset. Bigger nprobe → higher recall but slower.
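To make the cell-probing idea concrete, here is a toy pure-Python sketch of the same scheme. It is not how FAISS is implemented: the centroids are hand-picked rather than learned with k-means, and the helper names assign_to_cells and ivf_search are hypothetical.

```python
import math


def assign_to_cells(vectors, centroids):
    """Build inverted lists: cell id -> ids of the vectors closest to that centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
        cells[nearest].append(vid)
    return cells


def ivf_search(query, vectors, centroids, cells, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in cells[c]]
    candidates.sort(key=lambda vid: math.dist(query, vectors[vid]))
    return candidates[:k]


# Assumption: centroids are given; FAISS would learn them in index.train().
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
vectors = [(1.0, 1.0), (4.9, 0.0), (5.1, 0.0), (9.0, 1.0), (0.0, 9.0)]
cells = assign_to_cells(vectors, centroids)

query = (5.2, 0.0)
print(ivf_search(query, vectors, centroids, cells, k=2, nprobe=1))
print(ivf_search(query, vectors, centroids, cells, k=2, nprobe=3))
```

Vector 1 sits just across the cell boundary from the query, so probing a single cell misses it, while probing all three cells recovers the exact top 2. That boundary effect is the recall cost that a larger nprobe buys back.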
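# nlist = 4096 is sized for millions of vectors; with the 100k-vector xb above,
# FAISS still trains but warns that there are too few points per centroid.
d, nlist = 768, 4096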
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
index.train(xb) # learns centroids — required for IVF
index.add(xb)
index.nprobe = 16 # tune: 1 = fast/low recall, 64 = slow/high recall
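D, I = index.search(xq, 10)
Rule of thumb: start with nlist ≈ sqrt(N) for datasets up to ~10M vectors, and nlist ≈ 4 · sqrt(N) beyond that. FAISS's own index-selection guidelines suggest somewhat larger values (4 · sqrt(N) to 16 · sqrt(N) for datasets under 1M vectors), so treat these as starting points. Always sweep nprobe against a held-out recall@k target.
PQ: compress vectors aggressively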
Product quantization splits each vector into m sub-vectors and replaces each with an 8-bit code from a learned codebook. A 768-d float32 vector (3 KB) collapses to ~m bytes — often a 32-64× memory reduction with surprisingly small recall loss. This is how FAISS scales to a billion vectors on a single machine.
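The compression arithmetic can be checked in a few lines. This is a sketch of the per-vector code size only; it ignores the codebooks and any per-vector ids, and the function name pq_footprint is made up for illustration.

```python
def pq_footprint(d, m, nbits=8, float_bytes=4):
    """Return (raw_bytes, code_bytes, compression_ratio) for a single vector."""
    if d % m != 0:
        raise ValueError("m must divide d")
    raw_bytes = d * float_bytes          # uncompressed float32 vector
    code_bytes = (m * nbits + 7) // 8    # one nbits-wide code per sub-vector
    return raw_bytes, code_bytes, raw_bytes / code_bytes


for m in (48, 96):
    print(m, pq_footprint(768, m))
```

For d = 768, m = 96 gives 96 bytes per vector (32× smaller than the 3072-byte original) and m = 48 gives 48 bytes (64× smaller), which is where the 32-64× range comes from.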
d, nlist, m, nbits = 768, 4096, 96, 8
# m must divide d. 96 sub-vectors × 8 bits ⇒ 96 bytes per vector.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb) # learns centroids AND PQ codebooks
index.add(xb)
index.nprobe = 32
D, I = index.search(xq, 10)
HNSW: graph-based, no training
IndexHNSWFlat builds a Hierarchical Navigable Small World (HNSW) graph and walks it greedily at query time. It needs no training, gives excellent latency at moderate scale, and is what most "vector DB" products expose by default. The catch is that it keeps full float vectors, plus the graph links, in RAM, so it's memory-hungry.
index = faiss.IndexHNSWFlat(d, 32) # 32 = M, neighbors per node
index.hnsw.efConstruction = 200 # build-time quality
index.add(xb)
index.hnsw.efSearch = 64 # query-time quality/latency knob
D, I = index.search(xq, 10)
GPU acceleration
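FAISS has first-class CUDA support. Index types with a GPU implementation (Flat, IVFFlat, and IVFPQ among them) move to the GPU with one call; graph indexes such as IndexHNSWFlat cannot be moved this way. On large query batches, search throughput can jump 5-20×, depending on hardware and index type.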
res = faiss.StandardGpuResources()
cpu_index = faiss.IndexFlatL2(d)
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index) # device 0
gpu_index.add(xb)
D, I = gpu_index.search(xq, 10)
Mapping your own IDs
By default FAISS returns row positions (0..N-1). Wrap any index with IndexIDMap to attach your own 64-bit ids — useful for joining results back to a primary database.
base = faiss.IndexFlatL2(d)
index = faiss.IndexIDMap(base)
ids = np.arange(1_000_000, 1_000_000 + nb).astype("int64")
index.add_with_ids(xb, ids)
D, I = index.search(xq, 5)
print(I[0]) # actual ids like [1000042, 1000917, ...]
Persist and reload
faiss.write_index(index, "docs.faiss")
# ... later, in another process ...
index = faiss.read_index("docs.faiss")
The file is a self-contained binary: copy it to S3, ship it with your model, or mmap it for fast cold starts.
When to use FAISS vs a hosted vector DB
A practical recipe
For most RAG workloads with 100K-50M chunks, start with IndexHNSWFlat(d, 32) — no training, great recall, simple to reason about. Move to IndexIVFPQ when RAM becomes the bottleneck (typically past ~10M vectors at 768-1536 dims), and switch to GPU when query QPS matters more than memory cost.
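To see where RAM becomes the bottleneck, a back-of-the-envelope estimate helps. The sketch below assumes the rough per-vector costs from the FAISS wiki's index guidelines: d * 4 + M * 2 * 4 bytes for IndexHNSWFlat, and the PQ code plus an 8-byte id for IndexIVFPQ. It ignores codebooks, centroids, and upper graph layers, and the function names are illustrative.

```python
def hnsw_flat_bytes(n, d, M=32):
    """Approximate RAM for IndexHNSWFlat: float32 vectors plus base-layer links."""
    return n * (d * 4 + M * 2 * 4)


def ivfpq_bytes(n, m, nbits=8):
    """Approximate RAM for IndexIVFPQ: PQ code plus an 8-byte id per vector."""
    return n * ((m * nbits + 7) // 8 + 8)


# Compare the two at d = 768, using M = 32 and m = 96 as in the examples above.
for n in (1_000_000, 10_000_000, 50_000_000):
    hnsw_gb = hnsw_flat_bytes(n, 768) / 1e9
    ivfpq_gb = ivfpq_bytes(n, 96) / 1e9
    print(f"{n:>11,} vectors: HNSWFlat ~{hnsw_gb:.2f} GB, IVFPQ ~{ivfpq_gb:.2f} GB")
```

Under these assumptions, 10M vectors at 768 dims need about 33 GB as IndexHNSWFlat versus about 1 GB as IndexIVFPQ, which is why the switch tends to happen around that scale.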