FAISS

FAISS (Facebook AI Similarity Search) is a library — not a server — for efficient similarity search and clustering of dense vectors. It powers the indexing layer inside many modern vector databases.

At a glance

License: MIT
Maintainer: Meta AI Research
Language: C++ with Python bindings
Hardware: CPU + CUDA GPU
Scale: Billions of vectors on a single node
Form factor: Embedded library (no server)

What FAISS actually is

FAISS is a library, not a database. There's no server, no auth, no replication, no payload storage. You hand it a float32 NumPy array of vectors, it builds an index in memory, and you then ask it for the nearest neighbors of any query vector. Persistence is opt-in: you call faiss.write_index and ship the resulting file yourself.

That minimalism is the point. Pinecone, Milvus, Weaviate, and others use FAISS (or ideas from it) under the hood and add the database surface on top — metadata filtering, sharding, RBAC, REST APIs. If you don't need those, FAISS is the fastest path to a working ANN index.

Install and build your first index

bash
# CPU-only build
pip install faiss-cpu

# Or, with CUDA support (the build must match your CUDA toolkit version;
# upstream FAISS documents conda as its recommended install route for GPU builds)
pip install faiss-gpu
python
import numpy as np
import faiss

d = 768                # embedding dimension
nb = 100_000           # database size
xb = np.random.random((nb, d)).astype("float32")
xq = np.random.random((5, d)).astype("float32")  # 5 queries

# Flat = brute force, exact, baseline for recall
index = faiss.IndexFlatL2(d)
index.add(xb)                          # no training needed for Flat
D, I = index.search(xq, k=10)          # distances + ids, shape (5, 10)
print(I[0])                            # 10 nearest neighbors of query 0
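
To make "brute force" concrete, here is a small NumPy sketch of the kind of result a flat L2 index returns: squared Euclidean distances and the row positions of the k smallest. It illustrates the idea only and is not how FAISS implements it; the helper name `brute_force_l2` and the toy array sizes are made up for this example.

```python
import numpy as np

def brute_force_l2(xb, xq, k):
    """Exact k-NN by squared L2 distance (hypothetical helper, not a FAISS API)."""
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, computed for all pairs at once
    d2 = (
        (xq ** 2).sum(axis=1, keepdims=True)
        - 2.0 * xq @ xb.T
        + (xb ** 2).sum(axis=1)
    )
    I = np.argsort(d2, axis=1)[:, :k]          # positions of the k closest rows
    D = np.take_along_axis(d2, I, axis=1)      # their squared distances
    return D, I

rng = np.random.default_rng(0)
xb_small = rng.random((1000, 16), dtype=np.float32)
xq_small = xb_small[:3].copy()                 # queries copied from the database
D, I = brute_force_l2(xb_small, xq_small, k=5)
print(I[:, 0])                                 # each query's nearest neighbor is itself
```

Every query touches every database row, which is why Flat serves as the recall baseline and why the other index types exist.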

The index zoo

FAISS exposes a family of index types. You pick one based on a three-way tradeoff between recall, latency, and memory.

IndexFlatL2 / IndexFlatIP: Exact, slow, 100% recall
IndexIVFFlat: Inverted file, faster, recall tunable via nprobe
IndexIVFPQ: IVF + product quantization, tiny memory
IndexHNSWFlat: Graph-based, great latency, more RAM
IndexLSH: Binary hashing, very compact
IndexIVFScalarQuantizer: Cheap compression, decent recall

IVF: partition the space

IndexIVFFlat first runs k-means to find nlist centroids, then assigns each vector to its closest centroid's "cell". At query time, FAISS only scans the nprobe nearest cells instead of the full dataset. Bigger nprobe → higher recall but slower.
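
The train / assign / probe steps described above can be sketched in plain NumPy. This is a toy illustration under simplifying assumptions (a handful of Lloyd iterations, tiny data, hypothetical helper names such as `ivf_search`), not FAISS's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
xb = rng.random((2000, 8), dtype=np.float32)
nlist, nprobe, k = 16, 4, 5

def sq_dists(a, b):
    # pairwise squared L2 distances between rows of a and rows of b
    return (a ** 2).sum(1, keepdims=True) - 2.0 * a @ b.T + (b ** 2).sum(1)

# "Train": a few Lloyd iterations of k-means to place nlist centroids.
centroids = xb[rng.choice(len(xb), nlist, replace=False)].copy()
for _ in range(10):
    assign = sq_dists(xb, centroids).argmin(axis=1)
    for c in range(nlist):
        members = xb[assign == c]
        if len(members):                       # leave empty cells where they are
            centroids[c] = members.mean(axis=0)

# "Add": record which cell each vector falls into (the inverted lists).
assign = sq_dists(xb, centroids).argmin(axis=1)
cells = [np.flatnonzero(assign == c) for c in range(nlist)]

def ivf_search(q, nprobe, k):
    # Rank cells by centroid distance and scan only the nprobe closest.
    probe = np.argsort(sq_dists(q[None, :], centroids)[0])[:nprobe]
    cand = np.concatenate([cells[c] for c in probe])
    d2 = sq_dists(q[None, :], xb[cand])[0]
    order = np.argsort(d2)[:k]
    return cand[order], len(cand)

ids, scanned = ivf_search(xb[0], nprobe, k)
print(ids[0], scanned, "of", len(xb), "vectors scanned")
```

Only the vectors in the probed cells are compared against the query, which is where the speedup comes from; a true neighbor sitting in an unprobed cell is simply missed, which is where the recall loss comes from.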

python
# nlist sized for nb = 100_000: FAISS warns when training has fewer than 39 vectors per centroid
d, nlist = 768, 1024
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)

index.train(xb)            # learns centroids — required for IVF
index.add(xb)
index.nprobe = 16          # tune: 1 = fast/low recall, 64 = slow/high recall
D, I = index.search(xq, 10)
Rule of thumb for nlist
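A common heuristic is nlist ≈ sqrt(N) for datasets up to ~10M, and nlist ≈ 4 · sqrt(N) beyond that; the FAISS wiki's own guidelines go higher, suggesting 4 · sqrt(N) to 16 · sqrt(N) for datasets under about 1M vectors. Treat either as a starting point and always sweep nprobe against a held-out recall@k target.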
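
Measuring recall@k needs nothing beyond the id arrays that two searches return. A minimal sketch, assuming the exact ids come from a Flat index and the approximate ids from the index being tuned; `recall_at_k` is a hypothetical helper, and the arrays below are toy stand-ins rather than real search output.

```python
import numpy as np

def recall_at_k(I_true, I_approx):
    """Fraction of true top-k neighbors that the approximate search also returned."""
    hits = sum(
        len(set(t) & set(a))
        for t, a in zip(I_true.tolist(), I_approx.tolist())
    )
    return hits / I_true.size

# Toy id arrays standing in for the `I` outputs of an exact and an approximate index.
I_exact = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
I_ann = np.array([[1, 2, 9, 4], [5, 6, 7, 8]])
print(recall_at_k(I_exact, I_ann))   # 7 of 8 true neighbors found → 0.875
```

In a sweep, the exact ids are computed once per query set and the approximate search is repeated for each nprobe value until the recall target is met.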

PQ: compress vectors aggressively

Product quantization splits each vector into m sub-vectors and replaces each with an 8-bit code from a learned codebook. A 768-d float32 vector (3 KB) collapses to ~m bytes — often a 32-64× memory reduction with surprisingly small recall loss. This is how FAISS scales to a billion vectors on a single machine.
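
The memory claim is simple arithmetic. The sketch below works it through for d = 768 with m = 96 sub-vectors and 8-bit codes; it counts only the vectors or codes themselves and ignores stored ids, centroids, and codebooks.

```python
d, m, nbits = 768, 96, 8          # 768-d vectors, 96 sub-vectors, 8-bit codes

raw_bytes = d * 4                 # float32 = 4 bytes per dimension
code_bytes = m * nbits // 8       # one nbits-wide code per sub-vector
ratio = raw_bytes / code_bytes

n = 1_000_000_000
print(raw_bytes, code_bytes, ratio)                 # 3072 96 32.0
print(n * raw_bytes / 1e12, "TB raw")               # 3.072 TB raw
print(n * code_bytes / 1e9, "GB as PQ codes")       # 96.0 GB as PQ codes
```

Halving m to 48 doubles the ratio to 64×, at the cost of coarser approximation per sub-vector.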

python
# nlist sized for nb = 100_000 so each centroid gets enough training vectors
d, nlist, m, nbits = 768, 1024, 96, 8
# m must divide d. 96 sub-vectors × 8 bits ⇒ 96 bytes per vector.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)            # learns centroids AND PQ codebooks
index.add(xb)
index.nprobe = 32
D, I = index.search(xq, 10)
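
For intuition about what the PQ codes are, here is a toy NumPy encode/decode sketch. It makes one loud simplification: each codebook is a random sample of training sub-vectors, whereas FAISS learns its codebooks with k-means. The names `pq_encode` and `pq_decode` are hypothetical and are not FAISS functions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, ksub = 16, 4, 256                 # 4 sub-vectors of 4 dims, 256 codes each (8 bits)
dsub = d // m
x = rng.random((5000, d), dtype=np.float32)

def sq_dists(a, b):
    # pairwise squared L2 distances between rows of a and rows of b
    return (a ** 2).sum(1, keepdims=True) - 2.0 * a @ b.T + (b ** 2).sum(1)

# "Train": one codebook per sub-space (random sample here; k-means in FAISS).
codebooks = np.stack([
    x[rng.choice(len(x), ksub, replace=False), j * dsub:(j + 1) * dsub]
    for j in range(m)
])                                      # shape (m, ksub, dsub)

def pq_encode(v):
    # For each sub-vector, store the index of its nearest codebook entry.
    return np.stack([
        sq_dists(v[:, j * dsub:(j + 1) * dsub], codebooks[j]).argmin(axis=1)
        for j in range(m)
    ], axis=1).astype(np.uint8)         # shape (n, m): m bytes per vector

def pq_decode(codes):
    # Rebuild an approximation by concatenating the selected codebook entries.
    return np.concatenate([codebooks[j][codes[:, j]] for j in range(m)], axis=1)

codes = pq_encode(x)
x_hat = pq_decode(codes)
print(codes.shape, codes.dtype, x.nbytes // codes.nbytes)   # (5000, 4) uint8 16
print(float(((x - x_hat) ** 2).mean()))                      # mean squared reconstruction error
```

The index stores only `codes`; distances at query time are computed against the approximations, which is why PQ trades a little recall for a large memory saving.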

HNSW: graph-based, no training

IndexHNSWFlat builds a Hierarchical Navigable Small World (HNSW) graph and walks it greedily at query time. It needs no training, gives excellent latency at moderate scale, and is what most "vector DB" products expose by default. The cost is memory: it keeps the full float32 vectors plus the graph links in RAM.

python
index = faiss.IndexHNSWFlat(d, 32)     # 32 = M, neighbors per node
index.hnsw.efConstruction = 200        # build-time quality
index.add(xb)
index.hnsw.efSearch = 64               # query-time quality/latency knob
D, I = index.search(xq, 10)
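
A back-of-the-envelope estimate shows how the memory adds up. The sketch assumes each vector costs its float32 storage plus up to 2·M neighbor ids stored as 32-bit integers on the base layer; the sparse upper layers and allocator overhead are ignored, so treat the result as a rough lower bound rather than an exact figure.

```python
n, d, M = 10_000_000, 768, 32

vector_bytes = d * 4              # full float32 vector kept in RAM
link_bytes = 2 * M * 4            # assumption: up to 2*M int32 neighbor ids on the base layer
per_vector = vector_bytes + link_bytes

print(per_vector)                               # 3328
print(round(n * per_vector / 2**30, 1), "GiB")  # 31.0 GiB
```

At this dimension the vectors dominate and the graph adds well under 10%, so shrinking M barely helps; compressing the vectors is what moves the number.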

GPU acceleration

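FAISS has first-class CUDA support. The common index types (Flat, IVFFlat, IVFPQ, IVF with scalar quantization) move to the GPU with one call; HNSW has no GPU implementation. On large query batches, search throughput typically improves severalfold, often in the 5-10× range the FAISS project reports over CPU and sometimes more, depending on index type and hardware.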

python
res = faiss.StandardGpuResources()
cpu_index = faiss.IndexFlatL2(d)
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)  # device 0
gpu_index.add(xb)
D, I = gpu_index.search(xq, 10)

Mapping your own IDs

By default FAISS returns row positions (0..N-1). Wrap an index such as IndexFlatL2 with IndexIDMap to attach your own 64-bit ids (IVF indexes accept add_with_ids directly). This is useful for joining results back to a primary database.

python
base = faiss.IndexFlatL2(d)
index = faiss.IndexIDMap(base)
ids = np.arange(1_000_000, 1_000_000 + nb).astype("int64")
index.add_with_ids(xb, ids)

D, I = index.search(xq, 5)
print(I[0])    # actual ids like [1000042, 1000917, ...]
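
Joining results back to a primary store is then a plain lookup. In the sketch below, the `docs` dictionary is a hypothetical stand-in for that store and the id array is hand-written rather than real search output. FAISS pads a result row with -1 when it finds fewer than k neighbors, so the lookup skips those slots.

```python
import numpy as np

# Hypothetical primary store keyed by the same ids passed to add_with_ids.
docs = {1000042: "intro.md", 1000917: "faq.md", 1000003: "setup.md"}

# Example search output for two queries with k=3; -1 marks a slot with no result.
I = np.array([[1000042, 1000917, -1],
              [1000003, 1000042, 1000917]], dtype="int64")

results = [[docs[i] for i in row if i != -1] for row in I.tolist()]
print(results)   # [['intro.md', 'faq.md'], ['setup.md', 'intro.md', 'faq.md']]
```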

Persist and reload

python
faiss.write_index(index, "docs.faiss")
# ... later, in another process ...
index = faiss.read_index("docs.faiss")

The file is a self-contained binary — copy it to S3, ship it with your model, or mmap it for fast cold starts.

When to use FAISS vs a hosted vector DB

Reach for FAISS when
You own the data pipeline and want maximum control
You need GPU search or extreme compression (PQ)
Reach for a DB when
You need filtering, multi-tenancy, REST APIs
You need replication, HA, and live updates
Live updates are FAISS's weakest spot
IVF/PQ indexes are trained on a sample of your data. As your distribution drifts, recall slowly degrades and you need to retrain and rebuild. If you do constant inserts and deletes, a managed DB handles the bookkeeping for you.

A practical recipe

For most RAG workloads with 100K-50M chunks, start with IndexHNSWFlat(d, 32): no training, great recall, simple to reason about. Move to IndexIVFPQ when RAM becomes the bottleneck (typically past ~10M vectors at 768-1536 dims), and consider a GPU-backed Flat or IVF index when query throughput matters more than memory cost.