Scaling & performance

The numbers and trade-offs you actually hit as your index grows from 1M to 1B vectors.

Memory math

A float32 vector at 1536 dimensions costs 1536 × 4 = 6,144 bytes. Multiply by roughly 1.5 to cover HNSW graph overhead, which gives about 9.2 KB per vector. One million vectors therefore need ≈ 9 GB of RAM, ten million ≈ 90 GB, and a hundred million ≈ 900 GB.

vectors | float32 HNSW | int8 HNSW | PQ-8x
--------+--------------+-----------+---------
    1 M | ~9 GB        | ~2.5 GB   | ~80 MB
   10 M | ~90 GB       | ~25 GB    | ~800 MB
  100 M | ~900 GB      | ~250 GB   | ~8 GB
    1 B | impractical  | ~2.5 TB   | ~80 GB
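
The HNSW columns can be approximated with a few lines of arithmetic. The sketch below is a rough estimator, not a sizing tool: the function names are made up for illustration, the 1.5 overhead factor is the rule of thumb from this section, and real overhead depends on your index settings. The table rounds its figures, so expect small differences.

```python
def hnsw_memory_bytes(num_vectors, dims=1536, bytes_per_component=4, graph_overhead=1.5):
    """Raw vector bytes multiplied by an assumed HNSW graph-overhead factor."""
    return num_vectors * dims * bytes_per_component * graph_overhead


def format_gb(num_bytes):
    """Render a byte count in decimal gigabytes (1 GB = 10**9 bytes)."""
    return f"{num_bytes / 1e9:,.1f} GB"


for n in (1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
    float32 = hnsw_memory_bytes(n)                       # 4 bytes per dimension
    int8 = hnsw_memory_bytes(n, bytes_per_component=1)   # 1 byte per dimension
    print(f"{n:>13,} vectors: float32 {format_gb(float32)}, int8 {format_gb(int8)}")
```

Change `dims` to match your embedding model before trusting any of the numbers.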

Quantization

Scalar quantization to int8 typically loses < 1% recall and cuts memory 4×. Product quantization (PQ) compresses 30–100× with a larger but still acceptable recall hit when paired with a small re-rank pass over uncompressed candidates.
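
To make the 4× figure concrete, here is a minimal sketch of scalar quantization in pure Python. It assumes a symmetric, per-vector scale, which is only one of several possible schemes and not necessarily what any particular engine does. The function names are illustrative.

```python
import random
from array import array


def quantize_int8(vector):
    """Map floats to integer codes in [-127, 127] with one symmetric scale per vector."""
    max_abs = max(abs(x) for x in vector) or 1.0  # avoid a zero scale for all-zero input
    scale = max_abs / 127.0
    codes = [round(x / scale) for x in vector]
    return codes, scale


def dequantize(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


random.seed(0)
vector = [random.uniform(-1.0, 1.0) for _ in range(1536)]
codes, scale = quantize_int8(vector)
restored = dequantize(codes, scale)

# Each component is off by at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(f"scale={scale:.6f}, worst-case error={max_error:.6f}")

# Packed storage: 4 bytes per float32 component versus 1 byte per int8 code.
float32_bytes = len(array("f", vector).tobytes())
int8_bytes = len(array("b", codes).tobytes())
print(f"float32: {float32_bytes} bytes, int8: {int8_bytes} bytes")
```

The reconstruction error per component is small, but its effect on recall depends on your data, so measure it rather than assume it.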

Sharding

Past ~50M vectors per node, shard horizontally by tenant or by a random hash. Query each shard in parallel and merge the results. Most managed services do this for you, but rolling your own with Qdrant or Milvus is straightforward.
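
The fan-out and merge pattern can be sketched without any particular database. In the example below the shards are plain in-memory dictionaries and `search_shard` is a brute-force stand-in for a per-node ANN query. All names are hypothetical, and none of this is the Qdrant or Milvus API.

```python
import hashlib
import heapq
from concurrent.futures import ThreadPoolExecutor


def shard_for(vector_id, num_shards):
    """Stable hash placement: the same id always maps to the same shard."""
    digest = hashlib.sha256(vector_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_shard(shard, query, k):
    """Stand-in for one node's ANN query; brute force keeps the sketch self-contained."""
    return heapq.nsmallest(k, ((squared_l2(query, vec), vid) for vid, vec in shard.items()))


def search_all(shards, query, k):
    """Fan the query out to every shard in parallel, then merge the per-shard top-k lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, query, k), shards))
    return heapq.nsmallest(k, (hit for hits in partials for hit in hits))


# Toy data: 1,000 two-dimensional vectors spread over 4 in-memory "shards".
NUM_SHARDS = 4
shards = [{} for _ in range(NUM_SHARDS)]
for i in range(1000):
    vid = f"vec-{i}"
    shards[shard_for(vid, NUM_SHARDS)][vid] = (i % 37 / 37.0, i % 101 / 101.0)

top = search_all(shards, query=(0.5, 0.5), k=5)
print([vid for _, vid in top])
```

Each shard must return its own top k, not k divided by the number of shards, because all of the global nearest neighbours may live on a single shard.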

Measure recall, not just latency

ANN parameters trade recall for speed silently. Always benchmark recall@10 against a brute-force baseline before going to production.
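
A recall@10 check needs only a brute-force scan and a set intersection. The sketch below assumes your index can be wrapped in a function `ann_search(query, k)` that returns a list of ids. That name and `toy_ann` are placeholders for illustration, not part of any library.

```python
import heapq
import random


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_top_k(query, vectors, k):
    """Exact nearest neighbours by scanning every vector: the ground truth."""
    return heapq.nsmallest(k, range(len(vectors)), key=lambda i: squared_l2(query, vectors[i]))


def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k


def mean_recall_at_k(ann_search, queries, vectors, k=10):
    """Average recall@k over a query set; `ann_search(query, k)` returns a list of ids."""
    scores = [
        recall_at_k(ann_search(q, k), brute_force_top_k(q, vectors, k), k)
        for q in queries
    ]
    return sum(scores) / len(scores)


random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(500)]
queries = [[random.random() for _ in range(8)] for _ in range(20)]


def toy_ann(query, k):
    """Hypothetical stand-in for a real index: only scans the first half of the data."""
    return brute_force_top_k(query, vectors[:250], k)


print(f"recall@10 = {mean_recall_at_k(toy_ann, queries, vectors):.2f}")
```

Run the same query set each time you change an index parameter, and record recall next to latency so a regression in one is not hidden by a gain in the other.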