# Approximate Nearest Neighbor (ANN) Indexes

An ANN (vector) index is a data structure designed to organize and search vector data efficiently by similarity under a chosen distance metric. Building a vector index narrows the search space, so a query does not need to brute-force scan every vector. A vector index is faster but less accurate than exhaustive search (kNN or flat search). LanceDB provides many parameters to tune the index's size, query speed, and result accuracy.

Currently, LanceDB does *not* automatically create the ANN index.
LanceDB has optimized code for kNN as well. For many use-cases, datasets under 100K vectors won't require index creation at all.
If you can live with <100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall.
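To make the trade-off concrete, here is a minimal pure-Python sketch of what an exhaustive (flat) kNN search does: it computes the distance from the query to every stored vector and keeps the closest `k`. The function name `flat_knn` is illustrative only and is not part of LanceDB, whose kNN path uses optimized native code rather than a Python loop.

```python
import heapq
import math


def flat_knn(vectors, query, k=10):
    """Exhaustive L2 search: score every vector, keep the k nearest.

    Illustrative only; this is not LanceDB's implementation.
    """
    scored = (
        (math.dist(vec, query), idx)  # Euclidean (L2) distance to the query
        for idx, vec in enumerate(vectors)
    )
    # Every row is scored exactly once, so recall is 100% by construction.
    return heapq.nsmallest(k, scored)


vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
result = flat_knn(vectors, [0.0, 0.1], k=2)
print(result)  # list of (distance, row index) pairs, nearest first
```

The cost grows linearly with the number of rows, which is why this approach stays attractive only while the table is small.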

In the future we will look to automatically create and configure the ANN index as data comes in.

## Types of Index

Lance can support multiple index types; the most widely used one is `IVF_PQ`.

- `IVF_PQ`: uses an **Inverted File Index (IVF)** to first divide the dataset into `N` partitions, and then uses **Product Quantization** to compress the vectors in each partition.
- `DiskANN` (**Experimental**): organizes the vectors as an on-disk graph, where the vertices approximately represent the nearest neighbors of each vector.
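The IVF half of `IVF_PQ` can be illustrated with a toy sketch in pure Python. It assumes hand-picked centroids (a real index learns them during training, typically with k-means) and leaves out the Product Quantization step entirely. The helper names `build_ivf` and `ivf_search` are hypothetical and only serve this example.

```python
import math


def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec (L2 distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))


def build_ivf(vectors, centroids):
    """Assign each vector id to the partition of its nearest centroid."""
    partitions = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        partitions[nearest_centroid(vec, centroids)].append(idx)
    return partitions


def ivf_search(vectors, centroids, partitions, query, k=1, nprobes=1):
    """Scan only the nprobes partitions whose centroids are closest to the query."""
    probe_order = sorted(range(len(centroids)),
                         key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in probe_order[:nprobes] for i in partitions[c]]
    candidates.sort(key=lambda i: math.dist(query, vectors[i]))
    return candidates[:k]


# Assumption: centroids are fixed by hand instead of being trained.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = [[0.1, 0.2], [0.3, 0.1], [9.8, 10.1], [10.2, 9.9]]
partitions = build_ivf(vectors, centroids)
print(partitions)
print(ivf_search(vectors, centroids, partitions, [9.9, 10.0], k=1, nprobes=1))
```

With `nprobes=1` only half of the toy dataset is scanned. Raising `nprobes` scans more partitions, which is where the speed versus accuracy trade-off comes from.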

## Creating an IVF_PQ Index

Lance supports the `IVF_PQ` index type by default.

Indexes are created with the `create_index` method (`createIndex` in the JavaScript SDK).

```
import lancedb
import numpy as np
uri = "data/sample-lancedb"
db = lancedb.connect(uri)
# Create 10,000 sample vectors
data = [{"vector": row, "item": f"item {i}"}
        for i, row in enumerate(np.random.random((10_000, 1536)).astype('float32'))]
# Add the vectors to a table
tbl = db.create_table("my_vectors", data=data)
# Create and train the index - you need to have enough data in the table for an effective training step
tbl.create_index(num_partitions=256, num_sub_vectors=96)
```

```
import * as vectordb from "vectordb";
const db = await vectordb.connect("data/sample-lancedb");
let data = [];
for (let i = 0; i < 10_000; i++) {
  data.push({
    vector: Array(1536).fill(i),
    id: `${i}`,
    content: "",
    longId: `${i}`,
  });
}
const table = await db.createTable("my_vectors", data);
await table.createIndex({
  type: "ivf_pq",
  column: "vector",
  num_partitions: 16,
  num_sub_vectors: 48,
});
```

- **metric** (default: "L2"): The distance metric to use. By default it uses Euclidean distance ("L2"). "cosine" and "dot" distances are also supported.
- **num_partitions** (default: 256): The number of partitions of the index.
- **num_sub_vectors** (default: 96): The number of sub-vectors (M) that will be created during Product Quantization (PQ). A D-dimensional vector is divided into `M` sub-vectors of `D/M` dimensions each, and each sub-vector is represented by a single PQ code.
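As a small illustration of the `M` and `D/M` relationship, the sketch below splits a vector into contiguous sub-vectors. The function `split_into_sub_vectors` is a hypothetical helper, not a LanceDB API, and it assumes the dimension divides evenly by `num_sub_vectors`.

```python
def split_into_sub_vectors(vector, num_sub_vectors):
    """Split a D-dimensional vector into M contiguous sub-vectors of D/M dims."""
    dim = len(vector)
    if dim % num_sub_vectors != 0:
        # Assumption of this sketch: D must divide evenly by M.
        raise ValueError(f"dimension {dim} is not divisible by {num_sub_vectors}")
    width = dim // num_sub_vectors
    return [vector[i:i + width] for i in range(0, dim, width)]


vector = [float(i) for i in range(1536)]
subs = split_into_sub_vectors(vector, 96)
print(len(subs), len(subs[0]))  # 96 sub-vectors of 16 dimensions each
```

During PQ, each of these sub-vectors is then replaced by one short code, which is what makes the index compact.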

### Use GPU to build vector index

The Lance Python SDK has experimental GPU support for creating IVF indexes. Using a GPU for index creation requires PyTorch>2.0 to be installed.

You can specify the GPU device used to train IVF partitions via:

- **accelerator**: Set to `cuda` or `mps` (on Apple Silicon) to enable GPU training.
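One possible way to choose the `accelerator` value at runtime is sketched below. The helpers `pick_accelerator` and `index_kwargs` are hypothetical names introduced for this example; the PyTorch calls `torch.cuda.is_available()` and `torch.backends.mps.is_available()` are the standard availability checks. If PyTorch is missing or no GPU is found, the sketch simply omits `accelerator` so that training falls back to the CPU.

```python
def pick_accelerator():
    """Return "cuda", "mps", or None depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch not installed: train on the CPU instead
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return None


def index_kwargs(num_partitions=256, num_sub_vectors=96):
    """Build keyword arguments for create_index, adding accelerator if one exists."""
    kwargs = {"num_partitions": num_partitions, "num_sub_vectors": num_sub_vectors}
    accelerator = pick_accelerator()
    if accelerator is not None:
        kwargs["accelerator"] = accelerator
    return kwargs


print(index_kwargs())
# With an open table `tbl`, the call would then look like:
# tbl.create_index(**index_kwargs())
```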

Troubleshooting:

If you see `AssertionError: Torch not compiled with CUDA enabled`, you need to install PyTorch with CUDA support.

## Querying an ANN Index

Vector indexes are queried with the `search` function.

A few parameters can be used to fine-tune the search:

- **limit** (default: 10): The number of results that will be returned.
- **nprobes** (default: 20): The number of probes used. A higher number makes search more accurate but also slower.
  Most of the time, setting `nprobes` to cover 5-10% of the dataset should achieve high recall with low latency.
  For example, for 1M vectors divided up into 256 partitions, `nprobes` should be set to ~20-40.
  Note: `nprobes` is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored.
- **refine_factor** (default: None): Refine the results by reading extra elements and re-ranking them in memory.
  A higher number makes search more accurate but also slower. If you find the recall is less than ideal, try `refine_factor=10` to start.
  For example, for 1M vectors divided into 256 partitions, if you're looking for the top 20, then `refine_factor=200` reranks the whole partition.
  Note: `refine_factor` is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored.

The search will return the data requested in addition to the distance of each item.
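A query that sets all three parameters might look like the sketch below. It assumes `tbl` is an open LanceDB table with an ANN index and that your installed version exposes the chained `limit`, `nprobes`, `refine_factor`, and `to_pandas` methods on the query builder. The wrapper `ann_query` and the helper `refine_pool_size` are hypothetical names used only here.

```python
def refine_pool_size(limit, refine_factor):
    """Number of candidates read and re-ranked: limit * refine_factor."""
    return limit * refine_factor


def ann_query(tbl, query_vector, limit=10, nprobes=20, refine_factor=10):
    """Run a vector search on a LanceDB table with explicit tuning parameters."""
    return (
        tbl.search(query_vector)
        .limit(limit)
        .nprobes(nprobes)
        .refine_factor(refine_factor)
        .to_pandas()
    )


# 1M vectors in 256 partitions is about 3,906 rows per partition, so a pool of
# 20 * 200 = 4,000 candidates is roughly one whole partition.
print(refine_pool_size(20, 200), 1_000_000 // 256)
```

With the table from the indexing example, a call would be `ann_query(tbl, np.random.random(1536))`, where the query vector must have the same dimension as the indexed column.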

### Filtering (where clause)

You can further filter the elements returned by a search using a where clause.
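As a hedged sketch, a filtered search could be written as follows. It assumes `tbl` is an open LanceDB table that has an `item` column, as in the indexing example, and that the predicate is passed as a SQL-style string. The wrapper name `filtered_search` is hypothetical.

```python
def filtered_search(tbl, query_vector, predicate, limit=10):
    """Vector search whose results are restricted by a where clause."""
    return (
        tbl.search(query_vector)
        .where(predicate)  # e.g. "item != 'item 1141'"
        .limit(limit)
        .to_pandas()
    )
```

For example, `filtered_search(tbl, query, "item != 'item 1141'")` would exclude that one row from the returned results.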

### Projections (select clause)

You can select the columns returned by the query using a select clause.
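A projection could be sketched like this, again assuming an open LanceDB table `tbl` and a query builder that accepts a list of column names in `select`. The wrapper name `projected_search` is hypothetical.

```python
def projected_search(tbl, query_vector, columns, limit=10):
    """Vector search that returns only the requested columns."""
    return (
        tbl.search(query_vector)
        .select(columns)  # e.g. ["item"] to skip the large vector column
        .limit(limit)
        .to_pandas()
    )
```

Leaving the vector column out of the projection keeps the result small when only identifiers or metadata are needed.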

## FAQ

### When is it necessary to create an ANN vector index?

`LanceDB` has manually-tuned SIMD code for computing vector distances.
In our benchmarks, computing 100K pairs of 1K-dimension vectors takes **less than 20ms**.
For small datasets (< 100K rows) or applications that can accept 100ms latency, vector indices are usually not necessary.

For large-scale datasets or higher-dimension vectors, it is beneficial to create a vector index.

### How big is my index, and how much memory will it take?

In LanceDB, all vector indices are **disk-based**, meaning that when responding to a vector query, only the relevant pages from the index file are loaded from disk and cached in memory. Additionally, each sub-vector is usually encoded into a 1-byte PQ code.

For example, with a 1024-dimension dataset, if we choose `num_sub_vectors=64`, each sub-vector has `1024 / 64 = 16` float32 numbers.
Product quantization can therefore reduce space by a factor of approximately `16 * sizeof(float32) / 1 = 64`.
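The arithmetic above can be wrapped in a small estimator. This is a rough sketch under stated assumptions: it counts only the PQ codes (one byte per sub-vector) and ignores IVF centroids, PQ codebooks, row ids, and file metadata, so the real index file will be somewhat larger. Both function names are hypothetical.

```python
def pq_compression_ratio(dimension, num_sub_vectors, bytes_per_value=4, bytes_per_code=1):
    """Raw bytes per sub-vector divided by bytes per PQ code."""
    sub_dim = dimension // num_sub_vectors
    return sub_dim * bytes_per_value / bytes_per_code


def pq_code_bytes(num_rows, num_sub_vectors, bytes_per_code=1):
    """Bytes needed to store the PQ codes alone (no centroids or codebooks)."""
    return num_rows * num_sub_vectors * bytes_per_code


print(pq_compression_ratio(1024, 64))  # 64.0
print(pq_code_bytes(1_000_000, 64))    # 64000000
```

So for one million 1024-dimension float32 vectors, the raw data is about 4 GB while the PQ codes take about 64 MB.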

### How to choose `num_partitions` and `num_sub_vectors` for an `IVF_PQ` index?

`num_partitions` decides how many partitions the first-level `IVF` index uses.
A higher number of partitions could lead to more efficient I/O during queries and better accuracy, but it takes much more time to train.
On the `SIFT-1M` dataset, our benchmark shows that keeping each partition at 1K-4K rows leads to a good latency/recall trade-off.

`num_sub_vectors` specifies how many Product Quantization (PQ) short codes to generate for each vector. Because
PQ is a lossy compression of the original vector, a higher `num_sub_vectors` usually results in
less space distortion, and thus yields better accuracy. However, a higher `num_sub_vectors` also causes heavier I/O and
more PQ computation, and thus higher latency. `dimension / num_sub_vectors` should be a multiple of 8 for optimum SIMD efficiency.
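Both rules of thumb can be turned into a quick helper for picking starting values. This is only a sketch: the 1K-4K rows per partition range comes from the `SIFT-1M` benchmark mentioned above and may not transfer to other datasets, and both function names are hypothetical.

```python
import math


def partition_range(num_rows, min_rows=1_000, max_rows=4_000):
    """num_partitions bounds that keep each partition at min_rows..max_rows rows."""
    low = max(1, math.ceil(num_rows / max_rows))
    high = max(low, num_rows // min_rows)
    return low, high


def simd_friendly_sub_vectors(dimension):
    """num_sub_vectors values where dimension / num_sub_vectors is a multiple of 8."""
    return [m for m in range(1, dimension + 1)
            if dimension % m == 0 and (dimension // m) % 8 == 0]


print(partition_range(1_000_000))      # (250, 1000)
print(simd_friendly_sub_vectors(128))  # [1, 2, 4, 8, 16]
```

For 1M rows, the default `num_partitions=256` falls inside the suggested range, and for 1536-dimension vectors the default `num_sub_vectors=96` gives sub-vectors of 16 dimensions, which is a multiple of 8.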