Vector Indexing and Vector Search With Lance¶
Lance provides high-performance vector search capabilities with ANN (Approximate Nearest Neighbor) indexes.
By the end of this tutorial, you'll be able to build and use ANN indexes to dramatically speed up vector search operations while maintaining high accuracy. You'll also learn how to tune search parameters for optimal performance and combine vector search with metadata queries in a single operation.
Install the Python SDK¶
Set Up Your Environment¶
First, import the necessary libraries:
import shutil
import lance
import numpy as np
import pandas as pd
import pyarrow as pa
import duckdb
Prepare Your Vector Embeddings¶
For this tutorial, download and prepare the SIFT 1M dataset for vector search experiments.
- Download
ANN_SIFT1M
from: http://corpus-texmex.irisa.fr/ - Direct link:
ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
You can just use wget
:
rm -rf sift* vec_data.lance
wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz
Convert Your Data to Lance Format¶
Then, convert the raw vector data into Lance format for efficient storage and querying.
from lance.vector import vec_to_table
import struct
uri = "vec_data.lance"
with open("sift/sift_base.fvecs", mode="rb") as fobj:
buf = fobj.read()
data = np.array(struct.unpack("<128000000f", buf[4 : 4 + 4 * 1000000 * 128])).reshape((1000000, 128))
dd = dict(zip(range(1000000), data))
table = vec_to_table(dd)
lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)
Now you can load the dataset:
Search Without an Index¶
You'll perform vector search without an index to see the baseline performance, then compare it with indexed search.
First, let's sample some query vectors:
import duckdb
# Make sure DuckDB v0.7+ is installed
samples = duckdb.query("SELECT vector FROM sift1m USING SAMPLE 100").to_df().vector
0 [29.0, 10.0, 1.0, 50.0, 7.0, 89.0, 95.0, 51.0,...
1 [7.0, 5.0, 39.0, 49.0, 17.0, 12.0, 83.0, 117.0...
2 [0.0, 0.0, 0.0, 10.0, 12.0, 31.0, 6.0, 0.0, 0....
3 [0.0, 2.0, 9.0, 1.793662034335766e-43, 30.0, 1...
4 [54.0, 112.0, 16.0, 0.0, 0.0, 7.0, 112.0, 44.0...
...
95 [1.793662034335766e-43, 33.0, 47.0, 28.0, 0.0,...
96 [1.0, 4.0, 2.0, 32.0, 3.0, 7.0, 119.0, 116.0, ...
97 [17.0, 46.0, 12.0, 0.0, 0.0, 3.0, 23.0, 58.0, ...
98 [0.0, 11.0, 30.0, 14.0, 34.0, 7.0, 0.0, 0.0, 1...
99 [20.0, 8.0, 121.0, 98.0, 37.0, 77.0, 9.0, 18.0...
Name: vector, Length: 100, dtype: object
Now, perform nearest neighbor search without an index:
import time
start = time.time()
tbl = sift1m.to_table(columns=["id"], nearest={"column": "vector", "q": samples[0], "k": 10})
end = time.time()
print(f"Time(sec): {end-start}")
print(tbl.to_pandas())
Expected output:
Time(sec): 0.10735273361206055
id vector score
0 144678 [29.0, 10.0, 1.0, 50.0, 7.0, 89.0, 95.0, 51.0,... 0.0
1 575538 [2.0, 0.0, 1.0, 42.0, 3.0, 38.0, 152.0, 27.0, ... 76908.0
2 241428 [11.0, 0.0, 2.0, 118.0, 11.0, 108.0, 116.0, 21... 92877.0
...
Without the index, the search will scan throughout the whole dataset to compute the distance between each data point. For practical real-time performance with, you will get much better performance with an ANN index.
Build the Search Index¶
If you build an ANN index - you can dramatically speed up vector search operations while maintaining high accuracy. In this example, we will build the IVF_PQ
index:
sift1m.create_index(
"vector",
index_type="IVF_PQ", # specify the IVF_PQ index type
num_partitions=256, # IVF
num_sub_vectors=16, # PQ
)
The sample response should look like this:
Building vector index: IVF256,PQ16
CPU times: user 2min 23s, sys: 2.77 s, total: 2min 26s
Wall time: 22.7 s
Sample 65536 out of 1000000 to train kmeans of 128 dim, 256 clusters
Sample 65536 out of 1000000 to train kmeans of 8 dim, 256 clusters
Sample 65536 out of 1000000 to train kmeans of 8 dim, 256 clusters
Sample 65536 out of 1000000 to train kmeans of 8 dim, 256 clusters
Sample 65536 out of 1000000 to train kmeans of 8 dim, 256 clusters
Sample 65536 out of 1000000 to train kmeans of 8 dim, 256 clusters
Sample 65536 out of 1000000 to train kmeans of 8 dim, 256 clusters
Sample 65536 out of 1000000 to train kmeans of 8 dim, 256 clusters
Sample 65536 out of 1000000 to train kmeans of 8 dim, 256 clusters
Sample 65536 out of 1000000 to train kmeans of 8 dim, 256 clusters
Sample 65536 out of 1000000 to train kmeans of 8 dim, 256 clusters
Sample 65536 out of 1000000 to train kmeans of 8 dim, 256 clusters
Sample 65536 out of 1000000 to train kmeans of 8 dim, 256 clusters
Sample 65536 out of 1000000 to train kmeans of 8 dim, 256 clusters
Sample 65536 out of 1000000 to train kmeans of 8 dim, 256 clusters
Sample 65536 out of 1000000 to train kmeans of 8 dim, 256 clusters
Sample 65536 out of 1000000 to train kmeans of 8 dim, 256 clusters
Index Creation Performance
If you're trying this on your own data, make sure your vector (dimensions / num_sub_vectors) % 8 == 0, or else index creation will take much longer than expected due to SIMD misalignment.
Vector Search with the ANN Index¶
You can now perform the same search operation using your newly created index and see the dramatic performance improvement.
sift1m = lance.dataset(uri)
import time
tot = 0
for q in samples:
start = time.time()
tbl = sift1m.to_table(nearest={"column": "vector", "q": q, "k": 10})
end = time.time()
tot += (end - start)
print(f"Avg(sec): {tot / len(samples)}")
print(tbl.to_pandas())
Expected output:
Avg(sec): 0.0009334301948547364
id vector score
0 378825 [20.0, 8.0, 121.0, 98.0, 37.0, 77.0, 9.0, 18.0... 16560.197266
1 143787 [11.0, 24.0, 122.0, 122.0, 53.0, 4.0, 0.0, 3.0... 61714.941406
2 356895 [0.0, 14.0, 67.0, 122.0, 83.0, 23.0, 1.0, 0.0,... 64147.218750
3 535431 [9.0, 22.0, 118.0, 118.0, 4.0, 5.0, 4.0, 4.0, ... 69092.593750
4 308778 [1.0, 7.0, 48.0, 123.0, 73.0, 36.0, 8.0, 4.0, ... 69131.812500
5 222477 [14.0, 73.0, 39.0, 4.0, 16.0, 94.0, 19.0, 8.0,... 69244.195312
6 672558 [2.0, 1.0, 0.0, 11.0, 36.0, 23.0, 7.0, 10.0, 0... 70264.828125
7 365538 [54.0, 43.0, 97.0, 59.0, 34.0, 17.0, 10.0, 15.... 70273.710938
8 659787 [10.0, 9.0, 23.0, 121.0, 38.0, 26.0, 38.0, 9.0... 70374.703125
9 603930 [32.0, 32.0, 122.0, 122.0, 70.0, 4.0, 15.0, 12... 70583.375000
Performance Note
Your actual numbers will vary by your storage. These numbers are from local disk on an M2 MacBook Air. If you're querying S3 directly, HDD, or network drives, performance will be slower.
Tune the Search Parameters¶
You need to adjust search parameters to balance between speed and accuracy, finding the optimal settings for your use case.
The latency vs recall is tunable via: - nprobes: how many IVF partitions to search - refine_factor: determines how many vectors are retrieved during re-ranking
%%time
sift1m.to_table(
nearest={
"column": "vector",
"q": samples[0],
"k": 10,
"nprobes": 10,
"refine_factor": 5,
}
).to_pandas()
Parameter Explanation:
- q
=> sample vector
- k
=> how many neighbors to return
- nprobes
=> how many partitions (in the coarse quantizer) to probe
- refine_factor
=> controls "re-ranking". If k=10 and refine_factor=5 then retrieve 50 nearest neighbors by ANN and re-sort using actual distances then return top 10. This improves recall without sacrificing performance too much
Memory Usage
The latencies above include file I/O as Lance currently doesn't hold anything in memory. Along with index building speed, creating a purely in-memory version of the dataset would make the biggest impact on performance.
Combine Features and Vectors¶
You can add metadata columns to your vector dataset and query both vectors and features together in a single operation.
In real-life situations, users have other feature or metadata columns that need to be stored and fetched together. If you're managing data and the index separately, you have to do a bunch of annoying plumbing to put stuff together.
With Lance it's a single call:
tbl = sift1m.to_table()
tbl = tbl.append_column("item_id", pa.array(range(len(tbl))))
tbl = tbl.append_column("revenue", pa.array((np.random.randn(len(tbl))+5)*1000))
You can then query both vectors and metadata together:
# Get vectors and metadata together
result = sift1m.to_table(
columns=["item_id", "revenue"],
nearest={"column": "vector", "q": samples[0], "k": 10}
)
print(result.to_pandas())
Next Steps¶
You should check out Versioning Your Datasets with Lance. We'll show you how to version your vector datasets and track changes over time.