Vector Search

A vector search finds the approximate or exact nearest neighbors to a given query vector.

  • In a recommendation system or search engine, you can find records similar to the one you searched for.
  • In LLM and other AI applications, each data point can be represented by an embedding generated by an existing model; the search then returns the most relevant data points.

Distance metrics

A distance metric measures how similar two vectors are. Currently, LanceDB supports the following metrics:

Metric    Description
l2        Euclidean / L2 distance
cosine    Cosine similarity
dot       Dot product
hamming   Hamming distance

Note

The hamming metric is only available for binary vectors.
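
To make the four metrics concrete, here is a minimal pure-Python sketch of their textbook definitions. The function names are illustrative and are not part of the LanceDB API. The `_distance` values LanceDB reports may follow a different convention (for example, a squared L2 distance), so treat this as a reference for the concepts rather than for exact numbers.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity (assumed convention): 0 means same direction."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return 1.0 - dot_product(a, b) / (norm_a * norm_b)


def hamming_distance(a, b):
    """Number of positions at which two bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


print(l2_distance([0.0, 0.0], [3.0, 4.0]))           # 5.0
print(dot_product([1.0, 2.0], [3.0, 4.0]))           # 11.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))       # 1.0 (orthogonal vectors)
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
```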

Exhaustive search (kNN)

If you do not create a vector index, LanceDB exhaustively scans the entire vector space and computes the distance to every vector in order to find the exact nearest neighbors. This is effectively a kNN search.

import lancedb
import numpy as np

# Synchronous API
uri = "data/sample-lancedb"
db = lancedb.connect(uri)
data = [
    {"vector": row, "item": f"item {i}"}
    for i, row in enumerate(np.random.random((10_000, 1536)).astype("float32"))
]
tbl = db.create_table("vector_search", data=data)
tbl.search(np.random.random((1536))).limit(10).to_list()

# Asynchronous API
uri = "data/sample-lancedb"
async_db = await lancedb.connect_async(uri)
data = [
    {"vector": row, "item": f"item {i}"}
    for i, row in enumerate(np.random.random((10_000, 1536)).astype("float32"))
]
async_tbl = await async_db.create_table("vector_search_async", data=data)
(await async_tbl.query().nearest_to(np.random.random((1536))).limit(10).to_list())

// TypeScript (@lancedb/lancedb)
import * as lancedb from "@lancedb/lancedb";

const db = await lancedb.connect(databaseDir);
const tbl = await db.openTable("my_vectors");

const results1 = await tbl.search(Array(128).fill(1.2)).limit(10).toArray();

// TypeScript (legacy vectordb package)
import * as lancedb from "vectordb";

const db = await lancedb.connect("data/sample-lancedb");
const tbl = await db.openTable("my_vectors");

const results_1 = await tbl.search(Array(1536).fill(1.2)).limit(10).execute();
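
Conceptually, the exhaustive scan described above computes one distance per stored vector and keeps the k smallest. The sketch below shows that idea in plain Python. `knn_exhaustive` and `squared_l2` are hypothetical helpers written for illustration, not LanceDB functions, and the squared L2 distance is an assumed choice of metric.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance; ranks neighbors the same way as L2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_exhaustive(vectors, query, k):
    """Scan every vector and return the k closest as (distance, index) pairs."""
    scored = ((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
neighbors = knn_exhaustive(vectors, query=[1.0, 1.0], k=2)
print([index for _, index in neighbors])  # [1, 3]
```

The cost grows linearly with the number of stored vectors, which is why larger tables usually rely on an index.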

By default, l2 is used as the distance metric. You can specify cosine or dot instead if required.

# Synchronous API
tbl.search(np.random.random((1536))).distance_type("cosine").limit(10).to_list()

# Asynchronous API
(
    await async_tbl.query()
    .nearest_to(np.random.random((1536)))
    .distance_type("cosine")
    .limit(10)
    .to_list()
)

// TypeScript (@lancedb/lancedb)
const results2 = await (
  tbl.search(Array(128).fill(1.2)) as lancedb.VectorQuery
)
  .distanceType("cosine")
  .limit(10)
  .toArray();

// TypeScript (legacy vectordb package)
const results_2 = await tbl
  .search(Array(1536).fill(1.2))
  .metricType(lancedb.MetricType.Cosine)
  .limit(10)
  .execute();

To perform scalable vector retrieval with acceptable latencies, it's common to build a vector index. While an exhaustive search always returns the exact nearest neighbors (100% recall), an ANN search is approximate, so using an index involves a trade-off between recall and latency.

See the IVF_PQ index for a deeper description of how IVF_PQ indexes work in LanceDB.
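
One way to quantify the recall side of this trade-off is to compare the ids returned by an indexed search against those returned by an exhaustive search for the same query. The following is a small sketch of that measurement. `recall_at_k` is an illustrative helper, and the id lists are made-up placeholders rather than real query output.

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the true nearest neighbors that the approximate search returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact & set(approx_ids)) / len(exact)


exact_ids = [12, 7, 42, 3, 19]    # hypothetical ids from an exhaustive search
approx_ids = [12, 42, 7, 88, 19]  # hypothetical ids from an indexed (ANN) search
print(recall_at_k(exact_ids, approx_ids))  # 0.8
```

Averaging this value over a sample of queries gives a recall estimate that can be weighed against the measured latency.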

Binary vector

LanceDB supports binary vectors as a data type and can search them with Hamming distance. Binary vectors are stored as uint8 arrays, with every 8 bits packed into one byte:

Note

The dimension of a binary vector must be a multiple of 8. A vector of dimension 128 is stored as a uint8 array of size 16.

# Synchronous API
import lancedb
import numpy as np
import pyarrow as pa

db = lancedb.connect("data/binary_lancedb")
schema = pa.schema(
    [
        pa.field("id", pa.int64()),
        # for dim=256, lance stores every 8 bits in a byte
        # so the vector field should be a list of 256 / 8 = 32 bytes
        pa.field("vector", pa.list_(pa.uint8(), 32)),
    ]
)
tbl = db.create_table("my_binary_vectors", schema=schema)

data = []
for i in range(1024):
    vector = np.random.randint(0, 2, size=256)
    # pack the binary vector into bytes to save space
    packed_vector = np.packbits(vector)
    data.append(
        {
            "id": i,
            "vector": packed_vector,
        }
    )
tbl.add(data)

query = np.random.randint(0, 2, size=256)
packed_query = np.packbits(query)
tbl.search(packed_query).distance_type("hamming").to_arrow()

# Asynchronous API
import lancedb
import numpy as np
import pyarrow as pa

db = await lancedb.connect_async("data/binary_lancedb")
schema = pa.schema(
    [
        pa.field("id", pa.int64()),
        # for dim=256, lance stores every 8 bits in a byte
        # so the vector field should be a list of 256 / 8 = 32 bytes
        pa.field("vector", pa.list_(pa.uint8(), 32)),
    ]
)
tbl = await db.create_table("my_binary_vectors", schema=schema)

data = []
for i in range(1024):
    vector = np.random.randint(0, 2, size=256)
    # pack the binary vector into bytes to save space
    packed_vector = np.packbits(vector)
    data.append(
        {
            "id": i,
            "vector": packed_vector,
        }
    )
await tbl.add(data)

query = np.random.randint(0, 2, size=256)
packed_query = np.packbits(query)
await tbl.query().nearest_to(packed_query).distance_type("hamming").to_arrow()
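
To illustrate the storage layout, the sketch below packs bits into bytes by hand and computes the Hamming distance directly on the packed form. It assumes most-significant-bit-first packing, which is the default bit order of np.packbits. `pack_bits` and `hamming_packed` are illustrative helpers, not LanceDB functions.

```python
def pack_bits(bits):
    """Pack 0/1 values (length a multiple of 8) into bytes, most significant bit first."""
    if len(bits) % 8 != 0:
        raise ValueError("dimension must be a multiple of 8")
    out = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


def hamming_packed(a, b):
    """Hamming distance of two packed vectors: count the set bits of the XOR."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


bits_a = [1, 0, 0, 0, 0, 0, 0, 1] * 16  # dimension 128
bits_b = [1, 0, 0, 0, 0, 0, 0, 0] * 16  # differs in one bit per byte
packed_a, packed_b = pack_bits(bits_a), pack_bits(bits_b)
print(len(packed_a))                       # 16
print(hamming_packed(packed_a, packed_b))  # 16
```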

Multivector type

LanceDB supports a multivector type, which is useful when you have multiple vectors for a single item (e.g. with ColBERT and ColPali).

You can index a multivector column and search on it; the query can be a single vector or multiple vectors. If the query consists of multiple vectors mq, its similarity to any multivector mv in the dataset is defined as the MaxSim score:

$$\mathrm{maxsim}(mq, mv) = \sum_{q \in mq} \max_{v \in mv} \mathrm{sim}(q, v)$$

where sim is the similarity function (e.g. cosine).

For now, only the cosine metric is supported for multivector search. The vector value type can be float16, float32 or float64.
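
As a rough illustration of multivector scoring, the sketch below assumes the ColBERT-style MaxSim definition: for each query vector, take its best similarity against any vector of the stored item, then sum those maxima. The helpers are written for this example only and say nothing about how LanceDB computes the score internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def maxsim(mq, mv, sim=cosine_similarity):
    """Sum, over query vectors, of the best similarity to any vector in mv."""
    return sum(max(sim(q, v) for v in mv) for q in mq)


mq = [[1.0, 0.0], [0.0, 1.0]]  # query with two vectors
mv = [[1.0, 0.0], [0.0, 2.0]]  # stored item with two vectors
print(maxsim(mq, mv))          # 2.0: each query vector has a perfect match
```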

# Synchronous API
import lancedb
import numpy as np
import pyarrow as pa

db = lancedb.connect("data/multivector_demo")
schema = pa.schema(
    [
        pa.field("id", pa.int64()),
        # float16, float32, and float64 are supported
        pa.field("vector", pa.list_(pa.list_(pa.float32(), 256))),
    ]
)
data = [
    {
        "id": i,
        "vector": np.random.random(size=(2, 256)).tolist(),
    }
    for i in range(1024)
]
tbl = db.create_table("my_table", data=data, schema=schema)

# only cosine similarity is supported for multi-vectors
tbl.create_index(metric="cosine")

# query with single vector
query = np.random.random(256).astype(np.float16)
tbl.search(query).to_arrow()

# query with multiple vectors
query = np.random.random(size=(2, 256))
tbl.search(query).to_arrow()

# Asynchronous API
import lancedb
import numpy as np
import pyarrow as pa
from lancedb.index import IvfPq

db = await lancedb.connect_async("data/multivector_demo")
schema = pa.schema(
    [
        pa.field("id", pa.int64()),
        # float16, float32, and float64 are supported
        pa.field("vector", pa.list_(pa.list_(pa.float32(), 256))),
    ]
)
data = [
    {
        "id": i,
        "vector": np.random.random(size=(2, 256)).tolist(),
    }
    for i in range(1024)
]
tbl = await db.create_table("my_table", data=data, schema=schema)

# only cosine similarity is supported for multi-vectors
await tbl.create_index(column="vector", config=IvfPq(distance_type="cosine"))

# query with single vector
query = np.random.random(256)
await tbl.query().nearest_to(query).to_arrow()

# query with multiple vectors
query = np.random.random(size=(2, 256))
await tbl.query().nearest_to(query).to_arrow()

Search with distance range

You can also search for vectors whose distance from the query vector falls within a specific range. This is useful when you want to bound results by distance rather than only take the nearest neighbors. Use the distance_range method to set the bounds.

# Synchronous API
import lancedb
import numpy as np

db = lancedb.connect("data/distance_range_demo")
data = [
    {
        "id": i,
        "vector": np.random.random(256),
    }
    for i in range(1024)
]
tbl = db.create_table("my_table", data=data)
query = np.random.random(256)

# Search for the vectors within the range of [0.1, 0.5)
tbl.search(query).distance_range(0.1, 0.5).to_arrow()

# Search for the vectors with the distance less than 0.5
tbl.search(query).distance_range(upper_bound=0.5).to_arrow()

# Search for the vectors with the distance greater or equal to 0.1
tbl.search(query).distance_range(lower_bound=0.1).to_arrow()

# Asynchronous API
import lancedb
import numpy as np

db = await lancedb.connect_async("data/distance_range_demo")
data = [
    {
        "id": i,
        "vector": np.random.random(256),
    }
    for i in range(1024)
]
tbl = await db.create_table("my_table", data=data)
query = np.random.random(256)

# Search for the vectors within the range of [0.1, 0.5)
await tbl.query().nearest_to(query).distance_range(0.1, 0.5).to_arrow()

# Search for the vectors with the distance less than 0.5
await tbl.query().nearest_to(query).distance_range(upper_bound=0.5).to_arrow()

# Search for the vectors with the distance greater or equal to 0.1
await tbl.query().nearest_to(query).distance_range(lower_bound=0.1).to_arrow()

// TypeScript (@lancedb/lancedb)
import * as lancedb from "@lancedb/lancedb";

const results3 = await (
  tbl.search(Array(128).fill(1.2)) as lancedb.VectorQuery
)
  .distanceType("cosine")
  .distanceRange(0.1, 0.2)
  .limit(10)
  .toArray();
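
The comments in the Python examples describe a half-open interval: the lower bound is inclusive and the upper bound is exclusive. The sketch below applies the same rule to a list of result rows in plain Python. `in_distance_range` is a hypothetical helper and the rows are made-up data, shown only to clarify the boundary behavior.

```python
def in_distance_range(rows, lower_bound=None, upper_bound=None):
    """Keep rows whose _distance d satisfies lower_bound <= d < upper_bound."""
    kept = []
    for row in rows:
        d = row["_distance"]
        if lower_bound is not None and d < lower_bound:
            continue
        if upper_bound is not None and d >= upper_bound:
            continue
        kept.append(row)
    return kept


rows = [
    {"id": 0, "_distance": 0.05},
    {"id": 1, "_distance": 0.1},
    {"id": 2, "_distance": 0.3},
    {"id": 3, "_distance": 0.5},
]
print([r["id"] for r in in_distance_range(rows, 0.1, 0.5)])         # [1, 2]
print([r["id"] for r in in_distance_range(rows, upper_bound=0.5)])  # [0, 1, 2]
print([r["id"] for r in in_distance_range(rows, lower_bound=0.1)])  # [1, 2, 3]
```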

Output search results

LanceDB can return vector search results in several formats commonly used in Python. Let's create a LanceDB table with a nested schema:

from datetime import datetime

import lancedb
import numpy as np
from lancedb.pydantic import LanceModel, Vector
from pydantic import BaseModel

class Metadata(BaseModel):
    source: str
    timestamp: datetime


class Document(BaseModel):
    content: str
    meta: Metadata


class LanceSchema(LanceModel):
    id: str
    vector: Vector(1536)
    payload: Document


# Let's add 100 sample rows to our dataset
data = [
    LanceSchema(
        id=f"id{i}",
        vector=np.random.randn(1536),
        payload=Document(
            content=f"document{i}",
            meta=Metadata(source=f"source{i % 10}", timestamp=datetime.now()),
        ),
    )
    for i in range(100)
]

# Synchronous client
tbl = db.create_table("documents", data=data)

# Asynchronous client
from datetime import datetime

import lancedb
import numpy as np
from lancedb.pydantic import LanceModel, Vector
from pydantic import BaseModel

class Metadata(BaseModel):
    source: str
    timestamp: datetime


class Document(BaseModel):
    content: str
    meta: Metadata


class LanceSchema(LanceModel):
    id: str
    vector: Vector(1536)
    payload: Document


# Let's add 100 sample rows to our dataset
data = [
    LanceSchema(
        id=f"id{i}",
        vector=np.random.randn(1536),
        payload=Document(
            content=f"document{i}",
            meta=Metadata(source=f"source{i % 10}", timestamp=datetime.now()),
        ),
    )
    for i in range(100)
]

async_tbl = await async_db.create_table("documents_async", data=data)

As a PyArrow table

Using to_arrow(), we can get the results back as a PyArrow Table. The result table has the same columns as the LanceDB table, plus a _distance column for vector search or a _score column for full-text search.

# Synchronous API
tbl.search(np.random.randn(1536)).to_arrow()

# Asynchronous API
await async_tbl.query().nearest_to(np.random.randn(1536)).to_arrow()

As a Pandas DataFrame

You can also get the results as a pandas DataFrame.

# Synchronous API
tbl.search(np.random.randn(1536)).to_pandas()

# Asynchronous API
await async_tbl.query().nearest_to(np.random.randn(1536)).to_pandas()

Formats like Arrow, Pydantic, and Python dicts handle nested schemas naturally, but pandas can only store nested data as a column of Python dicts, which makes nested fields awkward to reference. For convenience, you can tell LanceDB to flatten a nested schema when creating the pandas DataFrame.

tbl.search(np.random.randn(1536)).to_pandas(flatten=True)

If your table has a deeply nested struct, you can control how many levels of nesting to flatten by passing in a positive integer.

tbl.search(np.random.randn(1536)).to_pandas(flatten=1)

Note

flatten is not yet supported with our asynchronous client.
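
To show what flattening means for a nested row, here is a small pure-Python sketch that expands nested dicts into top-level keys, optionally stopping after a given number of levels. The dotted naming (for example `payload.meta.source`) is an assumption made for this illustration and may not match the column names LanceDB produces. `flatten_row` is a hypothetical helper.

```python
def flatten_row(row, max_level=None, _prefix="", _level=0):
    """Expand nested dicts into dotted keys, descending at most max_level levels."""
    flat = {}
    for key, value in row.items():
        name = f"{_prefix}{key}"
        can_descend = max_level is None or _level < max_level
        if isinstance(value, dict) and can_descend:
            flat.update(flatten_row(value, max_level, f"{name}.", _level + 1))
        else:
            flat[name] = value
    return flat


row = {
    "id": "id0",
    "payload": {"content": "document0", "meta": {"source": "source0"}},
}
print(flatten_row(row))
# {'id': 'id0', 'payload.content': 'document0', 'payload.meta.source': 'source0'}
print(flatten_row(row, max_level=1))
# {'id': 'id0', 'payload.content': 'document0', 'payload.meta': {'source': 'source0'}}
```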

As a list of Python dicts

You can also return results as a list of Python dicts.

# Synchronous API
tbl.search(np.random.randn(1536)).to_list()

# Asynchronous API
await async_tbl.query().nearest_to(np.random.randn(1536)).to_list()

As a list of Pydantic models

Just as we can add data using Pydantic models, we can retrieve results as Pydantic models:

tbl.search(np.random.randn(1536)).to_pydantic(LanceSchema)

Note

to_pydantic() is not yet supported with our asynchronous client.

Note that in this case the extra _distance field is discarded since it's not part of the LanceSchema.