Python APIs

Lance is a columnar format that is specifically designed for efficient multi-modal data processing.

Lance Dataset

The core of Lance is the LanceDataset class. Users can open a dataset with lance.dataset().

lance.dataset(uri: str | Path, version: int | str | None = None, asof: ts_types | None = None, block_size: int | None = None, commit_lock: CommitLock | None = None, index_cache_size: int | None = None, storage_options: Dict[str, str] | None = None, default_scan_options: Dict[str, str] | None = None) LanceDataset

Opens the Lance dataset from the address specified.

Parameters:
  • uri (str) – Address to the Lance dataset. It can be a local file path /tmp/data.lance, or a cloud object store URI, e.g., s3://bucket/data.lance.

  • version (optional, int | str) – If specified, load a specific version of the Lance dataset. Else, loads the latest version. A version number (int) or a tag (str) can be provided.

  • asof (optional, datetime or str) – If specified, find the latest version created on or earlier than the given argument value. If a version is already specified, this arg is ignored.

  • block_size (optional, int) – Block size in bytes. Provide a hint for the size of the minimal I/O request.

  • commit_lock (optional, lance.commit.CommitLock) – A custom commit lock. Only needed if your object store does not support atomic commits. See the user guide for more details.

  • index_cache_size (optional, int) –

    Index cache size. The index cache is an LRU cache with TTL. This number specifies the number of index pages, for example, IVF partitions, to be cached in host memory. Default value is 256.

    Roughly, for an IVF_PQ partition with n rows, the size of each index page equals the combination of the PQ codes (an array of shape [n, pq] with dtype uint8) and the row ids (an array of shape [n] with dtype uint64). Approximately, n = Total Rows / number of IVF partitions, and pq = number of PQ sub-vectors.

  • storage_options (optional, dict) – Extra options that make sense for a particular storage connection. This is used to store connection parameters like credentials, endpoint, etc.

  • default_scan_options (optional, dict) –

    Default scan options that are used when scanning the dataset. This accepts the same arguments described in lance.LanceDataset.scanner(). The arguments will be applied to any scan operation.

    This can be useful to supply defaults for common parameters such as batch_size.

    It can also be used to create a view of the dataset that includes meta fields such as _rowid or _rowaddr. If default_scan_options is provided then the schema returned by lance.LanceDataset.schema() will include these fields if the appropriate scan options are set.
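For example, a minimal sketch of opening a dataset (the path, version number, tag, and timestamp below are hypothetical):

import lance
from datetime import datetime, timezone

# Open the latest version of a local dataset
ds = lance.dataset("/tmp/data.lance")

# Pin to a specific version number, or to a tag
ds_v2 = lance.dataset("/tmp/data.lance", version=2)
ds_tagged = lance.dataset("/tmp/data.lance", version="stable")

# Time travel: open the latest version created on or before a timestamp
ds_asof = lance.dataset(
    "/tmp/data.lance",
    asof=datetime(2024, 1, 1, tzinfo=timezone.utc),
)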

Basic IOs

The following functions are used to read and write data in Lance format.

LanceDataset.insert(data: ReaderLike, *, mode='append', **kwargs)

Insert data into the dataset.

Parameters:
  • data (ReaderLike) – The data to be written. Acceptable types are: pandas DataFrame, PyArrow Table, Dataset, Scanner, or RecordBatchReader; or a Hugging Face dataset.

  • mode (str, default 'append') –

    The mode to use when writing the data. Options are:

    • create - create a new dataset (raises if uri already exists).

    • overwrite - create a new snapshot version.

    • append - create a new version that is the concatenation of the input and the latest version (raises if uri does not exist).

  • **kwargs (dict, optional) – Additional keyword arguments to pass to write_dataset().
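A short sketch of the modes, assuming a dataset created with lance.write_dataset() (the path is hypothetical):

import lance
import pyarrow as pa

dataset = lance.write_dataset(
    pa.table({"a": [1, 2, 3]}), "/tmp/example.lance", mode="create"
)

# Append two rows as a new version
dataset.insert(pa.table({"a": [4, 5]}), mode="append")

# Replace the contents with a new snapshot version
dataset.insert(pa.table({"a": [9]}), mode="overwrite")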

LanceDataset.scanner(columns: List[str] | Dict[str, str] | None = None, filter: str | Expression | None = None, limit: int | None = None, offset: int | None = None, nearest: dict | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, scan_in_order: bool | None = None, fragments: Iterable[LanceFragment] | None = None, full_text_query: str | dict | None = None, *, prefilter: bool | None = None, with_row_id: bool | None = None, with_row_address: bool | None = None, use_stats: bool | None = None, fast_search: bool | None = None, io_buffer_size: int | None = None, late_materialization: bool | List[str] | None = None, use_scalar_index: bool | None = None) LanceScanner

Return a Scanner that can support various pushdowns.

Parameters:
  • columns (list of str, or dict of str to str, default None) – List of column names to be fetched. Or a dictionary of column names to SQL expressions. All columns are fetched if None or unspecified.

  • filter (pa.compute.Expression or str) – Expression or str that is a valid SQL where clause. See Lance filter pushdown for valid SQL expressions.

  • limit (int, default None) – Fetch up to this many rows. All rows if None or unspecified.

  • offset (int, default None) – Fetch starting with this row. 0 if None or unspecified.

  • nearest (dict, default None) –

    Get the rows corresponding to the K most similar vectors. Example:

    {
        "column": <embedding col name>,
        "q": <query vector as pa.Float32Array>,
        "k": 10,
        "nprobes": 1,
        "refine_factor": 1
    }
    

  • batch_size (int, default None) – The target size of batches returned. In some cases batches can be up to twice this size (but never larger than that); batches can also be smaller than the target size.

  • io_buffer_size (int, default None) – The size of the IO buffer. See ScannerBuilder.io_buffer_size for more information.

  • batch_readahead (int, optional) – The number of batches to read ahead.

  • fragment_readahead (int, optional) – The number of fragments to read ahead.

  • scan_in_order (bool, default True) – Whether to read the fragments and batches in order. If false, throughput may be higher, but batches will be returned out of order and memory use might increase.

  • fragments (iterable of LanceFragment, default None) – If specified, only scan these fragments. If scan_in_order is True, then the fragments will be scanned in the order given.

  • prefilter (bool, default False) –

    If True then the filter will be applied before the vector query is run. This will generate more correct results but it may be a more costly query. It’s generally good when the filter is highly selective.

    If False then the filter will be applied after the vector query is run. This will perform well but the results may have fewer than the requested number of rows (or be empty) if the rows closest to the query do not match the filter. It’s generally good when the filter is not very selective.

  • use_scalar_index (bool, default True) – Lance will automatically use scalar indices to optimize a query. In some corner cases this can make query performance worse and this parameter can be used to disable scalar indices in these cases.

  • late_materialization (bool or List[str], default None) –

    Allows custom control over late materialization. Late materialization fetches non-query columns using a take operation after the filter. This is useful when there are few results or columns are very large.

    Early materialization can be better when there are many results or the columns are very narrow.

    If True, then all columns are late materialized. If False, then all columns are early materialized. If a list of strings, then only the columns in the list are late materialized.

    The default uses a heuristic that assumes filters will select about 0.1% of the rows. If your filter is more selective (e.g. find by id) you may want to set this to True. If your filter is not very selective (e.g. matches 20% of the rows) you may want to set this to False.

  • full_text_query (str or dict, optional) –

    Query string to search for; the results will be ranked by BM25. For example, “hello world” would match documents containing “hello” or “world”. Alternatively, a dictionary with the following keys:

    • columns: list[str]

      The columns to search, currently only supports a single column in the columns list.

    • query: str

      The query string to search for.

  • fast_search (bool, default False) – If True, then the search will only be performed on the indexed data, which yields faster search time.

Notes

For now, if BOTH filter and nearest are specified, then:

  1. nearest is executed first.

  2. The results are filtered afterwards.

For debugging ANN results, you can choose not to use the index even if one is present by specifying use_index=False. For example, the following will always return exact KNN results:

dataset.to_table(nearest={
    "column": "vector",
    "k": 10,
    "q": <query vector>,
    "use_index": False
})
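A hedged example of building a scanner with a projection, filter, and batch size (the path and column names are hypothetical; the full-text query assumes an FTS index exists on the text column):

import lance

dataset = lance.dataset("/tmp/data.lance")
scanner = dataset.scanner(
    columns=["id", "text"],
    filter="id > 100",
    batch_size=1024,
)
table = scanner.to_table()  # materialize all results at once

# Streaming alternative:
for batch in dataset.scanner(columns=["id"], batch_size=1024).to_batches():
    ...

# Full-text search, ranked by BM25
matches = dataset.scanner(full_text_query="hello world").to_table()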
LanceDataset.to_batches(columns: List[str] | Dict[str, str] | None = None, filter: str | Expression | None = None, limit: int | None = None, offset: int | None = None, nearest: dict | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, scan_in_order: bool | None = None, *, prefilter: bool | None = None, with_row_id: bool | None = None, with_row_address: bool | None = None, use_stats: bool | None = None, full_text_query: str | dict | None = None, io_buffer_size: int | None = None, late_materialization: bool | List[str] | None = None, use_scalar_index: bool | None = None, **kwargs) Iterator[RecordBatch]

Read the dataset as materialized record batches.

Parameters:

**kwargs (dict, optional) – Arguments for Scanner.from_dataset.

Returns:

record_batches

Return type:

Iterator of RecordBatch
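For instance, a sketch of a streaming scan (the column name is hypothetical):

for batch in dataset.to_batches(columns=["a"], batch_size=4096):
    # each item is a pyarrow.RecordBatch; sizes follow batch_size as a target
    print(batch.num_rows)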

LanceDataset.to_table(columns: List[str] | Dict[str, str] | None = None, filter: str | Expression | None = None, limit: int | None = None, offset: int | None = None, nearest: dict | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, scan_in_order: bool | None = None, *, prefilter: bool | None = None, with_row_id: bool | None = None, with_row_address: bool | None = None, use_stats: bool | None = None, fast_search: bool | None = None, full_text_query: str | dict | None = None, io_buffer_size: int | None = None, late_materialization: bool | List[str] | None = None, use_scalar_index: bool | None = None) Table

Read the data into memory as a pyarrow.Table.

Parameters:
  • columns (list of str, or dict of str to str, default None) – List of column names to be fetched. Or a dictionary of column names to SQL expressions. All columns are fetched if None or unspecified.

  • filter (pa.compute.Expression or str) –

    Expression or str that is a valid SQL where clause. See Lance filter pushdown for valid SQL expressions.

  • limit (int, default None) – Fetch up to this many rows. All rows if None or unspecified.

  • offset (int, default None) – Fetch starting with this row. 0 if None or unspecified.

  • nearest (dict, default None) –

    Get the rows corresponding to the K most similar vectors. Example:

    {
        "column": <embedding col name>,
        "q": <query vector as pa.Float32Array>,
        "k": 10,
        "metric": "cosine",
        "nprobes": 1,
        "refine_factor": 1
    }
    

  • batch_size (int, optional) – The number of rows to read at a time.

  • io_buffer_size (int, default None) – The size of the IO buffer. See ScannerBuilder.io_buffer_size for more information.

  • batch_readahead (int, optional) – The number of batches to read ahead.

  • fragment_readahead (int, optional) – The number of fragments to read ahead.

  • scan_in_order (bool, optional, default True) – Whether to read the fragments and batches in order. If false, throughput may be higher, but batches will be returned out of order and memory use might increase.

  • prefilter (bool, optional, default False) – Run filter before the vector search.

  • late_materialization (bool or List[str], default None) – Allows custom control over late materialization. See ScannerBuilder.late_materialization for more information.

  • use_scalar_index (bool, default True) – Allows custom control over scalar index usage. See ScannerBuilder.use_scalar_index for more information.

  • with_row_id (bool, optional, default False) – Return row ID.

  • with_row_address (bool, optional, default False) – Return row address.

  • use_stats (bool, optional, default True) – Use stats pushdown during filters.

  • fast_search (bool, optional, default False) – If True, then the search will only be performed on the indexed data, which yields faster search time.

  • full_text_query (str or dict, optional) –

    Query string to search for; the results will be ranked by BM25. For example, “hello world” would match documents containing “hello” or “world”. Alternatively, a dictionary with the following keys:

    • columns: list[str]

      The columns to search, currently only supports a single column in the columns list.

    • query: str

      The query string to search for.

Notes

If BOTH filter and nearest are specified, then:

  1. nearest is executed first.

  2. The results are filtered afterward, unless prefilter is set to True.
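As an illustration, a hedged sketch of a filtered vector search (the column names and query vector are hypothetical):

import pyarrow as pa

query = pa.array([0.1, 0.2, 0.3, 0.4], type=pa.float32())
table = dataset.to_table(
    columns=["id"],
    filter="id > 100",
    prefilter=True,  # apply the filter before the vector search
    nearest={"column": "vector", "q": query, "k": 10},
)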

Random Access

Lance stands out from other columnar formats with its very fast random access.

LanceDataset.take(indices: List[int] | Array, columns: List[str] | Dict[str, str] | None = None, **kwargs) Table

Select rows of data by index.

Parameters:
  • indices (Array or array-like) – Indices of rows to select in the dataset.

  • columns (list of str, or dict of str to str, default None) – List of column names to be fetched. Or a dictionary of column names to SQL expressions. All columns are fetched if None or unspecified.

  • **kwargs (dict, optional) – See the scanner() method for a full parameter description.

Returns:

table

Return type:

pyarrow.Table
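For instance (the column name is hypothetical):

# Fetch three specific rows by index, projecting a single column
table = dataset.take([0, 10, 100], columns=["a"])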

LanceDataset.take_blobs(row_ids: List[int] | Array, blob_column: str) List[BlobFile]

Select blobs by row IDs.

Instead of loading large binary blob data into memory before processing it, this API allows you to open binary blob data as a regular Python file-like object. For more details, see lance.BlobFile.

Parameters:
  • row_ids (List, Array, or array-like) – Row IDs to select in the dataset.

  • blob_column (str) – The name of the blob column to select.

Returns:

blob_files

Return type:

List[BlobFile]
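A sketch, assuming the dataset has a blob-encoded column named "video" (hypothetical):

blobs = dataset.take_blobs([0, 1], blob_column="video")
with blobs[0] as f:
    header = f.read(16)  # reads lazily, like a regular file object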

Schema Evolution

Lance supports schema evolution, which means that you can add new columns to the dataset cheaply.

LanceDataset.add_columns(transforms: Dict[str, str] | BatchUDF | ReaderLike, read_columns: List[str] | None = None, reader_schema: pa.Schema | None = None, batch_size: int | None = None)

Add new columns with defined values.

There are several ways to specify the new columns. First, you can provide SQL expressions for each new column. Second, you can provide a UDF that takes a batch of existing data and returns a new batch with the new columns. These new columns will be appended to the dataset.

You can also provide a RecordBatchReader which will read the new column values from some external source. This is often useful when the new column values have already been staged to files (often by some distributed process).

See the lance.add_columns_udf() decorator for more information on writing UDFs.

Parameters:
  • transforms (dict or AddColumnsUDF or ReaderLike) – If this is a dictionary, then the keys are the names of the new columns and the values are SQL expression strings. These strings can reference existing columns in the dataset. If this is an AddColumnsUDF, then it is a UDF that takes a batch of existing data and returns a new batch with the new columns.

  • read_columns (list of str, optional) – The names of the columns that the UDF will read. If None, then the UDF will read all columns. This is only used when transforms is a UDF. Otherwise, the read columns are inferred from the SQL expressions.

  • reader_schema (pa.Schema, optional) – Only valid if transforms is a ReaderLike object. This will be used to determine the schema of the reader.

  • batch_size (int, optional) – The number of rows to read at a time from the source dataset when applying the transform. This is ignored if the dataset is a v1 dataset.

Examples

>>> import lance
>>> import pandas as pd
>>> import pyarrow as pa
>>> table = pa.table({"a": [1, 2, 3]})
>>> dataset = lance.write_dataset(table, "my_dataset")
>>> @lance.batch_udf()
... def double_a(batch):
...     df = batch.to_pandas()
...     return pd.DataFrame({'double_a': 2 * df['a']})
>>> dataset.add_columns(double_a)
>>> dataset.to_table().to_pandas()
   a  double_a
0  1         2
1  2         4
2  3         6
>>> dataset.add_columns({"triple_a": "a * 3"})
>>> dataset.to_table().to_pandas()
   a  double_a  triple_a
0  1         2         3
1  2         4         6
2  3         6         9
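The ReaderLike path described above might look like the following sketch; it assumes the new column's values (the hypothetical quadruple_a) were precomputed, e.g. by a distributed job, and arrive aligned row-for-row with the dataset:

>>> precomputed = pa.table({"quadruple_a": [4, 8, 12]})
>>> reader = pa.RecordBatchReader.from_batches(
...     precomputed.schema, precomputed.to_batches()
... )
>>> dataset.add_columns(reader)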

See also

LanceDataset.merge

Merge a pre-computed set of columns into the dataset.

LanceDataset.drop_columns(columns: List[str])

Drop one or more columns from the dataset.

This is a metadata-only operation and does not remove the data from the underlying storage. In order to remove the data, you must subsequently call compact_files to rewrite the data without the removed columns and then call cleanup_old_versions to remove the old files.

Parameters:

columns (list of str) – The names of the columns to drop. These can be nested column references (e.g. “a.b.c”) or top-level column names (e.g. “a”).

Examples

>>> import lance
>>> import pyarrow as pa
>>> table = pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]})
>>> dataset = lance.write_dataset(table, "example")
>>> dataset.drop_columns(["a"])
>>> dataset.to_table().to_pandas()
   b
0  a
1  b
2  c
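To actually reclaim storage afterwards (per the note above), a sketch assuming the current compaction and cleanup APIs:

from datetime import timedelta

# Rewrite the data files without the dropped column
dataset.optimize.compact_files()
# Then delete the old version files
dataset.cleanup_old_versions(older_than=timedelta(0))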

Indexing and Searching

LanceDataset.create_index(column: str | List[str], index_type: str, name: str | None = None, metric: str = 'L2', replace: bool = False, num_partitions: int | None = None, ivf_centroids: np.ndarray | pa.FixedSizeListArray | pa.FixedShapeTensorArray | None = None, pq_codebook: np.ndarray | pa.FixedSizeListArray | pa.FixedShapeTensorArray | None = None, num_sub_vectors: int | None = None, accelerator: str | 'torch.Device' | None = None, index_cache_size: int | None = None, shuffle_partition_batches: int | None = None, shuffle_partition_concurrency: int | None = None, ivf_centroids_file: str | None = None, precomputed_partition_dataset: str | None = None, storage_options: Dict[str, str] | None = None, filter_nan: bool = True, one_pass_ivfpq: bool = False, **kwargs) LanceDataset

Create an index on a column.

Experimental API

Parameters:
  • column (str) – The column to be indexed.

  • index_type (str) – The type of the index. "IVF_PQ", "IVF_HNSW_PQ", and "IVF_HNSW_SQ" are supported now.

  • name (str, optional) – The index name. If not provided, it will be generated from the column name.

  • metric (str) – The distance metric type, i.e., “L2” (alias to “euclidean”), “cosine” or “dot” (dot product). Default is “L2”.

  • replace (bool) – Replace the existing index if it exists.

  • num_partitions (int, optional) – The number of partitions of IVF (Inverted File Index).

  • ivf_centroids (optional) – It can be either np.ndarray, pyarrow.FixedSizeListArray or pyarrow.FixedShapeTensorArray. A num_partitions x dimension array of existing K-means centroids for IVF clustering. If not provided, a new KMeans model will be trained.

  • pq_codebook (optional,) –

    It can be np.ndarray, pyarrow.FixedSizeListArray, or pyarrow.FixedShapeTensorArray. A num_sub_vectors x (2 ^ nbits * dimensions // num_sub_vectors) array of K-means centroids for the PQ codebook.

    Note: nbits is always 8 for now. If not provided, a new PQ model will be trained.

  • num_sub_vectors (int, optional) – The number of sub-vectors for PQ (Product Quantization).

  • accelerator (str or torch.Device, optional) – If set, use an accelerator to speed up the training process. Accepted accelerator: “cuda” (Nvidia GPU) and “mps” (Apple Silicon GPU). If not set, use the CPU.

  • index_cache_size (int, optional) – The size of the index cache in number of entries. Default value is 256.

  • shuffle_partition_batches (int, optional) –

    The number of batches, using the row group size of the dataset, to include in each shuffle partition. Default value is 10240.

    Assuming the row group size is 1024, each shuffle partition will hold 10240 * 1024 = 10,485,760 rows. By making this value smaller, this shuffle will consume less memory but will take longer to complete, and vice versa.

  • shuffle_partition_concurrency (int, optional) –

    The number of shuffle partitions to process concurrently. Default value is 2.

    By making this value smaller, this shuffle will consume less memory but will take longer to complete, and vice versa.

  • storage_options (optional, dict) – Extra options that make sense for a particular storage connection. This is used to store connection parameters like credentials, endpoint, etc.

  • filter_nan (bool) – Defaults to True. If False, the null filter used for nullable columns is disabled, which gives a small speed boost but is UNSAFE: it will cause a crash if any null/NaN values are present.

  • one_pass_ivfpq (bool) – Defaults to False. If enabled, index type must be “IVF_PQ”. Reduces disk IO.

  • kwargs – Parameters passed to the index building process.

SQ (Scalar Quantization) is only available for the IVF_HNSW_SQ index type. This quantization method reduces the memory usage of the index by mapping float vectors to integer vectors, where each integer uses num_bits; only 8 bits are currently supported.

If index_type is “IVF_*”, then the following parameters are required:

num_partitions

If index_type is with “PQ”, then the following parameters are required:

num_sub_vectors

Optional parameters for IVF_PQ:

  • ivf_centroids

    Existing K-means centroids for IVF clustering.

  • num_bits

    The number of bits for PQ (Product Quantization). Default is 8. Only 4 and 8 are supported.

Optional parameters for IVF_HNSW_*:

  • max_level

    Int, the maximum number of levels in the graph.

  • m

    Int, the number of edges per node in the graph.

  • ef_construction

    Int, the number of nodes to examine during construction.

Examples

import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_PQ",
    num_partitions=256,
    num_sub_vectors=16
)

import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_HNSW_SQ",
    num_partitions=256,
)

Experimental Accelerator (GPU) support:

  • accelerator: use a GPU to train IVF partitions.

    Only CUDA (Nvidia) and MPS (Apple) are currently supported. Requires PyTorch to be installed.

import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_PQ",
    num_partitions=256,
    num_sub_vectors=16,
    accelerator="cuda"
)


API Reference

More information can be found in the API reference.