Python APIs

Lance is a columnar format that is specifically designed for efficient multi-modal data processing.

Lance Dataset

The core of Lance is the LanceDataset class. Users can open a dataset with lance.dataset().

lance.dataset(uri: str | Path, version: int | str | None = None, asof: ts_types | None = None, block_size: int | None = None, commit_lock: CommitLock | None = None, index_cache_size: int | None = None, storage_options: Dict[str, str] | None = None, default_scan_options: Dict[str, str] | None = None) LanceDataset

Opens the Lance dataset from the address specified.

Parameters:
  • uri (str) – Address to the Lance dataset. It can be a local file path /tmp/data.lance, or a cloud object store URI, e.g., s3://bucket/data.lance.

  • version (optional, int | str) – If specified, load a specific version of the Lance dataset. Else, loads the latest version. A version number (int) or a tag (str) can be provided.

  • asof (optional, datetime or str) – If specified, find the latest version created on or earlier than the given argument value. If a version is already specified, this arg is ignored.

  • block_size (optional, int) – Block size in bytes. Provide a hint for the size of the minimal I/O request.

  • commit_lock (optional, lance.commit.CommitLock) – A custom commit lock. Only needed if your object store does not support atomic commits. See the user guide for more details.

  • index_cache_size (optional, int) –

    Index cache size. The index cache is an LRU cache with TTL. This number specifies the number of index pages, for example, IVF partitions, to be cached in host memory. Default value is 256.

    Roughly, for an IVF_PQ partition with n rows, the size of each index page equals the combination of the PQ codes (an array of shape [n, pq] with dtype uint8) and the row ids (an array of shape [n] with dtype uint64). Approximately, n = Total Rows / number of IVF partitions, and pq = number of PQ sub-vectors.

  • storage_options (optional, dict) – Extra options that make sense for a particular storage connection. This is used to store connection parameters like credentials, endpoint, etc.

  • default_scan_options (optional, dict) –

    Default scan options that are used when scanning the dataset. This accepts the same arguments described in lance.LanceDataset.scanner(). The arguments will be applied to any scan operation.

    This can be useful to supply defaults for common parameters such as batch_size.

    It can also be used to create a view of the dataset that includes meta fields such as _rowid or _rowaddr. If default_scan_options is provided then the schema returned by lance.LanceDataset.schema() will include these fields if the appropriate scan options are set.
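For example, a minimal sketch of opening a dataset (the path, version number, tag, and timestamp below are hypothetical):

import lance
from datetime import datetime, timezone

# Open the latest version of a local dataset
ds = lance.dataset("/tmp/data.lance")

# Pin to a specific version number, or to a tag
ds_v2 = lance.dataset("/tmp/data.lance", version=2)
ds_tagged = lance.dataset("/tmp/data.lance", version="stable")

# Time travel: open the latest version created on or before a timestamp
ds_asof = lance.dataset(
    "/tmp/data.lance",
    asof=datetime(2024, 1, 1, tzinfo=timezone.utc),
)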

Basic IOs

The following functions are used to read and write data in Lance format.

LanceDataset.insert(data: ReaderLike, *, mode='append', **kwargs)

Insert data into the dataset.

Parameters:
  • data (ReaderLike) – The data to be written. Acceptable types are: pandas DataFrame, PyArrow Table, Dataset, Scanner, or RecordBatchReader; or a Hugging Face dataset.

  • mode (str, default 'append') –

    The mode to use when writing the data. Options are:

    • create - create a new dataset (raises if uri already exists).

    • overwrite - create a new snapshot version.

    • append - create a new version that is the concatenation of the input and the latest version (raises if uri does not exist).

  • **kwargs (dict, optional) – Additional keyword arguments to pass to write_dataset().
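A short sketch of the modes, assuming a dataset created with lance.write_dataset() (the path is hypothetical):

import lance
import pyarrow as pa

dataset = lance.write_dataset(
    pa.table({"a": [1, 2, 3]}), "/tmp/example.lance", mode="create"
)

# Append two rows as a new version
dataset.insert(pa.table({"a": [4, 5]}), mode="append")

# Replace the contents with a new snapshot version
dataset.insert(pa.table({"a": [9]}), mode="overwrite")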

LanceDataset.scanner(columns: List[str] | Dict[str, str] | None = None, filter: str | Expression | None = None, limit: int | None = None, offset: int | None = None, nearest: dict | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, scan_in_order: bool | None = None, fragments: Iterable[LanceFragment] | None = None, full_text_query: str | dict | None = None, *, prefilter: bool | None = None, with_row_id: bool | None = None, with_row_address: bool | None = None, use_stats: bool | None = None, fast_search: bool | None = None, io_buffer_size: int | None = None, late_materialization: bool | List[str] | None = None, use_scalar_index: bool | None = None) LanceScanner

Return a Scanner that can support various pushdowns.

Parameters:
  • columns (list of str, or dict of str to str, default None) – List of column names to be fetched. Or a dictionary of column names to SQL expressions. All columns are fetched if None or unspecified.

  • filter (pa.compute.Expression or str) – Expression or str that is a valid SQL where clause. See Lance filter pushdown for valid SQL expressions.

  • limit (int, default None) – Fetch up to this many rows. All rows if None or unspecified.

  • offset (int, default None) – Fetch starting with this row. 0 if None or unspecified.

  • nearest (dict, default None) –

    Get the rows corresponding to the K most similar vectors. Example:

    {
        "column": <embedding col name>,
        "q": <query vector as pa.Float32Array>,
        "k": 10,
        "nprobes": 1,
        "refine_factor": 1
    }
    

  • batch_size (int, default None) – The target size of batches returned. In some cases batches can be up to twice this size (but never larger than that); batches can also be smaller than the target size.

  • io_buffer_size (int, default None) – The size of the IO buffer. See ScannerBuilder.io_buffer_size for more information.

  • batch_readahead (int, optional) – The number of batches to read ahead.

  • fragment_readahead (int, optional) – The number of fragments to read ahead.

  • scan_in_order (bool, default True) – Whether to read the fragments and batches in order. If false, throughput may be higher, but batches will be returned out of order and memory use might increase.

  • fragments (iterable of LanceFragment, default None) – If specified, only scan these fragments. If scan_in_order is True, then the fragments will be scanned in the order given.

  • prefilter (bool, default False) –

    If True then the filter will be applied before the vector query is run. This will generate more correct results but it may be a more costly query. It’s generally good when the filter is highly selective.

    If False then the filter will be applied after the vector query is run. This will perform well but the results may have fewer than the requested number of rows (or be empty) if the rows closest to the query do not match the filter. It’s generally good when the filter is not very selective.

  • use_scalar_index (bool, default True) – Lance will automatically use scalar indices to optimize a query. In some corner cases this can make query performance worse and this parameter can be used to disable scalar indices in these cases.

  • late_materialization (bool or List[str], default None) –

    Allows custom control over late materialization. Late materialization fetches non-query columns using a take operation after the filter. This is useful when there are few results or columns are very large.

    Early materialization can be better when there are many results or the columns are very narrow.

    If True, then all columns are late materialized. If False, then all columns are early materialized. If a list of strings, then only the columns in the list are late materialized.

    The default uses a heuristic that assumes filters will select about 0.1% of the rows. If your filter is more selective (e.g. find by id) you may want to set this to True. If your filter is not very selective (e.g. matches 20% of the rows) you may want to set this to False.

  • full_text_query (str or dict, optional) –

    Query string to search for; the results will be ranked by BM25. For example, “hello world” would match documents containing “hello” or “world”. Alternatively, a dictionary with the following keys:

    • columns: list[str]

      The columns to search, currently only supports a single column in the columns list.

    • query: str

      The query string to search for.

  • fast_search (bool, default False) – If True, then the search will only be performed on the indexed data, which yields faster search time.

Notes

For now, if BOTH filter and nearest are specified, then:

  1. nearest is executed first.

  2. The results are filtered afterwards.

For debugging ANN results, you can choose not to use the index even if one is present by specifying use_index=False. For example, the following will always return exact KNN results:

dataset.to_table(nearest={
    "column": "vector",
    "k": 10,
    "q": <query vector>,
    "use_index": False
})
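A hedged example of building a scanner with a projection, filter, and batch size (the path and column names are hypothetical; the full-text query assumes an FTS index exists on the text column):

import lance

dataset = lance.dataset("/tmp/data.lance")
scanner = dataset.scanner(
    columns=["id", "text"],
    filter="id > 100",
    batch_size=1024,
)
table = scanner.to_table()  # materialize all results at once

# Streaming alternative:
for batch in dataset.scanner(columns=["id"], batch_size=1024).to_batches():
    ...

# Full-text search, ranked by BM25
matches = dataset.scanner(full_text_query="hello world").to_table()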
LanceDataset.to_batches(columns: List[str] | Dict[str, str] | None = None, filter: str | Expression | None = None, limit: int | None = None, offset: int | None = None, nearest: dict | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, scan_in_order: bool | None = None, *, prefilter: bool | None = None, with_row_id: bool | None = None, with_row_address: bool | None = None, use_stats: bool | None = None, full_text_query: str | dict | None = None, io_buffer_size: int | None = None, late_materialization: bool | List[str] | None = None, use_scalar_index: bool | None = None, **kwargs) Iterator[RecordBatch]

Read the dataset as materialized record batches.

Parameters:

**kwargs (dict, optional) – Arguments for Scanner.from_dataset.

Returns:

record_batches

Return type:

Iterator of RecordBatch
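For instance, a sketch of a streaming scan (the column name is hypothetical):

for batch in dataset.to_batches(columns=["a"], batch_size=4096):
    # each item is a pyarrow.RecordBatch; sizes follow batch_size as a target
    print(batch.num_rows)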

LanceDataset.to_table(columns: List[str] | Dict[str, str] | None = None, filter: str | Expression | None = None, limit: int | None = None, offset: int | None = None, nearest: dict | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, scan_in_order: bool | None = None, *, prefilter: bool | None = None, with_row_id: bool | None = None, with_row_address: bool | None = None, use_stats: bool | None = None, fast_search: bool | None = None, full_text_query: str | dict | None = None, io_buffer_size: int | None = None, late_materialization: bool | List[str] | None = None, use_scalar_index: bool | None = None) Table

Read the data into memory as a pyarrow.Table.

Parameters:
  • columns (list of str, or dict of str to str, default None) – List of column names to be fetched. Or a dictionary of column names to SQL expressions. All columns are fetched if None or unspecified.

  • filter (pa.compute.Expression or str) –

    Expression or str that is a valid SQL where clause. See Lance filter pushdown for valid SQL expressions.

  • limit (int, default None) – Fetch up to this many rows. All rows if None or unspecified.

  • offset (int, default None) – Fetch starting with this row. 0 if None or unspecified.

  • nearest (dict, default None) –

    Get the rows corresponding to the K most similar vectors. Example:

    {
        "column": <embedding col name>,
        "q": <query vector as pa.Float32Array>,
        "k": 10,
        "metric": "cosine",
        "nprobes": 1,
        "refine_factor": 1
    }
    

  • batch_size (int, optional) – The number of rows to read at a time.

  • io_buffer_size (int, default None) – The size of the IO buffer. See ScannerBuilder.io_buffer_size for more information.

  • batch_readahead (int, optional) – The number of batches to read ahead.

  • fragment_readahead (int, optional) – The number of fragments to read ahead.

  • scan_in_order (bool, optional, default True) – Whether to read the fragments and batches in order. If false, throughput may be higher, but batches will be returned out of order and memory use might increase.

  • prefilter (bool, optional, default False) – Run filter before the vector search.

  • late_materialization (bool or List[str], default None) – Allows custom control over late materialization. See ScannerBuilder.late_materialization for more information.

  • use_scalar_index (bool, default True) – Allows custom control over scalar index usage. See ScannerBuilder.use_scalar_index for more information.

  • with_row_id (bool, optional, default False) – Return row ID.

  • with_row_address (bool, optional, default False) – Return row address.

  • use_stats (bool, optional, default True) – Use stats pushdown during filters.

  • fast_search (bool, optional, default False) – If True, then the search will only be performed on the indexed data, which yields faster search time.

  • full_text_query (str or dict, optional) –

    Query string to search for; the results will be ranked by BM25. For example, “hello world” would match documents containing “hello” or “world”. Alternatively, a dictionary with the following keys:

    • columns: list[str]

      The columns to search, currently only supports a single column in the columns list.

    • query: str

      The query string to search for.

Notes

If BOTH filter and nearest are specified, then:

  1. nearest is executed first.

  2. The results are filtered afterward, unless prefilter is set to True.
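As an illustration, a hedged sketch of a filtered vector search (the column names and query vector are hypothetical):

import pyarrow as pa

query = pa.array([0.1, 0.2, 0.3, 0.4], type=pa.float32())
table = dataset.to_table(
    columns=["id"],
    filter="id > 100",
    prefilter=True,  # apply the filter before the vector search
    nearest={"column": "vector", "q": query, "k": 10},
)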

Random Access

Lance stands out from other columnar formats with its very fast random access.

LanceDataset.take(indices: List[int] | Array, columns: List[str] | Dict[str, str] | None = None, **kwargs) Table

Select rows of data by index.

Parameters:
  • indices (Array or array-like) – Indices of rows to select in the dataset.

  • columns (list of str, or dict of str to str, default None) – List of column names to be fetched. Or a dictionary of column names to SQL expressions. All columns are fetched if None or unspecified.

  • **kwargs (dict, optional) – See the scanner() method for a full parameter description.

Returns:

table

Return type:

pyarrow.Table
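For instance (the column name is hypothetical):

# Fetch three specific rows by index, projecting a single column
table = dataset.take([0, 10, 100], columns=["a"])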

LanceDataset.take_blobs(row_ids: List[int] | Array, blob_column: str) List[BlobFile]

Select blobs by row IDs.

Instead of loading large binary blob data into memory before processing it, this API allows you to open binary blob data as a regular Python file-like object. For more details, see lance.BlobFile.

Parameters:
  • row_ids (List, Array, or array-like) – Row IDs to select in the dataset.

  • blob_column (str) – The name of the blob column to select.

Returns:

blob_files

Return type:

List[BlobFile]
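A sketch, assuming the dataset has a blob-encoded column named "video" (hypothetical):

blobs = dataset.take_blobs([0, 1], blob_column="video")
with blobs[0] as f:
    header = f.read(16)  # reads lazily, like a regular file object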

Schema Evolution

Lance supports schema evolution, which means that you can add new columns to the dataset cheaply.

LanceDataset.add_columns(transforms: Dict[str, str] | BatchUDF | ReaderLike, read_columns: List[str] | None = None, reader_schema: pa.Schema | None = None, batch_size: int | None = None)

Add new columns with defined values.

There are several ways to specify the new columns. First, you can provide SQL expressions for each new column. Second, you can provide a UDF that takes a batch of existing data and returns a new batch with the new columns. These new columns will be appended to the dataset.

You can also provide a RecordBatchReader which will read the new column values from some external source. This is often useful when the new column values have already been staged to files (often by some distributed process).

See the lance.add_columns_udf() decorator for more information on writing UDFs.

Parameters:
  • transforms (dict or AddColumnsUDF or ReaderLike) – If this is a dictionary, then the keys are the names of the new columns and the values are SQL expression strings. These strings can reference existing columns in the dataset. If this is an AddColumnsUDF, then it is a UDF that takes a batch of existing data and returns a new batch with the new columns.

  • read_columns (list of str, optional) – The names of the columns that the UDF will read. If None, then the UDF will read all columns. This is only used when transforms is a UDF. Otherwise, the read columns are inferred from the SQL expressions.

  • reader_schema (pa.Schema, optional) – Only valid if transforms is a ReaderLike object. This will be used to determine the schema of the reader.

  • batch_size (int, optional) – The number of rows to read at a time from the source dataset when applying the transform. This is ignored if the dataset is a v1 dataset.

Examples

>>> import lance
>>> import pandas as pd
>>> import pyarrow as pa
>>> table = pa.table({"a": [1, 2, 3]})
>>> dataset = lance.write_dataset(table, "my_dataset")
>>> @lance.batch_udf()
... def double_a(batch):
...     df = batch.to_pandas()
...     return pd.DataFrame({'double_a': 2 * df['a']})
>>> dataset.add_columns(double_a)
>>> dataset.to_table().to_pandas()
   a  double_a
0  1         2
1  2         4
2  3         6
>>> dataset.add_columns({"triple_a": "a * 3"})
>>> dataset.to_table().to_pandas()
   a  double_a  triple_a
0  1         2         3
1  2         4         6
2  3         6         9
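The ReaderLike path described above might look like the following sketch; it assumes the new column's values (the hypothetical quadruple_a) were precomputed, e.g. by a distributed job, and arrive aligned row-for-row with the dataset:

>>> precomputed = pa.table({"quadruple_a": [4, 8, 12]})
>>> reader = pa.RecordBatchReader.from_batches(
...     precomputed.schema, precomputed.to_batches()
... )
>>> dataset.add_columns(reader)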

See also

LanceDataset.merge

Merge a pre-computed set of columns into the dataset.

LanceDataset.drop_columns(columns: List[str])

Drop one or more columns from the dataset.

This is a metadata-only operation and does not remove the data from the underlying storage. In order to remove the data, you must subsequently call compact_files to rewrite the data without the removed columns and then call cleanup_old_versions to remove the old files.

Parameters:

columns (list of str) – The names of the columns to drop. These can be nested column references (e.g. “a.b.c”) or top-level column names (e.g. “a”).

Examples

>>> import lance
>>> import pyarrow as pa
>>> table = pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]})
>>> dataset = lance.write_dataset(table, "example")
>>> dataset.drop_columns(["a"])
>>> dataset.to_table().to_pandas()
   b
0  a
1  b
2  c
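To actually reclaim storage afterwards (per the note above), a sketch assuming the current compaction and cleanup APIs:

from datetime import timedelta

# Rewrite the data files without the dropped column
dataset.optimize.compact_files()
# Then delete the old version files
dataset.cleanup_old_versions(older_than=timedelta(0))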

Indexing and Searching

LanceDataset.create_index(column: str | List[str], index_type: str, name: str | None = None, metric: str = 'L2', replace: bool = False, num_partitions: int | None = None, ivf_centroids: np.ndarray | pa.FixedSizeListArray | pa.FixedShapeTensorArray | None = None, pq_codebook: np.ndarray | pa.FixedSizeListArray | pa.FixedShapeTensorArray | None = None, num_sub_vectors: int | None = None, accelerator: str | 'torch.Device' | None = None, index_cache_size: int | None = None, shuffle_partition_batches: int | None = None, shuffle_partition_concurrency: int | None = None, ivf_centroids_file: str | None = None, precomputed_partition_dataset: str | None = None, storage_options: Dict[str, str] | None = None, filter_nan: bool = True, one_pass_ivfpq: bool = False, **kwargs) LanceDataset

Create an index on a column.

Experimental API

Parameters:
  • column (str) – The column to be indexed.

  • index_type (str) – The type of the index. "IVF_PQ", "IVF_HNSW_PQ", and "IVF_HNSW_SQ" are supported now.

  • name (str, optional) – The index name. If not provided, it will be generated from the column name.

  • metric (str) – The distance metric type, i.e., “L2” (alias to “euclidean”), “cosine” or “dot” (dot product). Default is “L2”.

  • replace (bool) – Replace the existing index if it exists.

  • num_partitions (int, optional) – The number of partitions of IVF (Inverted File Index).

  • ivf_centroids (optional) – It can be either np.ndarray, pyarrow.FixedSizeListArray or pyarrow.FixedShapeTensorArray. A num_partitions x dimension array of existing K-means centroids for IVF clustering. If not provided, a new KMeans model will be trained.

  • pq_codebook (optional,) –

    It can be np.ndarray, pyarrow.FixedSizeListArray, or pyarrow.FixedShapeTensorArray. A num_sub_vectors x (2 ^ nbits * dimensions // num_sub_vectors) array of K-means centroids for the PQ codebook.

    Note: nbits is always 8 for now. If not provided, a new PQ model will be trained.

  • num_sub_vectors (int, optional) – The number of sub-vectors for PQ (Product Quantization).

  • accelerator (str or torch.Device, optional) – If set, use an accelerator to speed up the training process. Accepted accelerator: “cuda” (Nvidia GPU) and “mps” (Apple Silicon GPU). If not set, use the CPU.

  • index_cache_size (int, optional) – The size of the index cache in number of entries. Default value is 256.

  • shuffle_partition_batches (int, optional) –

    The number of batches, using the row group size of the dataset, to include in each shuffle partition. Default value is 10240.

    Assuming the row group size is 1024, each shuffle partition will hold 10240 * 1024 = 10,485,760 rows. By making this value smaller, this shuffle will consume less memory but will take longer to complete, and vice versa.

  • shuffle_partition_concurrency (int, optional) –

    The number of shuffle partitions to process concurrently. Default value is 2.

    By making this value smaller, this shuffle will consume less memory but will take longer to complete, and vice versa.

  • storage_options (optional, dict) – Extra options that make sense for a particular storage connection. This is used to store connection parameters like credentials, endpoint, etc.

  • filter_nan (bool) – Defaults to True. If False, the null filter used for nullable columns is disabled, which gives a small speed boost but is UNSAFE: it will cause a crash if any null/NaN values are present.

  • one_pass_ivfpq (bool) – Defaults to False. If enabled, index type must be “IVF_PQ”. Reduces disk IO.

  • kwargs – Parameters passed to the index building process.

SQ (Scalar Quantization) is only available for the IVF_HNSW_SQ index type. This quantization method reduces the memory usage of the index by mapping float vectors to integer vectors, where each integer uses num_bits; only 8 bits are currently supported.

If index_type is “IVF_*”, then the following parameters are required:

num_partitions

If index_type is with “PQ”, then the following parameters are required:

num_sub_vectors

Optional parameters for IVF_PQ:

  • ivf_centroids

    Existing K-means centroids for IVF clustering.

  • num_bits

    The number of bits for PQ (Product Quantization). Default is 8. Only 4 and 8 are supported.

Optional parameters for IVF_HNSW_*:

  • max_level

    Int, the maximum number of levels in the graph.

  • m

    Int, the number of edges per node in the graph.

  • ef_construction

    Int, the number of nodes to examine during construction.

Examples

import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_PQ",
    num_partitions=256,
    num_sub_vectors=16
)

import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_HNSW_SQ",
    num_partitions=256,
)

Experimental Accelerator (GPU) support:

  • accelerator: use a GPU to train IVF partitions.

    Only CUDA (Nvidia) and MPS (Apple) are currently supported. Requires PyTorch to be installed.

import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_PQ",
    num_partitions=256,
    num_sub_vectors=16,
    accelerator="cuda"
)


API Reference

More information can be found in the API reference.