Python APIs

Lance is a columnar format that is specifically designed for efficient multi-modal data processing.

Lance Dataset

The core of Lance is the LanceDataset class. Users can open a dataset with lance.dataset().

lance.dataset(uri: str | Path, version: int | str | None = None, asof: ts_types | None = None, block_size: int | None = None, commit_lock: CommitLock | None = None, index_cache_size: int | None = None, storage_options: dict[str, str] | None = None, default_scan_options: dict[str, str] | None = None) LanceDataset

Opens the Lance dataset from the address specified.

Parameters:
uri : str

Address to the Lance dataset. It can be a local file path, e.g., /tmp/data.lance, or a cloud object store URI, e.g., s3://bucket/data.lance.

version : optional, int | str

If specified, load a specific version of the Lance dataset. Else, loads the latest version. A version number (int) or a tag (str) can be provided.

asof : optional, datetime or str

If specified, find the latest version created on or earlier than the given argument value. If a version is already specified, this arg is ignored.

block_size : optional, int

Block size in bytes. Provide a hint for the size of the minimal I/O request.

commit_lock : optional, lance.commit.CommitLock

A custom commit lock. Only needed if your object store does not support atomic commits. See the user guide for more details.

index_cache_size : optional, int

Index cache size. The index cache is an LRU cache with TTL. This number specifies the number of index pages, for example, IVF partitions, to be cached in host memory. Default value is 256.

Roughly, for an IVF_PQ partition with n rows, the size of each index page equals the combination of the PQ codes (np.array([n, pq], dtype=uint8)) and the row ids (np.array([n], dtype=uint64)). Approximately, n = total rows / number of IVF partitions, and pq = number of PQ sub-vectors.

storage_options : optional, dict

Extra options that make sense for a particular storage connection. This is used to store connection parameters like credentials, endpoint, etc.

default_scan_options : optional, dict

Default scan options that are used when scanning the dataset. This accepts the same arguments described in lance.LanceDataset.scanner(). The arguments will be applied to any scan operation.

This can be useful to supply defaults for common parameters such as batch_size.

It can also be used to create a view of the dataset that includes meta fields such as _rowid or _rowaddr. If default_scan_options is provided, then the schema returned by lance.LanceDataset.schema will include these fields if the appropriate scan options are set.
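As a sketch of typical usage (paths and credential key names are illustrative):

import lance

# Open the latest version of a local dataset.
dataset = lance.dataset("/tmp/data.lance")

# Time travel: open version 3, or the latest version as of a timestamp.
dataset_v3 = lance.dataset("/tmp/data.lance", version=3)
dataset_old = lance.dataset("/tmp/data.lance", asof="2024-01-01")

# Object stores accept connection parameters via storage_options.
dataset_s3 = lance.dataset(
    "s3://bucket/data.lance",
    storage_options={"access_key_id": "...", "secret_access_key": "..."},
)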

Basic IOs

The following functions are used to read and write data in Lance format.

LanceDataset.insert(data: ReaderLike, *, mode='append', **kwargs)

Insert data into the dataset.

Parameters:
data : ReaderLike

The data to be written. Acceptable types are:

  • Pandas DataFrame, Pyarrow Table, Dataset, Scanner, or RecordBatchReader

  • Huggingface dataset

mode : str, default 'append'

The mode to use when writing the data. Options are:

  • create - create a new dataset (raises if uri already exists).

  • overwrite - create a new snapshot version.

  • append - create a new version that is the concatenation of the input with the latest version (raises if uri does not exist).

**kwargs : dict, optional

Additional keyword arguments to pass to write_dataset().
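A sketch of typical usage (the path is illustrative; mode="append" is the default):

import lance
import pyarrow as pa

dataset = lance.dataset("/tmp/data.lance")
new_rows = pa.table({"id": [4, 5], "text": ["d", "e"]})  # columns must match the dataset schema
dataset.insert(new_rows)                 # appends as a new version
dataset.insert(new_rows, mode="append")  # equivalent, with the mode spelled out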

LanceDataset.scanner(columns: list[str] | dict[str, str] | None = None, filter: Expression | str | None = None, limit: int | None = None, offset: int | None = None, nearest: dict | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, scan_in_order: bool | None = None, fragments: Iterable[LanceFragment] | None = None, full_text_query: str | dict | FullTextQuery | None = None, *, prefilter: bool | None = None, with_row_id: bool | None = None, with_row_address: bool | None = None, use_stats: bool | None = None, fast_search: bool | None = None, io_buffer_size: int | None = None, late_materialization: bool | list[str] | None = None, use_scalar_index: bool | None = None, include_deleted_rows: bool | None = None) LanceScanner

Return a Scanner that can support various pushdowns.

Parameters:
columns : list of str, or dict of str to str default None

List of column names to be fetched. Or a dictionary of column names to SQL expressions. All columns are fetched if None or unspecified.

filter : pa.compute.Expression or str

Expression or str that is a valid SQL where clause. See Lance filter pushdown for valid SQL expressions.

limit : int, default None

Fetch up to this many rows. All rows if None or unspecified.

offset : int, default None

Fetch starting with this row. 0 if None or unspecified.

nearest : dict, default None

Get the rows corresponding to the K most similar vectors. Example:

{
    "column": <embedding col name>,
    "q": <query vector as pa.Float32Array>,
    "k": 10,
    "nprobes": 1,
    "refine_factor": 1
}

batch_size : int, default None

The target size of batches returned. In some cases batches can be up to twice this size (but never larger than that); in other cases batches can be smaller than this size.

io_buffer_size : int, default None

The size of the IO buffer. See ScannerBuilder.io_buffer_size for more information.

batch_readahead : int, optional

The number of batches to read ahead.

fragment_readahead : int, optional

The number of fragments to read ahead.

scan_in_order : bool, default True

Whether to read the fragments and batches in order. If false, throughput may be higher, but batches will be returned out of order and memory use might increase.

fragments : iterable of LanceFragment, default None

If specified, only scan these fragments. If scan_in_order is True, then the fragments will be scanned in the order given.

prefilter : bool, default False

If True then the filter will be applied before the vector query is run. This will generate more correct results but it may be a more costly query. It’s generally good when the filter is highly selective.

If False then the filter will be applied after the vector query is run. This will perform well but the results may have fewer than the requested number of rows (or be empty) if the rows closest to the query do not match the filter. It’s generally good when the filter is not very selective.

use_scalar_index : bool, default True

Lance will automatically use scalar indices to optimize a query. In some corner cases this can make query performance worse and this parameter can be used to disable scalar indices in these cases.

late_materialization : bool or List[str], default None

Allows custom control over late materialization. Late materialization fetches non-query columns using a take operation after the filter. This is useful when there are few results or columns are very large.

Early materialization can be better when there are many results or the columns are very narrow.

If True, then all columns are late materialized. If False, then all columns are early materialized. If a list of strings, then only the columns in the list are late materialized.

The default uses a heuristic that assumes filters will select about 0.1% of the rows. If your filter is more selective (e.g. find by id) you may want to set this to True. If your filter is not very selective (e.g. matches 20% of the rows) you may want to set this to False.

full_text_query : str or dict, optional

Query string to search for; the results will be ranked by BM25. For example, "hello world" would match documents containing "hello" or "world". Alternatively, a dictionary with the following keys:

  • columns: list[str]

    The columns to search. Currently only a single column is supported in the columns list.

  • query: str

    The query string to search for.

fast_search : bool, default False

If True, then the search will only be performed on the indexed data, which yields faster search time.

include_deleted_rows : bool, default False

If True, then rows that have been deleted, but are still present in the fragment, will be returned. These rows will have the _rowid column set to null. All other columns will reflect the value stored on disk and may not be null.

Note: if this is a search operation, or a take operation (including scalar indexed scans) then deleted rows cannot be returned.

Note

For now, if BOTH filter and nearest are specified, then:

  1. nearest is executed first.

  2. The results are filtered afterwards.

For debugging ANN results, you can choose to not use the index even if present by specifying use_index=False. For example, the following will always return exact KNN results:

dataset.to_table(nearest={
    "column": "vector",
    "k": 10,
    "q": <query vector>,
    "use_index": False
})
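Beyond the ANN examples above, a sketch of a general scan combining projection, SQL filter pushdown, and full-text search (the path and column names are hypothetical):

import lance

dataset = lance.dataset("/tmp/docs.lance")
scanner = dataset.scanner(
    columns={"id": "id", "id2": "id * 2"},  # project with SQL expressions
    filter="id > 100",
    full_text_query="hello world",  # assumes an FTS/INVERTED index on a text column
    batch_size=1024,
)
table = scanner.to_table()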
LanceDataset.to_batches(columns: list[str] | dict[str, str] | None = None, filter: Expression | str | None = None, limit: int | None = None, offset: int | None = None, nearest: dict | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, scan_in_order: bool | None = None, *, prefilter: bool | None = None, with_row_id: bool | None = None, with_row_address: bool | None = None, use_stats: bool | None = None, full_text_query: str | dict | None = None, io_buffer_size: int | None = None, late_materialization: bool | list[str] | None = None, use_scalar_index: bool | None = None, **kwargs) Iterator[RecordBatch]

Read the dataset as materialized record batches.

Parameters:
**kwargs : dict, optional

Arguments for scanner().

Returns:

record_batches

Return type:

Iterator of RecordBatch
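A sketch of streaming over a large dataset without materializing it all at once (the path is hypothetical):

import lance

dataset = lance.dataset("/tmp/data.lance")
num_rows = 0
for batch in dataset.to_batches(columns=["id"], batch_size=4096):
    # Each item is a pyarrow.RecordBatch of roughly batch_size rows.
    num_rows += batch.num_rows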

LanceDataset.to_table(columns: list[str] | dict[str, str] | None = None, filter: Expression | str | None = None, limit: int | None = None, offset: int | None = None, nearest: dict | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, scan_in_order: bool | None = None, *, prefilter: bool | None = None, with_row_id: bool | None = None, with_row_address: bool | None = None, use_stats: bool | None = None, fast_search: bool | None = None, full_text_query: str | dict | FullTextQuery | None = None, io_buffer_size: int | None = None, late_materialization: bool | list[str] | None = None, use_scalar_index: bool | None = None, include_deleted_rows: bool | None = None) Table

Read the data into memory as a pyarrow.Table

Parameters:
columns : list of str, or dict of str to str default None

List of column names to be fetched. Or a dictionary of column names to SQL expressions. All columns are fetched if None or unspecified.

filter : pa.compute.Expression or str

Expression or str that is a valid SQL where clause. See Lance filter pushdown for valid SQL expressions.

limit : int, default None

Fetch up to this many rows. All rows if None or unspecified.

offset : int, default None

Fetch starting with this row. 0 if None or unspecified.

nearest : dict, default None

Get the rows corresponding to the K most similar vectors. Example:

{
    "column": <embedding col name>,
    "q": <query vector as pa.Float32Array>,
    "k": 10,
    "metric": "cosine",
    "nprobes": 1,
    "refine_factor": 1
}

batch_size : int, optional

The number of rows to read at a time.

io_buffer_size : int, default None

The size of the IO buffer. See ScannerBuilder.io_buffer_size for more information.

batch_readahead : int, optional

The number of batches to read ahead.

fragment_readahead : int, optional

The number of fragments to read ahead.

scan_in_order : bool, optional, default True

Whether to read the fragments and batches in order. If false, throughput may be higher, but batches will be returned out of order and memory use might increase.

prefilter : bool, optional, default False

Run filter before the vector search.

late_materialization : bool or List[str], default None

Allows custom control over late materialization. See ScannerBuilder.late_materialization for more information.

use_scalar_index : bool, default True

Allows custom control over scalar index usage. See ScannerBuilder.use_scalar_index for more information.

with_row_id : bool, optional, default False

Return row ID.

with_row_address : bool, optional, default False

Return row address.

use_stats : bool, optional, default True

Use stats pushdown during filters.

full_text_query : str or dict, optional

Query string to search for; the results will be ranked by BM25. For example, "hello world" would match documents containing "hello" or "world". Alternatively, a dictionary with the following keys:

  • columns: list[str]

    The columns to search, currently only supports a single column in the columns list.

  • query: str

    The query string to search for.

include_deleted_rows : bool, optional, default False

If True, then rows that have been deleted, but are still present in the fragment, will be returned. These rows will have the _rowid column set to null. All other columns will reflect the value stored on disk and may not be null.

Note: if this is a search operation, or a take operation (including scalar indexed scans) then deleted rows cannot be returned.

Notes

If BOTH filter and nearest are specified, then:

  1. nearest is executed first.

  2. The results are filtered afterward, unless prefilter is set to True.
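A sketch of a prefiltered vector search returning a pyarrow.Table (the path, column names, and dimension are hypothetical):

import lance
import numpy as np

dataset = lance.dataset("/tmp/data.lance")
query = np.random.rand(128).astype(np.float32)  # must match the vector column's dimension
table = dataset.to_table(
    columns=["id"],
    filter="label = 'cat'",  # applied before the ANN search because prefilter=True
    prefilter=True,
    nearest={"column": "vector", "q": query, "k": 10},
)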

Random Access

Unlike most columnar formats, Lance offers fast random access to individual rows.

LanceDataset.take(indices: list[int] | Array, columns: list[str] | dict[str, str] | None = None) Table

Select rows of data by index.

Parameters:
indices : Array or array-like

indices of rows to select in the dataset.

columns : list of str, or dict of str to str default None

List of column names to be fetched. Or a dictionary of column names to SQL expressions. All columns are fetched if None or unspecified.

Returns:

table

Return type:

pyarrow.Table
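For example (a sketch; the path and column names are hypothetical):

import lance

dataset = lance.dataset("/tmp/data.lance")
# Fetch three specific rows by position, projecting two columns.
subset = dataset.take([1, 5, 9], columns=["id", "text"])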

LanceDataset.take_blobs(row_ids: list[int] | Array, blob_column: str) list[BlobFile]

Select blobs by row IDs.

Instead of loading large binary blob data into memory before processing it, this API allows you to open binary blob data as a regular Python file-like object. For more details, see lance.BlobFile.

Parameters:
row_ids : list of int, Array, or array-like

row IDs to select in the dataset.

blob_column : str

The name of the blob column to select.

Returns:

blob_files

Return type:

List[BlobFile]
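A sketch of streaming part of a blob instead of loading it whole (the path and column name are hypothetical; BlobFile is a regular file-like object):

import lance

dataset = lance.dataset("/tmp/videos.lance")
blobs = dataset.take_blobs([0], blob_column="video")
with blobs[0] as f:
    header = f.read(16)  # read only the first 16 bytes of the blob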

Schema Evolution

Lance supports schema evolution, which means that you can add new columns to the dataset cheaply.

LanceDataset.add_columns(transforms: dict[str, str] | BatchUDF | ReaderLike | pyarrow.Field | list[pyarrow.Field] | pyarrow.Schema, read_columns: list[str] | None = None, reader_schema: pa.Schema | None = None, batch_size: int | None = None)

Add new columns with defined values.

There are several ways to specify the new columns. First, you can provide SQL expressions for each new column. Second, you can provide a UDF that takes a batch of existing data and returns a new batch with the new columns. These new columns will be appended to the dataset.

You can also provide a RecordBatchReader which will read the new column values from some external source. This is often useful when the new column values have already been staged to files (often by some distributed process).

See the lance.add_columns_udf() decorator for more information on writing UDFs.

Parameters:
transforms : dict or AddColumnsUDF or ReaderLike

If this is a dictionary, then the keys are the names of the new columns and the values are SQL expression strings. These strings can reference existing columns in the dataset. If this is an AddColumnsUDF, then it is a UDF that takes a batch of existing data and returns a new batch with the new columns. If this is a pyarrow.Field or pyarrow.Schema, it adds all-NULL columns with the given schema, in a metadata-only operation.

read_columns : list of str, optional

The names of the columns that the UDF will read. If None, then the UDF will read all columns. This is only used when transforms is a UDF. Otherwise, the read columns are inferred from the SQL expressions.

reader_schema : pa.Schema, optional

Only valid if transforms is a ReaderLike object. This will be used to determine the schema of the reader.

batch_size : int, optional

The number of rows to read at a time from the source dataset when applying the transform. This is ignored if the dataset is a v1 dataset.

Examples

>>> import lance
>>> import pandas as pd
>>> import pyarrow as pa
>>> table = pa.table({"a": [1, 2, 3]})
>>> dataset = lance.write_dataset(table, "my_dataset")
>>> @lance.batch_udf()
... def double_a(batch):
...     df = batch.to_pandas()
...     return pd.DataFrame({'double_a': 2 * df['a']})
>>> dataset.add_columns(double_a)
>>> dataset.to_table().to_pandas()
   a  double_a
0  1         2
1  2         4
2  3         6
>>> dataset.add_columns({"triple_a": "a * 3"})
>>> dataset.to_table().to_pandas()
   a  double_a  triple_a
0  1         2         3
1  2         4         6
2  3         6         9
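A sketch of the metadata-only path described above, continuing the same doctest (the column name is illustrative; passing a pyarrow.Field adds an all-NULL column without rewriting data):

>>> dataset.add_columns(pa.field("embedding", pa.string()))
>>> "embedding" in dataset.schema.names
True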

See also

LanceDataset.merge

Merge a pre-computed set of columns into the dataset.

LanceDataset.drop_columns(columns: list[str])

Drop one or more columns from the dataset

This is a metadata-only operation and does not remove the data from the underlying storage. In order to remove the data, you must subsequently call compact_files to rewrite the data without the removed columns and then call cleanup_old_versions to remove the old files.

Parameters:
columns : list of str

The names of the columns to drop. These can be nested column references (e.g. “a.b.c”) or top-level column names (e.g. “a”).

Examples

>>> import lance
>>> import pyarrow as pa
>>> table = pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]})
>>> dataset = lance.write_dataset(table, "example")
>>> dataset.drop_columns(["a"])
>>> dataset.to_table().to_pandas()
   b
0  a
1  b
2  c
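To physically reclaim the space, as described above, the follow-up is a compaction plus a cleanup. A sketch, continuing the example (compact_files lives on the dataset's optimize namespace; older_than=timedelta(0) removes all non-latest versions):

from datetime import timedelta

dataset.optimize.compact_files()
dataset.cleanup_old_versions(older_than=timedelta(0))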

Indexing and Searching

LanceDataset.create_index(column: str | list[str], index_type: str, name: str | None = None, metric: str = 'L2', replace: bool = False, num_partitions: int | None = None, ivf_centroids: np.ndarray | pa.FixedSizeListArray | pa.FixedShapeTensorArray | None = None, pq_codebook: np.ndarray | pa.FixedSizeListArray | pa.FixedShapeTensorArray | None = None, num_sub_vectors: int | None = None, accelerator: str | 'torch.Device' | None = None, index_cache_size: int | None = None, shuffle_partition_batches: int | None = None, shuffle_partition_concurrency: int | None = None, ivf_centroids_file: str | None = None, precomputed_partition_dataset: str | None = None, storage_options: dict[str, str] | None = None, filter_nan: bool = True, one_pass_ivfpq: bool = False, **kwargs) LanceDataset

Create index on column.

Experimental API

Parameters:
column : str

The column to be indexed.

index_type : str

The type of the index. "IVF_PQ", "IVF_HNSW_PQ", and "IVF_HNSW_SQ" are supported now.

name : str, optional

The index name. If not provided, it will be generated from the column name.

metric : str

The distance metric type, i.e., “L2” (alias to “euclidean”), “cosine” or “dot” (dot product). Default is “L2”.

replace : bool

Replace the existing index if it exists.

num_partitions : int, optional

The number of partitions of IVF (Inverted File Index).

ivf_centroids : optional

It can be either np.ndarray, pyarrow.FixedSizeListArray or pyarrow.FixedShapeTensorArray. A num_partitions x dimension array of existing K-means centroids for IVF clustering. If not provided, a new KMeans model will be trained.

pq_codebook : optional

It can be np.ndarray, pyarrow.FixedSizeListArray, or pyarrow.FixedShapeTensorArray. A num_sub_vectors x (2 ^ nbits * dimensions // num_sub_vectors) array of K-means centroids for the PQ codebook.

Note: nbits is always 8 for now. If not provided, a new PQ model will be trained.

num_sub_vectors : int, optional

The number of sub-vectors for PQ (Product Quantization).

accelerator : str or torch.Device, optional

If set, use an accelerator to speed up the training process. Accepted accelerator: “cuda” (Nvidia GPU) and “mps” (Apple Silicon GPU). If not set, use the CPU.

index_cache_size : int, optional

The size of the index cache in number of entries. Default value is 256.

shuffle_partition_batches : int, optional

The number of batches, using the row group size of the dataset, to include in each shuffle partition. Default value is 10240.

Assuming the row group size is 1024, each shuffle partition will hold 10240 * 1024 = 10,485,760 rows. By making this value smaller, this shuffle will consume less memory but will take longer to complete, and vice versa.

shuffle_partition_concurrency : int, optional

The number of shuffle partitions to process concurrently. Default value is 2.

By making this value smaller, this shuffle will consume less memory but will take longer to complete, and vice versa.

storage_options : optional, dict

Extra options that make sense for a particular storage connection. This is used to store connection parameters like credentials, endpoint, etc.

filter_nan : bool

Defaults to True. If False, disables the null/NaN filter used for nullable columns, which gives a small speed boost but is UNSAFE: it will cause a crash if any null/NaN values are present (and otherwise will not).

one_pass_ivfpq : bool

Defaults to False. If enabled, the index type must be "IVF_PQ"; the index is built in a single pass, reducing disk IO.

**kwargs

Parameters passed to the index building process.

SQ (Scalar Quantization) is available only for the IVF_HNSW_SQ index type. This quantization method reduces the memory usage of the index by mapping float vectors to integer vectors, where each integer uses num_bits; currently only 8 bits are supported.

If index_type is “IVF_*”, then the following parameters are required:

num_partitions

If index_type includes "PQ", then the following parameters are required:

num_sub_vectors

Optional parameters for IVF_PQ:

  • ivf_centroids

    Existing K-mean centroids for IVF clustering.

  • num_bits

    The number of bits for PQ (Product Quantization). Default is 8. Only 4, 8 are supported.

Optional parameters for IVF_HNSW_*:

  • max_level

    Int, the maximum number of levels in the graph.

  • m

    Int, the number of edges per node in the graph.

  • ef_construction

    Int, the number of nodes to examine during construction.

Examples

import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_PQ",
    num_partitions=256,
    num_sub_vectors=16
)
import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_HNSW_SQ",
    num_partitions=256,
)

Experimental accelerator (GPU) support:

  • accelerator: use a GPU to train IVF partitions.

    Only CUDA (Nvidia) and MPS (Apple) are currently supported. Requires PyTorch to be installed.

import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_PQ",
    num_partitions=256,
    num_sub_vectors=16,
    accelerator="cuda"
)


LanceDataset.create_scalar_index(column: str, index_type: 'BTREE' | 'BITMAP' | 'LABEL_LIST' | 'INVERTED' | 'FTS' | 'NGRAM', name: str | None = None, *, replace: bool = True, **kwargs)

Create a scalar index on a column.

Scalar indices, like vector indices, can be used to speed up scans. A scalar index can speed up scans that contain filter expressions on the indexed column. For example, the following scan will be faster if the column my_col has a scalar index:

import lance

dataset = lance.dataset("/tmp/images.lance")
my_table = dataset.scanner(filter="my_col != 7").to_table()

Vector search with pre-filters can also benefit from scalar indices. For example,

import lance

dataset = lance.dataset("/tmp/images.lance")
my_table = dataset.scanner(
    nearest=dict(
        column="vector",
        q=[1, 2, 3, 4],
        k=10,
    ),
    filter="my_col != 7",
    prefilter=True
).to_table()

There are 5 types of scalar indices available today.

  • BTREE. The most common type is BTREE. This index is inspired by the btree data structure although only the first few layers of the btree are cached in memory. It will perform well on columns with a large number of unique values and few rows per value.

  • BITMAP. This index stores a bitmap for each unique value in the column. This index is useful for columns with a small number of unique values and many rows per value.

  • LABEL_LIST. A special index that is used to index list columns whose values have small cardinality. For example, a column that contains lists of tags (e.g. ["tag1", "tag2", "tag3"]) can be indexed with a LABEL_LIST index. This index can only speed up queries with array_has_any or array_has_all filters.

  • NGRAM. A special index that is used to index string columns. This index creates a bitmap for each ngram in the string. By default we use trigrams. This index can currently speed up queries using the contains function in filters.

  • FTS/INVERTED. It is used to index document columns. This index can conduct full-text searches, e.g., matching documents that contain any word of the query string "hello world". The results will be ranked by BM25.

Note that the LANCE_BYPASS_SPILLING environment variable can be used to bypass spilling to disk. Setting this to true can avoid memory exhaustion issues (see https://github.com/apache/datafusion/issues/10073 for more info).

Experimental API

Parameters:
column : str

The column to be indexed. Must be a boolean, integer, float, or string column.

index_type : str

The type of the index. One of "BTREE", "BITMAP", "LABEL_LIST", "NGRAM", "FTS" or "INVERTED".

name : str, optional

The index name. If not provided, it will be generated from the column name.

replace : bool, default True

Replace the existing index if it exists.

with_position : bool, default True

This is for the INVERTED index. If True, the index will store the positions of the words in the document, so that you can conduct phrase query. This will significantly increase the index size. It won’t impact the performance of non-phrase queries even if it is set to True.

base_tokenizer : str, default "simple"

This is for the INVERTED index. The base tokenizer to use. The value can be:

  • "simple": splits tokens on whitespace and punctuation.

  • "whitespace": splits tokens on whitespace.

  • "raw": no tokenization.

language : str, default "English"

This is for the INVERTED index. The language for stemming and stop words. This is only used when stem or remove_stop_words is true.

max_token_length : Optional[int], default 40

This is for the INVERTED index. The maximum token length. Any token longer than this will be removed.

lower_case : bool, default True

This is for the INVERTED index. If True, the index will convert all text to lowercase.

stem : bool, default False

This is for the INVERTED index. If True, the index will stem the tokens.

remove_stop_words : bool, default False

This is for the INVERTED index. If True, the index will remove stop words.

ascii_folding : bool, default False

This is for the INVERTED index. If True, the index will convert non-ascii characters to ascii characters if possible. This would remove accents like “é” -> “e”.

Examples

import lance

dataset = lance.dataset("/tmp/images.lance")
dataset.create_scalar_index(
    "category",
    "BTREE",
)
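An inverted (full-text) index using the tokenizer options above might look like this (a sketch; the path and column name are hypothetical):

import lance

dataset = lance.dataset("/tmp/docs.lance")
dataset.create_scalar_index(
    "caption",
    "INVERTED",
    with_position=False,   # skip phrase-query support to keep the index small
    base_tokenizer="simple",
    lower_case=True,
)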

Scalar indices can only speed up scans for basic filters using equality, comparison, range (e.g. my_col BETWEEN 0 AND 100), and set membership (e.g. my_col IN (0, 1, 2)).

Scalar indices can be used if the filter contains multiple indexed columns and the filter criteria are AND'd or OR'd together (e.g. my_col < 0 AND other_col > 100).

Scalar indices may be used if the filter contains non-indexed columns but, depending on the structure of the filter, they may not be usable. For example, if the column not_indexed does not have a scalar index then the filter my_col = 0 OR not_indexed = 1 will not be able to use any scalar index on my_col.

To determine if a scan is making use of a scalar index you can use explain_plan to look at the query plan that lance has created. Queries that use scalar indices will either have a ScalarIndexQuery relation or a MaterializeIndex operator.
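For instance (a sketch; the path and filter are illustrative):

import lance

dataset = lance.dataset("/tmp/images.lance")
# Look for ScalarIndexQuery or MaterializeIndex in the printed plan.
print(dataset.scanner(filter="my_col != 7").explain_plan(verbose=True))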

LanceDataset.drop_index(name: str)

Drops an index from the dataset

Note: Indices are dropped by “index name”. This is not the same as the field name. If you did not specify a name when you created the index then a name was generated for you. You can use the list_indices method to get the names of the indices.
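For example (a sketch; the path and index name are hypothetical, and the "name" key of each list_indices() entry is assumed here):

import lance

dataset = lance.dataset("/tmp/images.lance")
for idx in dataset.list_indices():
    print(idx["name"])            # discover generated index names
dataset.drop_index("my_col_idx")  # drop by index name, not field name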

LanceDataset.scanner: see the full signature and parameter descriptions under Basic IOs above.

API Reference

More information can be found in the API reference.