lance.LanceDataset.scanner - Lance documentation

Return a Scanner that can support various pushdowns.

Parameters:

columns : list of str, or dict of str to str, default None

List of column names to be fetched, or a dictionary mapping column names to SQL expressions. All columns are fetched if None or unspecified.

filter : pa.compute.Expression or str

Expression or str that is a valid SQL where clause. See Lance filter pushdown for valid SQL expressions.
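As a minimal sketch of how columns and filter combine, the snippet below projects one pass-through column and one computed column while filtering rows with a SQL string. The column names (id, name, age) and the helper scan_adults are hypothetical, and dataset is assumed to be an already opened lance.LanceDataset.

```python
# Hypothetical column names ("id", "name", "age"); adjust to your schema.
PROJECTION = {
    "id": "id",                   # pass an existing column through unchanged
    "name_upper": "upper(name)",  # computed column defined by a SQL expression
}
ROW_FILTER = "age >= 18 AND name IS NOT NULL"


def scan_adults(dataset):
    """Scan `dataset` (assumed to be a lance.LanceDataset) into a table."""
    scanner = dataset.scanner(columns=PROJECTION, filter=ROW_FILTER)
    return scanner.to_table()
```

Passing a plain list such as ["id", "name"] instead of the dictionary fetches those columns without computing anything.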

limit : int, default None

Fetch up to this many rows. All rows if None or unspecified.

offset : int, default None

Fetch starting with this row. 0 if None or unspecified.
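limit and offset together give simple pagination. The helper below is a hypothetical convenience (not part of Lance) that converts a zero-based page number into the two keyword arguments.

```python
def page_args(page, page_size):
    """Translate a zero-based page number into limit/offset keyword arguments."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return {"limit": page_size, "offset": page * page_size}


# Usage, assuming `dataset` is an open lance.LanceDataset:
# table = dataset.scanner(**page_args(page=2, page_size=50)).to_table()
```

Here page 2 with a page size of 50 skips the first 100 rows and fetches up to 50 more.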

nearest : dict, default None

Get the rows corresponding to the K most similar vectors. Example:

{
    "column": <embedding col name>,
    "q": <query vector as pa.Float32Array>,
    "k": 10,
    "nprobes": 1,
    "refine_factor": 1
}
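The dictionary can also be assembled programmatically. The sketch below uses a hypothetical helper, nearest_query, whose keys and defaults simply mirror the example above; the column name "vector" in the usage comment is an assumption as well.

```python
def nearest_query(column, q, k=10, nprobes=1, refine_factor=1):
    """Build the dict passed as `nearest=`; keys mirror the documented example."""
    return {
        "column": column,
        "q": q,
        "k": k,
        "nprobes": nprobes,
        "refine_factor": refine_factor,
    }


# Usage, assuming pyarrow is installed and `dataset` is a lance.LanceDataset:
# import pyarrow as pa
# q = pa.array([0.1, 0.2, 0.3], type=pa.float32())
# table = dataset.scanner(nearest=nearest_query("vector", q, k=5)).to_table()
```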

batch_size : int, default None

The target size of batches returned. This is a target, not a guarantee: in some cases batches can be up to twice this size (but never larger than twice), and in some cases they can be smaller.

io_buffer_size : int, default None

The size of the IO buffer. See ScannerBuilder.io_buffer_size for more information.

batch_readahead : int, optional

The number of batches to read ahead.

fragment_readahead : int, optional

The number of fragments to read ahead.

scan_in_order : bool, default True

Whether to read the fragments and batches in order. If false, throughput may be higher, but batches will be returned out of order and memory use might increase.

fragments : iterable of LanceFragment, default None

If specified, only scan these fragments. If scan_in_order is True, then the fragments will be scanned in the order given.

prefilter : bool, default False

If True then the filter will be applied before the vector query is run. This will generate more accurate results but it may be a more costly query. It’s generally a good choice when the filter is highly selective.

If False then the filter will be applied after the vector query is run. This will perform well but the results may have fewer than the requested number of rows (or be empty) if the rows closest to the query do not match the filter. It’s generally good when the filter is not very selective.
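The trade-off can be tried out by running the same filtered vector search with both settings. This is a sketch only: filtered_search is a hypothetical helper, "vector" is an assumed embedding column name, and dataset is assumed to be a lance.LanceDataset.

```python
def filtered_search(dataset, q, row_filter, k=10, prefilter=True):
    """Run a vector search combined with a SQL filter.

    prefilter=True  -> filter first, then search among the matching rows.
    prefilter=False -> search first, then drop hits that fail the filter,
                       so fewer than k rows (or none) may come back.
    """
    scanner = dataset.scanner(
        nearest={"column": "vector", "q": q, "k": k},  # "vector" is assumed
        filter=row_filter,
        prefilter=prefilter,
    )
    return scanner.to_table()
```

Comparing the row counts of the two results for a selective filter is a quick way to see whether post-filtering is discarding too many neighbours.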

use_scalar_index : bool, default True

Lance will automatically use scalar indices to optimize a query. In some corner cases this can make query performance worse and this parameter can be used to disable scalar indices in these cases.

late_materialization : bool or List[str], default None

Allows custom control over late materialization. Late materialization fetches non-query columns using a take operation after the filter. This is useful when there are few results or columns are very large.

Early materialization can be better when there are many results or the columns are very narrow.

If True, then all columns are late materialized. If False, then all columns are early materialized. If a list of strings, then only the columns in the list are late materialized.

The default uses a heuristic that assumes filters will select about 0.1% of the rows. If your filter is more selective (e.g. find by id) you may want to set this to True. If your filter is not very selective (e.g. matches 20% of the rows) you may want to set this to False.
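The guidance above can be turned into a small decision helper. This is only an illustration: choose_late_materialization is a hypothetical function, and the cut-offs (0.1% and 20%) are assumptions lifted from the examples in the text, not thresholds used by Lance itself.

```python
def choose_late_materialization(expected_selectivity):
    """Pick a late_materialization value from the expected fraction of rows kept.

    Returns True, False, or None (None leaves the library default in place).
    The cut-offs are illustrative assumptions, not Lance internals.
    """
    if not 0.0 <= expected_selectivity <= 1.0:
        raise ValueError("expected_selectivity must be between 0 and 1")
    if expected_selectivity < 0.001:   # more selective than the assumed 0.1%
        return True
    if expected_selectivity >= 0.2:    # filter keeps a large share of rows
        return False
    return None


# Usage, assuming `dataset` is a lance.LanceDataset:
# lm = choose_late_materialization(0.00001)
# dataset.scanner(filter="id = 42", late_materialization=lm)
```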

full_text_query : str or dict, optional

The query string to search for; results are ranked by BM25. For example, “hello world” matches documents containing “hello” or “world”. Alternatively, a dictionary with the following keys:

columns : list[str]
    The columns to search. Currently only a single column is supported in the columns list.
query : str
    The query string to search for.
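A short sketch of the dictionary form follows. The helper fts_query and the column name "body" are hypothetical; the single-element columns list reflects the restriction stated above.

```python
def fts_query(text, column):
    """Build the dict form of full_text_query for a single column."""
    return {"columns": [column], "query": text}


# Usage, assuming `dataset` is a lance.LanceDataset whose "body" column
# is set up for full-text search:
# table = dataset.scanner(full_text_query=fts_query("hello world", "body")).to_table()
```

Passing the bare string "hello world" instead of the dictionary is the shorter form when no column needs to be named.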

fast_search : bool, default False

If True, then the search will only be performed on the indexed data, which yields faster search time.

scan_stats_callback : Callable[[ScanStatistics], None], default None

A callback function that will be called with the scan statistics after the scan is complete. Errors raised by the callback will be logged but not re-raised.
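One way to use the callback is to capture the statistics object for later inspection. The factory below is a hypothetical helper; it stores whatever object it receives and makes no assumption about which attributes ScanStatistics exposes.

```python
def make_stats_recorder():
    """Return (callback, seen): the callback appends each stats object to seen."""
    seen = []

    def on_stats(stats):
        # Keep the body trivial: errors raised here are logged, not re-raised.
        seen.append(stats)

    return on_stats, seen


# Usage, assuming `dataset` is a lance.LanceDataset:
# on_stats, seen = make_stats_recorder()
# dataset.scanner(scan_stats_callback=on_stats).to_table()
# print(seen)  # populated once the scan has completed
```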

include_deleted_rows : bool, default False

If True, then rows that have been deleted, but are still present in the fragment, will be returned. These rows will have the _rowid column set to null. All other columns will reflect the value stored on disk and may not be null.

Note: if this is a search operation, or a take operation (including scalar indexed scans) then deleted rows cannot be returned.

Note

For now, if both filter and nearest are specified (and prefilter is left at its default of False), then:

nearest is executed first.
The results are filtered afterwards.

For debugging ANN results, you can choose to not use the index even if present by specifying use_index=False. For example, the following will always return exact KNN results:

dataset.to_table(nearest={
    "column": "vector",
    "k": 10,
    "q": <query vector>,
    "use_index": False
})
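When debugging, the exact results obtained with use_index=False can serve as ground truth for the indexed results. The sketch below computes recall from two lists of row identifiers; recall_at_k is a hypothetical helper and the "id" column in the usage comment is an assumption about the schema.

```python
def recall_at_k(ann_ids, exact_ids):
    """Fraction of the exact nearest neighbours that the ANN search also returned."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(ann_ids)) / len(exact)


# Usage, assuming `dataset` is a lance.LanceDataset with an "id" column
# and `q` is the query vector:
# ann = dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})
# exact = dataset.to_table(
#     nearest={"column": "vector", "k": 10, "q": q, "use_index": False}
# )
# recall = recall_at_k(ann.column("id").to_pylist(), exact.column("id").to_pylist())
```

A recall well below 1.0 suggests the index search parameters in the nearest dictionary are worth revisiting.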