Indexing and Searching
- LanceDataset.create_index(column: str | List[str], index_type: str, name: str | None = None, metric: str = 'L2', replace: bool = False, num_partitions: int | None = None, ivf_centroids: np.ndarray | pa.FixedSizeListArray | pa.FixedShapeTensorArray | None = None, pq_codebook: np.ndarray | pa.FixedSizeListArray | pa.FixedShapeTensorArray | None = None, num_sub_vectors: int | None = None, accelerator: str | 'torch.Device' | None = None, index_cache_size: int | None = None, shuffle_partition_batches: int | None = None, shuffle_partition_concurrency: int | None = None, ivf_centroids_file: str | None = None, precomputed_partition_dataset: str | None = None, storage_options: Dict[str, str] | None = None, filter_nan: bool = True, one_pass_ivfpq: bool = False, **kwargs) LanceDataset
Create index on column.
Experimental API
- Parameters:
column (str) – The column to be indexed.
index_type (str) – The type of the index. "IVF_PQ", "IVF_HNSW_PQ" and "IVF_HNSW_SQ" are supported now.
name (str, optional) – The index name. If not provided, it will be generated from the column name.
metric (str) – The distance metric type, i.e., “L2” (alias to “euclidean”), “cosine” or “dot” (dot product). Default is “L2”.
replace (bool) – Replace the existing index if it exists.
num_partitions (int, optional) – The number of partitions of IVF (Inverted File Index).
ivf_centroids (optional) – It can be either np.ndarray, pyarrow.FixedSizeListArray or pyarrow.FixedShapeTensorArray. A num_partitions x dimension array of existing K-means centroids for IVF clustering. If not provided, a new KMeans model will be trained.
pq_codebook (optional) – It can be np.ndarray, pyarrow.FixedSizeListArray, or pyarrow.FixedShapeTensorArray. A num_sub_vectors x (2 ^ nbits * dimensions // num_sub_vectors) array of K-means centroids for the PQ codebook.
Note: nbits is always 8 for now. If not provided, a new PQ model will be trained.
num_sub_vectors (int, optional) – The number of sub-vectors for PQ (Product Quantization).
accelerator (str or torch.Device, optional) – If set, use an accelerator to speed up the training process. Accepted accelerators: “cuda” (Nvidia GPU) and “mps” (Apple Silicon GPU). If not set, use the CPU.
index_cache_size (int, optional) – The size of the index cache in number of entries. Default value is 256.
shuffle_partition_batches (int, optional) –
The number of batches, using the row group size of the dataset, to include in each shuffle partition. Default value is 10240.
Assuming the row group size is 1024, each shuffle partition will hold 10240 * 1024 = 10,485,760 rows. By making this value smaller, this shuffle will consume less memory but will take longer to complete, and vice versa.
shuffle_partition_concurrency (int, optional) –
The number of shuffle partitions to process concurrently. Default value is 2.
By making this value smaller, this shuffle will consume less memory but will take longer to complete, and vice versa.
storage_options (optional, dict) – Extra options that make sense for a particular storage connection. This is used to store connection parameters like credentials, endpoint, etc.
filter_nan (bool) – Defaults to True. Setting this to False disables the null filter used for nullable columns, which gives a small speed boost. False is UNSAFE: it will cause a crash if any null/NaN values are present (and otherwise will not).
one_pass_ivfpq (bool) – Defaults to False. If enabled, index type must be “IVF_PQ”. Reduces disk IO.
kwargs – Parameters passed to the index building process.
SQ (Scalar Quantization) is available only for the IVF_HNSW_SQ index type. This quantization method reduces the memory usage of the index by mapping float vectors to integer vectors, where each integer has num_bits bits. Currently only 8 bits are supported.
- If index_type is “IVF_*”, then the following parameters are required: num_partitions
- If index_type contains “PQ”, then the following parameters are required: num_sub_vectors
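The two rules above can be checked before calling create_index. The following is a minimal sketch; `missing_required_params` is a hypothetical helper written for illustration, not part of the Lance API, and it encodes only the two rules stated here.

```python
from typing import Any, Dict, List


def missing_required_params(index_type: str, params: Dict[str, Any]) -> List[str]:
    """Return the required parameter names that are absent or None.

    Hypothetical helper (not a Lance function). It encodes only:
      - "IVF_*" index types require num_partitions
      - index types containing "PQ" require num_sub_vectors
    """
    required = []
    if index_type.startswith("IVF_"):
        required.append("num_partitions")
    if "PQ" in index_type:
        required.append("num_sub_vectors")
    return [name for name in required if params.get(name) is None]


# IVF_PQ needs both parameters; only one is supplied here.
print(missing_required_params("IVF_PQ", {"num_partitions": 256}))
# IVF_HNSW_SQ has no "PQ" in its name, so num_partitions is enough.
print(missing_required_params("IVF_HNSW_SQ", {"num_partitions": 256}))
```

Running such a check first gives a clear message before any training work starts.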
Optional parameters for IVF_PQ:
- ivf_centroids
Existing K-mean centroids for IVF clustering.
- num_bits
The number of bits for PQ (Product Quantization). Default is 8. Only 4, 8 are supported.
- index_file_version
The version of the index file. Default is “V3”.
- Optional parameters for IVF_HNSW_*:
- max_level
Int, the maximum number of levels in the graph.
- m
Int, the number of edges per node in the graph.
- ef_construction
Int, the number of nodes to examine during the construction.
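To make the scalar quantization used by IVF_HNSW_SQ more concrete, here is a generic min/max quantizer in plain Python. It is only an illustration of mapping floats to 8-bit integers. The function names and the choice of a single global value range are assumptions, and Lance's actual SQ implementation may differ.

```python
from typing import List


def sq_encode(vector: List[float], lo: float, hi: float, num_bits: int = 8) -> List[int]:
    """Map each float in [lo, hi] to an integer code in [0, 2**num_bits - 1]."""
    levels = (1 << num_bits) - 1
    scale = levels / (hi - lo) if hi > lo else 0.0
    codes = []
    for x in vector:
        clamped = min(max(x, lo), hi)  # values outside the range are clipped
        codes.append(round((clamped - lo) * scale))
    return codes


def sq_decode(codes: List[int], lo: float, hi: float, num_bits: int = 8) -> List[float]:
    """Approximate inverse of sq_encode: map integer codes back to floats."""
    levels = (1 << num_bits) - 1
    step = (hi - lo) / levels
    return [lo + c * step for c in codes]


original = [-1.0, -0.25, 0.5, 1.0]
codes = sq_encode(original, lo=-1.0, hi=1.0)
restored = sq_decode(codes, lo=-1.0, hi=1.0)
print(codes)
print(restored)
```

Each value is stored in one byte instead of a 4-byte float32, at the cost of a reconstruction error of at most half a quantization step for values inside the range.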
Examples
    import lance

    dataset = lance.dataset("/tmp/sift.lance")
    dataset.create_index(
        "vector",
        "IVF_PQ",
        num_partitions=256,
        num_sub_vectors=16
    )

    import lance

    dataset = lance.dataset("/tmp/sift.lance")
    dataset.create_index(
        "vector",
        "IVF_HNSW_SQ",
        num_partitions=256,
    )
Experimental Accelerator (GPU) support:
- accelerator: use GPU to train IVF partitions.
Only supports CUDA (Nvidia) or MPS (Apple) currently. Requires PyTorch to be installed.
    import lance

    dataset = lance.dataset("/tmp/sift.lance")
    dataset.create_index(
        "vector",
        "IVF_PQ",
        num_partitions=256,
        num_sub_vectors=16,
        accelerator="cuda"
    )
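The sizing formulas in the parameter descriptions can be worked through with a few lines of arithmetic. The helpers below are hypothetical (not Lance functions) and restate the pq_codebook shape formula and the shuffle partition row count. The 128-dimensional vector is an assumption chosen to match a SIFT-style dataset.

```python
from typing import Tuple


def pq_codebook_shape(dimension: int, num_sub_vectors: int, nbits: int = 8) -> Tuple[int, int]:
    """Shape from the pq_codebook description:
    num_sub_vectors x (2 ^ nbits * dimensions // num_sub_vectors)."""
    # Assumption for this sketch: the dimension splits evenly into sub-vectors.
    if dimension % num_sub_vectors != 0:
        raise ValueError("dimension must be divisible by num_sub_vectors")
    return (num_sub_vectors, (2 ** nbits) * dimension // num_sub_vectors)


def rows_per_shuffle_partition(shuffle_partition_batches: int = 10240,
                               row_group_size: int = 1024) -> int:
    """Rows held by one shuffle partition: batches times rows per batch."""
    return shuffle_partition_batches * row_group_size


print(pq_codebook_shape(dimension=128, num_sub_vectors=16))  # (16, 2048)
print(rows_per_shuffle_partition())  # 10485760
```

Halving shuffle_partition_batches halves the rows held per shuffle partition, which is the memory-versus-time trade-off described for that parameter.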
- LanceDataset.create_scalar_index(column: str, index_type: Literal['BTREE', 'BITMAP', 'LABEL_LIST', 'INVERTED', 'FTS', 'NGRAM'], name: str | None = None, *, replace: bool = True, **kwargs)
Create a scalar index on a column.
Scalar indices, like vector indices, can be used to speed up scans. A scalar index can speed up scans that contain filter expressions on the indexed column. For example, the following scan will be faster if the column my_col has a scalar index:
    import lance

    dataset = lance.dataset("/tmp/images.lance")
    my_table = dataset.scanner(filter="my_col != 7").to_table()
Vector search with pre-filters can also benefit from scalar indices. For example,
    import lance

    dataset = lance.dataset("/tmp/images.lance")
    my_table = dataset.scanner(
        nearest=dict(
            column="vector",
            q=[1, 2, 3, 4],
            k=10,
        ),
        filter="my_col != 7",
        prefilter=True
    ).to_table()
There are 5 types of scalar indices available today.
- BTREE. The most common type is BTREE. This index is inspired by the btree data structure, although only the first few layers of the btree are cached in memory. It will perform well on columns with a large number of unique values and few rows per value.
- BITMAP. This index stores a bitmap for each unique value in the column. This index is useful for columns with a small number of unique values and many rows per value.
- LABEL_LIST. A special index that is used to index list columns whose values have small cardinality. For example, a column that contains lists of tags (e.g. ["tag1", "tag2", "tag3"]) can be indexed with a LABEL_LIST index. This index can only speed up queries with array_has_any or array_has_all filters.
- NGRAM. A special index that is used to index string columns. This index creates a bitmap for each ngram in the string. By default we use trigrams. This index can currently speed up queries using the contains function in filters.
- FTS/INVERTED. It is used to index document columns. This index can conduct full-text searches, for example finding the rows of a column that contain any word of the query string “hello world”. The results will be ranked by BM25.
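To illustrate why a trigram index helps with contains filters, the sketch below uses the general n-gram technique: a value can only contain a substring if it has every trigram of that substring. This is a generic illustration with hypothetical function names. How Lance normalizes text before building NGRAM bitmaps is not described here and may differ.

```python
from typing import Set


def ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams in text (trigrams by default)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def may_contain(indexed_value: str, needle: str, n: int = 3) -> bool:
    """Candidate check: True if indexed_value has every n-gram of needle.

    This can produce false positives, so matching candidates still need to be
    re-checked against the actual string.
    """
    return ngrams(needle, n) <= ngrams(indexed_value, n)


print(sorted(ngrams("lance")))
print(may_contain("hello lance", "lance"))
print(may_contain("hello world", "lance"))
# All trigrams of "abcd" occur in "abcbcd", yet "abcd" is not a substring.
print(may_contain("abcbcd", "abcd"), "abcd" in "abcbcd")
```

The last line shows why an n-gram lookup is a pre-selection step: it narrows the rows to examine, and the exact substring test decides the final result.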
Note that the LANCE_BYPASS_SPILLING environment variable can be used to bypass spilling to disk. Setting this to true can avoid memory exhaustion issues (see https://github.com/apache/datafusion/issues/10073 for more info).
Experimental API
- Parameters:
column (str) – The column to be indexed. Must be a boolean, integer, float, or string column.
index_type (str) – The type of the index. One of "BTREE", "BITMAP", "LABEL_LIST", "NGRAM", "FTS" or "INVERTED".
name (str, optional) – The index name. If not provided, it will be generated from the column name.
replace (bool, default True) – Replace the existing index if it exists.
with_position (bool, default True) – This is for the INVERTED index. If True, the index will store the positions of the words in the document, so that you can conduct phrase queries. This will significantly increase the index size. It won’t impact the performance of non-phrase queries even if it is set to True.
base_tokenizer (str, default "simple") – This is for the INVERTED index. The base tokenizer to use. The value can be:
- “simple”: splits tokens on whitespace and punctuation.
- “whitespace”: splits tokens on whitespace.
- “raw”: no tokenization.
language (str, default "English") – This is for the INVERTED index. The language for stemming and stop words. This is only used when stem or remove_stop_words is true.
max_token_length (Optional[int], default 40) – This is for the INVERTED index. The maximum token length. Any token longer than this will be removed.
lower_case (bool, default True) – This is for the INVERTED index. If True, the index will convert all text to lowercase.
stem (bool, default False) – This is for the INVERTED index. If True, the index will stem the tokens.
remove_stop_words (bool, default False) – This is for the INVERTED index. If True, the index will remove stop words.
ascii_folding (bool, default False) – This is for the INVERTED index. If True, the index will convert non-ascii characters to ascii characters if possible. This would remove accents like “é” -> “e”.
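The effect of the INVERTED tokenizer options can be approximated in plain Python. The sketch below is a rough model of base_tokenizer, lower_case and max_token_length only. `tokenize` is a hypothetical function, the regular expression is an assumption about what counts as punctuation, and the real tokenizer may split differently.

```python
import re
from typing import List, Optional


def tokenize(
    text: str,
    base_tokenizer: str = "simple",
    lower_case: bool = True,
    max_token_length: Optional[int] = 40,
) -> List[str]:
    """Approximate the documented tokenizer options (illustration only)."""
    if base_tokenizer == "simple":
        # Split on whitespace and punctuation: keep runs of word characters.
        tokens = re.findall(r"\w+", text)
    elif base_tokenizer == "whitespace":
        tokens = text.split()
    elif base_tokenizer == "raw":
        tokens = [text]  # no tokenization: the whole value is one token
    else:
        raise ValueError(f"unknown base_tokenizer: {base_tokenizer!r}")
    if lower_case:
        tokens = [t.lower() for t in tokens]
    if max_token_length is not None:
        # Tokens longer than the limit are removed, not truncated.
        tokens = [t for t in tokens if len(t) <= max_token_length]
    return tokens


print(tokenize("Hello, World!"))
print(tokenize("Hello, World!", base_tokenizer="whitespace"))
print(tokenize("Hello, World!", base_tokenizer="raw"))
```

With the “whitespace” tokenizer the punctuation stays attached to the tokens, so a query for “hello” would not match the token “hello,” in this model.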
Examples
    import lance

    dataset = lance.dataset("/tmp/images.lance")
    dataset.create_scalar_index(
        "category",
        "BTREE",
    )
Scalar indices can only speed up scans for basic filters using equality, comparison, range (e.g. my_col BETWEEN 0 AND 100), and set membership (e.g. my_col IN (0, 1, 2)).
Scalar indices can be used if the filter contains multiple indexed columns and the filter criteria are AND’d or OR’d together (e.g. my_col < 0 AND other_col > 100).
Scalar indices may be used if the filter contains non-indexed columns but, depending on the structure of the filter, they may not be usable. For example, if the column not_indexed does not have a scalar index then the filter my_col = 0 OR not_indexed = 1 will not be able to use any scalar index on my_col.
To determine if a scan is making use of a scalar index you can use explain_plan to look at the query plan that lance has created. Queries that use scalar indices will either have a ScalarIndexQuery relation or a MaterializeIndex operator.
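The explain_plan check described above can be wrapped in a small helper. This is a sketch: `uses_scalar_index` is a hypothetical function, it assumes the plan is available as a string, and the sample plan texts are made up for illustration rather than copied from real Lance output.

```python
def uses_scalar_index(plan: str) -> bool:
    """Return True if the plan text mentions one of the scalar-index operators."""
    return "ScalarIndexQuery" in plan or "MaterializeIndex" in plan


# With a real dataset (requires the lance package), the plan text would come
# from something like:
#   plan = dataset.scanner(filter="my_col = 7").explain_plan()

# Made-up plan snippets, used only to exercise the helper.
indexed_plan = "Take ... MaterializeIndex: query=my_col = 7"
full_scan_plan = "FilterExec: my_col = 7 ... LanceScan: ..."

print(uses_scalar_index(indexed_plan))
print(uses_scalar_index(full_scan_plan))
```

A check like this is convenient in tests that guard against a filter silently falling back to a full scan.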
- LanceDataset.drop_index(name: str)
Drops an index from the dataset
Note: Indices are dropped by “index name”. This is not the same as the field name. If you did not specify a name when you created the index then a name was generated for you. You can use the list_indices method to get the names of the indices.
- LanceDataset.scanner(columns: List[str] | Dict[str, str] | None = None, filter: str | Expression | None = None, limit: int | None = None, offset: int | None = None, nearest: dict | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, scan_in_order: bool | None = None, fragments: Iterable[LanceFragment] | None = None, full_text_query: str | dict | FullTextQuery | None = None, *, prefilter: bool | None = None, with_row_id: bool | None = None, with_row_address: bool | None = None, use_stats: bool | None = None, fast_search: bool | None = None, io_buffer_size: int | None = None, late_materialization: bool | List[str] | None = None, use_scalar_index: bool | None = None, include_deleted_rows: bool | None = None, scan_stats_callback: Callable[[ScanStatistics], None] | None = None, strict_batch_size: bool | None = None) LanceScanner
Return a Scanner that can support various pushdowns.
- Parameters:
columns (list of str, or dict of str to str default None) – List of column names to be fetched. Or a dictionary of column names to SQL expressions. All columns are fetched if None or unspecified.
filter (pa.compute.Expression or str) – Expression or str that is a valid SQL where clause. See Lance filter pushdown for valid SQL expressions.
limit (int, default None) – Fetch up to this many rows. All rows if None or unspecified.
offset (int, default None) – Fetch starting with this row. 0 if None or unspecified.
nearest (dict, default None) –
Get the rows corresponding to the K most similar vectors. Example:
    {
        "column": <embedding col name>,
        "q": <query vector as pa.Float32Array>,
        "k": 10,
        "minimum_nprobes": 20,
        "maximum_nprobes": 50,
        "refine_factor": 1
    }
batch_size (int, default None) – The target size of batches returned. In some cases batches can be up to twice this size (but never larger than that). In some cases batches can be smaller than this size.
io_buffer_size (int, default None) – The size of the IO buffer. See ScannerBuilder.io_buffer_size for more information.
batch_readahead (int, optional) – The number of batches to read ahead.
fragment_readahead (int, optional) – The number of fragments to read ahead.
scan_in_order (bool, default True) – Whether to read the fragments and batches in order. If false, throughput may be higher, but batches will be returned out of order and memory use might increase.
fragments (iterable of LanceFragment, default None) – If specified, only scan these fragments. If scan_in_order is True, then the fragments will be scanned in the order given.
prefilter (bool, default False) –
If True then the filter will be applied before the vector query is run. This will generate more correct results but it may be a more costly query. It’s generally good when the filter is highly selective.
If False then the filter will be applied after the vector query is run. This will perform well but the results may have fewer than the requested number of rows (or be empty) if the rows closest to the query do not match the filter. It’s generally good when the filter is not very selective.
use_scalar_index (bool, default True) – Lance will automatically use scalar indices to optimize a query. In some corner cases this can make query performance worse and this parameter can be used to disable scalar indices in these cases.
late_materialization (bool or List[str], default None) –
Allows custom control over late materialization. Late materialization fetches non-query columns using a take operation after the filter. This is useful when there are few results or columns are very large.
Early materialization can be better when there are many results or the columns are very narrow.
If True, then all columns are late materialized. If False, then all columns are early materialized. If a list of strings, then only the columns in the list are late materialized.
The default uses a heuristic that assumes filters will select about 0.1% of the rows. If your filter is more selective (e.g. find by id) you may want to set this to True. If your filter is not very selective (e.g. matches 20% of the rows) you may want to set this to False.
full_text_query (str or dict, optional) –
The query string to search for; the results will be ranked by BM25. For example, “hello world” would match documents containing “hello” or “world”. Alternatively, a dictionary with the following keys:
- columns: list[str]
The columns to search, currently only supports a single column in the columns list.
- query: str
The query string to search for.
fast_search (bool, default False) – If True, then the search will only be performed on the indexed data, which yields faster search time.
scan_stats_callback (Callable[[ScanStatistics], None], default None) – A callback function that will be called with the scan statistics after the scan is complete. Errors raised by the callback will be logged but not re-raised.
include_deleted_rows (bool, default False) –
If True, then rows that have been deleted, but are still present in the fragment, will be returned. These rows will have the _rowid column set to null. All other columns will reflect the value stored on disk and may not be null.
Note: if this is a search operation, or a take operation (including scalar indexed scans) then deleted rows cannot be returned.
Note
For now, if both filter and nearest are specified and prefilter is left at its default of False, then:
nearest is executed first.
The results are filtered afterwards.
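The consequence of filtering after the vector search, compared with prefilter=True, can be seen with a brute-force model in plain Python. This sketch does not use Lance. The row layout, function names and data are hypothetical, and it only mirrors the order of operations described for the prefilter parameter.

```python
from typing import Callable, List, Sequence, Tuple

Row = Tuple[int, Sequence[float], int]  # (row id, vector, my_col)


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance, sufficient for ranking."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn(rows: List[Row], q: Sequence[float], k: int) -> List[Row]:
    """Exact k nearest neighbours by brute force."""
    return sorted(rows, key=lambda r: l2(r[1], q))[:k]


def post_filter_search(rows: List[Row], q, k: int, keep: Callable[[Row], bool]) -> List[Row]:
    """prefilter=False: run the vector search first, then apply the filter."""
    return [r for r in knn(rows, q, k) if keep(r)]


def pre_filter_search(rows: List[Row], q, k: int, keep: Callable[[Row], bool]) -> List[Row]:
    """prefilter=True: apply the filter first, then run the vector search."""
    return knn([r for r in rows if keep(r)], q, k)


rows = [
    (0, [0.0, 0.0], 7),
    (1, [0.1, 0.0], 7),
    (2, [1.0, 1.0], 3),
    (3, [2.0, 2.0], 4),
]
q = [0.0, 0.0]


def keep(r: Row) -> bool:
    return r[2] != 7  # same idea as filter="my_col != 7"


post = post_filter_search(rows, q, 2, keep)
pre = pre_filter_search(rows, q, 2, keep)
print([r[0] for r in post])  # []
print([r[0] for r in pre])   # [2, 3]
```

The two closest rows both fail the filter, so the post-filtered search returns nothing while the pre-filtered search still returns two rows.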
For debugging ANN results, you can choose to not use the index even if present by specifying use_index=False. For example, the following will always return exact KNN results:
    dataset.to_table(nearest={
        "column": "vector",
        "k": 10,
        "q": <query vector>,
        "use_index": False
    })
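One common use of the exact results is to measure the recall of the indexed search. The helper below is a hypothetical sketch that compares two lists of row identifiers. How the identifiers are obtained (for example by requesting row ids from both queries) is left as an assumption.

```python
from typing import Iterable


def recall_at_k(ann_ids: Iterable[int], exact_ids: Iterable[int]) -> float:
    """Fraction of the exact nearest neighbours that the ANN search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0  # nothing to find, so nothing was missed
    return len(exact & set(ann_ids)) / len(exact)


# Assumed workflow: run the same query twice, once with the index and once with
# "use_index": False, then compare the row ids of the two result sets.
ann_ids = [1, 2, 3, 9]     # illustrative ids from the indexed search
exact_ids = [1, 2, 3, 4]   # illustrative ids from the exact search
print(recall_at_k(ann_ids, exact_ids))  # 0.75
```

If recall is lower than needed, the nprobes and refine_factor settings in the nearest dictionary are the usual knobs to revisit.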