-
lance.LanceDataset.create_scalar_index(column: str, index_type: 'BTREE' | 'BITMAP' | 'LABEL_LIST' | 'INVERTED' | 'FTS' | 'NGRAM', name: str | None =
None
, *, replace: bool =True
, **kwargs) Create a scalar index on a column.
Scalar indices, like vector indices, can be used to speed up scans. A scalar index can speed up scans that contain filter expressions on the indexed column. For example, the following scan will be faster if the column
my_col
has a scalar index:import lance dataset = lance.dataset("/tmp/images.lance") my_table = dataset.scanner(filter="my_col != 7").to_table()
Vector search with pre-filers can also benefit from scalar indices. For example,
import lance dataset = lance.dataset("/tmp/images.lance") my_table = dataset.scanner( nearest=dict( column="vector", q=[1, 2, 3, 4], k=10, ) filter="my_col != 7", prefilter=True )
There are 5 types of scalar indices available today.
BTREE
. The most common type isBTREE
. This index is inspired by the btree data structure although only the first few layers of the btree are cached in memory. It will perform well on columns with a large number of unique values and few rows per value.BITMAP
. This index stores a bitmap for each unique value in the column. This index is useful for columns with a small number of unique values and many rows per value.LABEL_LIST
. A special index that is used to index list columns whose values have small cardinality. For example, a column that contains lists of tags (e.g.["tag1", "tag2", "tag3"]
) can be indexed with aLABEL_LIST
index. This index can only speedup queries witharray_has_any
orarray_has_all
filters.NGRAM
. A special index that is used to index string columns. This index creates a bitmap for each ngram in the string. By default we use trigrams. This index can currently speed up queries using thecontains
function in filters.FTS/INVERTED
. It is used to index document columns. This index can conduct full-text searches. For example, a column that contains any word of query string “hello world”. The results will be ranked by BM25.
Note that the
LANCE_BYPASS_SPILLING
environment variable can be used to bypass spilling to disk. Setting this to true can avoid memory exhaustion issues (see https://github.com/apache/datafusion/issues/10073 for more info).Experimental API
- Parameters:
- column : str¶
The column to be indexed. Must be a boolean, integer, float, or string column.
- index_type : str¶
The type of the index. One of
"BTREE"
,"BITMAP"
,"LABEL_LIST"
,"NGRAM"
,"FTS"
or"INVERTED"
.- name : str, optional¶
The index name. If not provided, it will be generated from the column name.
- replace : bool, default True¶
Replace the existing index if it exists.
- with_position : bool, default True
This is for the
INVERTED
index. If True, the index will store the positions of the words in the document, so that you can conduct phrase query. This will significantly increase the index size. It won’t impact the performance of non-phrase queries even if it is set to True.- base_tokenizer : str, default "simple"
This is for the
INVERTED
index. The base tokenizer to use. The value can be: * “simple”: splits tokens on whitespace and punctuation. * “whitespace”: splits tokens on whitespace. * “raw”: no tokenization.- language : str, default "English"
This is for the
INVERTED
index. The language for stemming and stop words. This is only used when stem or remove_stop_words is true- max_token_length : Optional[int], default 40
This is for the
INVERTED
index. The maximum token length. Any token longer than this will be removed.- lower_case : bool, default True
This is for the
INVERTED
index. If True, the index will convert all text to lowercase.- stem : bool, default False
This is for the
INVERTED
index. If True, the index will stem the tokens.- remove_stop_words : bool, default False
This is for the
INVERTED
index. If True, the index will remove stop words.- ascii_folding : bool, default False
This is for the
INVERTED
index. If True, the index will convert non-ascii characters to ascii characters if possible. This would remove accents like “é” -> “e”.
Examples
import lance dataset = lance.dataset("/tmp/images.lance") dataset.create_index( "category", "BTREE", )
Scalar indices can only speed up scans for basic filters using equality, comparison, range (e.g.
my_col BETWEEN 0 AND 100
), and set membership (e.g. my_col IN (0, 1, 2))Scalar indices can be used if the filter contains multiple indexed columns and the filter criteria are AND’d or OR’d together (e.g.
my_col < 0 AND other_col> 100
)Scalar indices may be used if the filter contains non-indexed columns but, depending on the structure of the filter, they may not be usable. For example, if the column
not_indexed
does not have a scalar index then the filtermy_col = 0 OR not_indexed = 1
will not be able to use any scalar index onmy_col
.To determine if a scan is making use of a scalar index you can use
explain_plan
to look at the query plan that lance has created. Queries that use scalar indices will either have aScalarIndexQuery
relation or aMaterializeIndex
operator.