lance.LanceDataset.create_scalar_index

Create a scalar index on a column.

Scalar indices, like vector indices, can be used to speed up scans. A scalar index can speed up scans that contain filter expressions on the indexed column. For example, the following scan will be faster if the column my_col has a scalar index:

import lance

dataset = lance.dataset("/tmp/images.lance")
my_table = dataset.scanner(filter="my_col != 7").to_table()

Vector search with pre-filers can also benefit from scalar indices. For example,

import lance

dataset = lance.dataset("/tmp/images.lance")
my_table = dataset.scanner(
    nearest=dict(
       column="vector",
       q=[1, 2, 3, 4],
       k=10,
    )
    filter="my_col != 7",
    prefilter=True
)

There are 5 types of scalar indices available today.

BTREE. The most common type is BTREE. This index is inspired by the btree data structure although only the first few layers of the btree are cached in memory. It will perform well on columns with a large number of unique values and few rows per value.
BITMAP. This index stores a bitmap for each unique value in the column. This index is useful for columns with a small number of unique values and many rows per value.
LABEL_LIST. A special index that is used to index list columns whose values have small cardinality. For example, a column that contains lists of tags (e.g. ["tag1", "tag2", "tag3"]) can be indexed with a LABEL_LIST index. This index can only speedup queries with array_has_any or array_has_all filters.
NGRAM. A special index that is used to index string columns. This index creates a bitmap for each ngram in the string. By default we use trigrams. This index can currently speed up queries using the contains function in filters.
FTS/INVERTED. It is used to index document columns. This index can conduct full-text searches. For example, a column that contains any word of query string “hello world”. The results will be ranked by BM25.

Note that the LANCE_BYPASS_SPILLING environment variable can be used to bypass spilling to disk. Setting this to true can avoid memory exhaustion issues (see https://github.com/apache/datafusion/issues/10073 for more info).

Experimental API

Parameters:

column : str¶: The column to be indexed. Must be a boolean, integer, float, or string column.
index_type : str¶: The type of the index. One of "BTREE", "BITMAP", "LABEL_LIST", "NGRAM", "FTS" or "INVERTED".
name : str, optional¶: The index name. If not provided, it will be generated from the column name.
replace : bool, default True¶: Replace the existing index if it exists.
with_position : bool, default True: This is for the INVERTED index. If True, the index will store the positions of the words in the document, so that you can conduct phrase query. This will significantly increase the index size. It won’t impact the performance of non-phrase queries even if it is set to True.
base_tokenizer : str, default "simple": This is for the INVERTED index. The base tokenizer to use. The value can be: * “simple”: splits tokens on whitespace and punctuation. * “whitespace”: splits tokens on whitespace. * “raw”: no tokenization.
language : str, default "English": This is for the INVERTED index. The language for stemming and stop words. This is only used when stem or remove_stop_words is true
max_token_length : Optional[int], default 40: This is for the INVERTED index. The maximum token length. Any token longer than this will be removed.
lower_case : bool, default True: This is for the INVERTED index. If True, the index will convert all text to lowercase.
stem : bool, default False: This is for the INVERTED index. If True, the index will stem the tokens.
remove_stop_words : bool, default False: This is for the INVERTED index. If True, the index will remove stop words.
ascii_folding : bool, default False: This is for the INVERTED index. If True, the index will convert non-ascii characters to ascii characters if possible. This would remove accents like “é” -> “e”.

Examples

import lance

dataset = lance.dataset("/tmp/images.lance")
dataset.create_index(
    "category",
    "BTREE",
)

Scalar indices can only speed up scans for basic filters using equality, comparison, range (e.g. my_col BETWEEN 0 AND 100), and set membership (e.g. my_col IN (0, 1, 2))

Scalar indices can be used if the filter contains multiple indexed columns and the filter criteria are AND’d or OR’d together (e.g. my_col < 0 AND other_col> 100)

Scalar indices may be used if the filter contains non-indexed columns but, depending on the structure of the filter, they may not be usable. For example, if the column not_indexed does not have a scalar index then the filter my_col = 0 OR not_indexed = 1 will not be able to use any scalar index on my_col.

To determine if a scan is making use of a scalar index you can use explain_plan to look at the query plan that lance has created. Queries that use scalar indices will either have a ScalarIndexQuery relation or a MaterializeIndex operator.