Indexing and Searching

LanceDataset.create_index(column: str | List[str], index_type: str, name: str | None = None, metric: str = 'L2', replace: bool = False, num_partitions: int | None = None, ivf_centroids: np.ndarray | pa.FixedSizeListArray | pa.FixedShapeTensorArray | None = None, pq_codebook: np.ndarray | pa.FixedSizeListArray | pa.FixedShapeTensorArray | None = None, num_sub_vectors: int | None = None, accelerator: str | 'torch.Device' | None = None, index_cache_size: int | None = None, shuffle_partition_batches: int | None = None, shuffle_partition_concurrency: int | None = None, ivf_centroids_file: str | None = None, precomputed_partition_dataset: str | None = None, storage_options: Dict[str, str] | None = None, filter_nan: bool = True, one_pass_ivfpq: bool = False, **kwargs) → LanceDataset

Create an index on a column.

Experimental API

Parameters:
  • column (str) – The column to be indexed.

  • index_type (str) – The type of the index. "IVF_PQ", "IVF_HNSW_PQ" and "IVF_HNSW_SQ" are currently supported.

  • name (str, optional) – The index name. If not provided, it will be generated from the column name.

  • metric (str) – The distance metric type, one of “L2” (alias of “euclidean”), “cosine” or “dot” (dot product). Default is “L2”.

  • replace (bool) – Replace the existing index if it exists.

  • num_partitions (int, optional) – The number of partitions of IVF (Inverted File Index).

  • ivf_centroids (optional) – It can be np.ndarray, pyarrow.FixedSizeListArray or pyarrow.FixedShapeTensorArray. A num_partitions x dimension array of existing k-means centroids for IVF clustering. If not provided, a new k-means model will be trained.

  • pq_codebook (optional) –

    It can be np.ndarray, pyarrow.FixedSizeListArray, or pyarrow.FixedShapeTensorArray. A num_sub_vectors x (2 ^ nbits * dimensions // num_sub_vectors) array of k-means centroids for the PQ codebook.

    Note: nbits is always 8 for now. If not provided, a new PQ model will be trained.

  • num_sub_vectors (int, optional) – The number of sub-vectors for PQ (Product Quantization).

  • accelerator (str or torch.Device, optional) – If set, use an accelerator to speed up the training process. Accepted accelerator: “cuda” (Nvidia GPU) and “mps” (Apple Silicon GPU). If not set, use the CPU.

  • index_cache_size (int, optional) – The size of the index cache in number of entries. Default value is 256.

  • shuffle_partition_batches (int, optional) –

    The number of batches, using the row group size of the dataset, to include in each shuffle partition. Default value is 10240.

    Assuming the row group size is 1024, each shuffle partition will hold 10240 * 1024 = 10,485,760 rows. By making this value smaller, this shuffle will consume less memory but will take longer to complete, and vice versa.

  • shuffle_partition_concurrency (int, optional) –

    The number of shuffle partitions to process concurrently. Default value is 2.

    By making this value smaller, this shuffle will consume less memory but will take longer to complete, and vice versa.

  • storage_options (optional, dict) – Extra options that make sense for a particular storage connection. This is used to store connection parameters like credentials, endpoint, etc.

  • filter_nan (bool) – Defaults to True. Setting this to False disables the null filter used for nullable columns, which gives a small speed boost. False is UNSAFE: it will cause a crash if any null/NaN values are present (and will not crash otherwise).

  • one_pass_ivfpq (bool) – Defaults to False. If enabled, index type must be “IVF_PQ”. Reduces disk IO.

  • kwargs – Parameters passed to the index building process.
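The sizing arithmetic given above for pq_codebook and shuffle_partition_batches can be checked with a few lines of plain Python. This is only a sketch: the helper names are hypothetical and not part of Lance, the 128-dimensional vectors are an assumed example, and the sketch assumes the vector dimension is a multiple of num_sub_vectors.

```python
def pq_codebook_shape(dimension: int, num_sub_vectors: int, nbits: int = 8) -> tuple[int, int]:
    """Codebook shape as described above:
    num_sub_vectors x (2 ^ nbits * dimension // num_sub_vectors).

    Hypothetical helper for illustration; not a Lance API.
    """
    if dimension % num_sub_vectors != 0:
        # Assumption of this sketch: sub-vectors split the vector evenly.
        raise ValueError("dimension must be a multiple of num_sub_vectors")
    sub_vector_dim = dimension // num_sub_vectors
    return (num_sub_vectors, (2 ** nbits) * sub_vector_dim)


def rows_per_shuffle_partition(shuffle_partition_batches: int = 10240,
                               row_group_size: int = 1024) -> int:
    """Rows held by one shuffle partition: batches x rows per batch."""
    return shuffle_partition_batches * row_group_size


# Assumed example: 128-dimensional vectors split into 16 sub-vectors.
shape = pq_codebook_shape(dimension=128, num_sub_vectors=16)
rows = rows_per_shuffle_partition()
print(shape, rows)
```

Halving shuffle_partition_batches halves the rows held per partition, which is the memory versus time trade-off described above.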

SQ (Scalar Quantization) is available only for the IVF_HNSW_SQ index type. This quantization method reduces the memory usage of the index by mapping float vectors to integer vectors, where each integer has num_bits bits. Only 8 bits are supported for now.
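To make the idea of mapping floats to 8-bit integers concrete, here is a minimal sketch of scalar quantization in plain Python. It uses a simple min/max linear mapping, which is an assumption made for illustration; Lance's actual SQ implementation may choose bounds and rounding differently. The function names are hypothetical.

```python
def sq_encode(vector, lo, hi, num_bits=8):
    """Map each float in [lo, hi] to an integer code in [0, 2^num_bits - 1].

    Assumes hi > lo. Values outside the range are clamped.
    """
    levels = (1 << num_bits) - 1
    scale = (hi - lo) / levels
    codes = []
    for x in vector:
        clamped = min(max(x, lo), hi)
        codes.append(round((clamped - lo) / scale))
    return codes


def sq_decode(codes, lo, hi, num_bits=8):
    """Approximate reconstruction of the original floats from the codes."""
    levels = (1 << num_bits) - 1
    scale = (hi - lo) / levels
    return [lo + c * scale for c in codes]


vector = [-1.0, -0.5, 0.0, 0.5, 1.0]
codes = sq_encode(vector, lo=-1.0, hi=1.0)
approx = sq_decode(codes, lo=-1.0, hi=1.0)
print(codes)
print(approx)
```

Each value is stored in one byte instead of a 4-byte float, at the cost of a reconstruction error of at most half a quantization step.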

If index_type is “IVF_*”, then the following parameters are required:

  • num_partitions

If index_type contains “PQ”, then the following parameters are required:

  • num_sub_vectors
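The two rules above can be expressed as a small pre-flight check before calling create_index. This is a sketch only: required_index_params and missing_index_params are hypothetical helpers, not Lance functions, and they encode nothing beyond the two rules stated above.

```python
def required_index_params(index_type: str) -> list[str]:
    """Return the parameter names the rules above mark as required."""
    required = []
    if index_type.startswith("IVF_"):
        required.append("num_partitions")
    if "PQ" in index_type:
        required.append("num_sub_vectors")
    return required


def missing_index_params(index_type: str, **params) -> list[str]:
    """Return required parameters that were not supplied (or are None)."""
    return [name for name in required_index_params(index_type)
            if params.get(name) is None]


print(required_index_params("IVF_PQ"))
print(required_index_params("IVF_HNSW_SQ"))
print(missing_index_params("IVF_PQ", num_partitions=256))
```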

Optional parameters for IVF_PQ:

  • ivf_centroids

    Existing k-means centroids for IVF clustering.

  • num_bits

    The number of bits for PQ (Product Quantization). Default is 8. Only 4, 8 are supported.

  • index_file_version

    The version of the index file. Default is “V3”.

Optional parameters for IVF_HNSW_*:

  • max_level

    Int, the maximum number of levels in the graph.

  • m

    Int, the number of edges per node in the graph.

  • ef_construction

    Int, the number of nodes to examine during the construction.

Examples

import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_PQ",
    num_partitions=256,
    num_sub_vectors=16
)

import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_HNSW_SQ",
    num_partitions=256,
)

Experimental Accelerator (GPU) support:

  • accelerator: use GPU to train IVF partitions.

    Only supports CUDA (Nvidia) or MPS (Apple) currently. Requires PyTorch to be installed.

import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_PQ",
    num_partitions=256,
    num_sub_vectors=16,
    accelerator="cuda"
)

LanceDataset.create_scalar_index(column: str, index_type: Literal['BTREE', 'BITMAP', 'LABEL_LIST', 'INVERTED', 'FTS', 'NGRAM'], name: str | None = None, *, replace: bool = True, **kwargs)

Create a scalar index on a column.

Scalar indices, like vector indices, can be used to speed up scans. A scalar index can speed up scans that contain filter expressions on the indexed column. For example, the following scan will be faster if the column my_col has a scalar index:

import lance

dataset = lance.dataset("/tmp/images.lance")
my_table = dataset.scanner(filter="my_col != 7").to_table()

Vector search with pre-filters can also benefit from scalar indices. For example,

import lance

dataset = lance.dataset("/tmp/images.lance")
my_table = dataset.scanner(
    nearest=dict(
        column="vector",
        q=[1, 2, 3, 4],
        k=10,
    ),
    filter="my_col != 7",
    prefilter=True,
).to_table()

There are 5 types of scalar indices available today.

  • BTREE. The most common type is BTREE. This index is inspired by the btree data structure although only the first few layers of the btree are cached in memory. It will perform well on columns with a large number of unique values and few rows per value.

  • BITMAP. This index stores a bitmap for each unique value in the column. This index is useful for columns with a small number of unique values and many rows per value.

  • LABEL_LIST. A special index that is used to index list columns whose values have small cardinality. For example, a column that contains lists of tags (e.g. ["tag1", "tag2", "tag3"]) can be indexed with a LABEL_LIST index. This index can only speed up queries with array_has_any or array_has_all filters.

  • NGRAM. A special index that is used to index string columns. This index creates a bitmap for each ngram in the string. By default we use trigrams. This index can currently speed up queries using the contains function in filters.

  • FTS/INVERTED. Used to index document columns. This index supports full-text search, for example finding the rows whose column contains any word of the query string “hello world”. The results are ranked by BM25.
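To illustrate why a BITMAP index suits columns with few unique values, here is a toy version in plain Python. It is a conceptual sketch, not Lance's implementation: Python sets stand in for compressed bitmaps, and all function names are hypothetical.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each unique value to the set of row ids that hold it."""
    index = defaultdict(set)
    for row_id, value in enumerate(values):
        index[value].add(row_id)
    return dict(index)


def rows_equal(index, value):
    """Row ids matching `col = value`, without scanning the column."""
    return sorted(index.get(value, set()))


def rows_in(index, candidates):
    """Row ids matching `col IN (...)` as a union of per-value sets."""
    matched = set()
    for value in candidates:
        matched |= index.get(value, set())
    return sorted(matched)


categories = ["cat", "dog", "cat", "bird", "dog", "cat"]
bitmap = build_bitmap_index(categories)
print(rows_equal(bitmap, "cat"))
print(rows_in(bitmap, ["dog", "bird"]))
```

With only three unique values the index holds three entries regardless of row count, which is why this layout works best for low-cardinality columns.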

Note that the LANCE_BYPASS_SPILLING environment variable can be used to bypass spilling to disk. Setting this to true can avoid memory exhaustion issues (see https://github.com/apache/datafusion/issues/10073 for more info).
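One way to set the variable from Python is shown below. This is a sketch under two assumptions that the text above does not spell out: that the variable is read from the process environment when the index is built (so it must be set beforehand), and that the string "true" is the accepted value.

```python
import os

# Assumption: must be set before the index build starts, and "true" is
# the accepted spelling of the value.
os.environ["LANCE_BYPASS_SPILLING"] = "true"

print(os.environ["LANCE_BYPASS_SPILLING"])
```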

Experimental API

Parameters:
  • column (str) – The column to be indexed. Must be a boolean, integer, float, or string column.

  • index_type (str) – The type of the index. One of "BTREE", "BITMAP", "LABEL_LIST", "NGRAM", "FTS" or "INVERTED".

  • name (str, optional) – The index name. If not provided, it will be generated from the column name.

  • replace (bool, default True) – Replace the existing index if it exists.

  • with_position (bool, default True) – This is for the INVERTED index. If True, the index will store the positions of the words in the document, so that you can run phrase queries. This will significantly increase the index size. It won’t impact the performance of non-phrase queries even if it is set to True.

  • base_tokenizer (str, default "simple") – This is for the INVERTED index. The base tokenizer to use. The value can be:

    • “simple”: splits tokens on whitespace and punctuation.

    • “whitespace”: splits tokens on whitespace.

    • “raw”: no tokenization.

  • language (str, default "English") – This is for the INVERTED index. The language for stemming and stop words. This is only used when stem or remove_stop_words is True.

  • max_token_length (Optional[int], default 40) – This is for the INVERTED index. The maximum token length. Any token longer than this will be removed.

  • lower_case (bool, default True) – This is for the INVERTED index. If True, the index will convert all text to lowercase.

  • stem (bool, default False) – This is for the INVERTED index. If True, the index will stem the tokens.

  • remove_stop_words (bool, default False) – This is for the INVERTED index. If True, the index will remove stop words.

  • ascii_folding (bool, default False) – This is for the INVERTED index. If True, the index will convert non-ascii characters to ascii characters if possible. This would remove accents like “é” -> “e”.
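The INVERTED index options above describe a token-processing pipeline. The sketch below approximates it in plain Python to show how the options interact. It is not Lance's tokenizer: the order of the steps, the regular expression used for “simple”, and the tiny stop-word list are all assumptions made for illustration, and stemming is omitted.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list for the example.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}


def fold_ascii(token):
    """Strip accents by decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text, base_tokenizer="simple", max_token_length=40,
             lower_case=True, remove_stop_words=False, ascii_folding=False):
    if base_tokenizer == "simple":
        tokens = re.findall(r"\w+", text)   # split on whitespace and punctuation
    elif base_tokenizer == "whitespace":
        tokens = text.split()               # split on whitespace only
    elif base_tokenizer == "raw":
        tokens = [text]                     # no tokenization
    else:
        raise ValueError(f"unknown base_tokenizer: {base_tokenizer!r}")
    if max_token_length is not None:
        # Tokens longer than the limit are removed, not truncated.
        tokens = [t for t in tokens if len(t) <= max_token_length]
    if lower_case:
        tokens = [t.lower() for t in tokens]
    if ascii_folding:
        tokens = [fold_ascii(t) for t in tokens]
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


text = "The Caf\u00e9, re-opened!"
print(tokenize(text))
print(tokenize(text, ascii_folding=True, remove_stop_words=True))
print(tokenize(text, base_tokenizer="whitespace"))
```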

Examples

import lance

dataset = lance.dataset("/tmp/images.lance")
dataset.create_scalar_index(
    "category",
    "BTREE",
)

Scalar indices can only speed up scans for basic filters using equality, comparison, range (e.g. my_col BETWEEN 0 AND 100), and set membership (e.g. my_col IN (0, 1, 2)).

Scalar indices can be used if the filter contains multiple indexed columns and the filter criteria are AND’d or OR’d together (e.g. my_col < 0 AND other_col > 100).

Scalar indices may be used if the filter contains non-indexed columns but, depending on the structure of the filter, they may not be usable. For example, if the column not_indexed does not have a scalar index then the filter my_col = 0 OR not_indexed = 1 will not be able to use any scalar index on my_col.
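The rules above can be modelled with a small recursive check over a filter tree. This is a simplified mental model, not Lance's query planner. The tuple encoding and the function name are hypothetical, and the AND rule (one indexed side is enough, with the other predicate rechecked on the candidate rows) is an assumption consistent with, but not stated by, the text above.

```python
def can_use_scalar_index(expr, indexed_columns):
    """Return True if some scalar index can help evaluate `expr`.

    `expr` is ("pred", column) for a basic predicate on one column,
    or ("and", left, right) / ("or", left, right).
    """
    kind = expr[0]
    if kind == "pred":
        return expr[1] in indexed_columns
    left = can_use_scalar_index(expr[1], indexed_columns)
    right = can_use_scalar_index(expr[2], indexed_columns)
    if kind == "and":
        # Assumption: an index on either side narrows the candidate rows.
        return left or right
    if kind == "or":
        # A non-indexed branch may match any row, so every branch must
        # be answerable from an index.
        return left and right
    raise ValueError(f"unknown node type: {kind!r}")


both_indexed = ("and", ("pred", "my_col"), ("pred", "other_col"))
or_mixed = ("or", ("pred", "my_col"), ("pred", "not_indexed"))
and_mixed = ("and", ("pred", "my_col"), ("pred", "not_indexed"))

print(can_use_scalar_index(both_indexed, {"my_col", "other_col"}))
print(can_use_scalar_index(or_mixed, {"my_col"}))
print(can_use_scalar_index(and_mixed, {"my_col"}))
```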

To determine if a scan is making use of a scalar index you can use explain_plan to look at the query plan that Lance has created. Queries that use scalar indices will either have a ScalarIndexQuery relation or a MaterializeIndex operator.

LanceDataset.drop_index(name: str)

Drops an index from the dataset.

Note: Indices are dropped by “index name”. This is not the same as the field name. If you did not specify a name when you created the index then a name was generated for you. You can use the list_indices method to get the names of the indices.

LanceDataset.scanner(columns: List[str] | Dict[str, str] | None = None, filter: str | Expression | None = None, limit: int | None = None, offset: int | None = None, nearest: dict | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, scan_in_order: bool | None = None, fragments: Iterable[LanceFragment] | None = None, full_text_query: str | dict | FullTextQuery | None = None, *, prefilter: bool | None = None, with_row_id: bool | None = None, with_row_address: bool | None = None, use_stats: bool | None = None, fast_search: bool | None = None, io_buffer_size: int | None = None, late_materialization: bool | List[str] | None = None, use_scalar_index: bool | None = None, include_deleted_rows: bool | None = None, scan_stats_callback: Callable[[ScanStatistics], None] | None = None, strict_batch_size: bool | None = None) → LanceScanner

Return a Scanner that can support various pushdowns.

Parameters:
  • columns (list of str, or dict of str to str, default None) – List of column names to be fetched, or a dictionary of column names to SQL expressions. All columns are fetched if None or unspecified.

  • filter (pa.compute.Expression or str) – Expression or str that is a valid SQL where clause. See Lance filter pushdown for valid SQL expressions.

  • limit (int, default None) – Fetch up to this many rows. All rows if None or unspecified.

  • offset (int, default None) – Fetch starting with this row. 0 if None or unspecified.

  • nearest (dict, default None) –

    Get the rows corresponding to the K most similar vectors. Example:

    {
        "column": <embedding col name>,
        "q": <query vector as pa.Float32Array>,
        "k": 10,
        "minimum_nprobes": 20,
        "maximum_nprobes": 50,
        "refine_factor": 1
    }
    

  • batch_size (int, default None) – The target size of the batches returned. In some cases batches can be larger than this, up to (but never more than) twice this size. In some cases batches can be smaller than this size.

  • io_buffer_size (int, default None) – The size of the IO buffer. See ScannerBuilder.io_buffer_size for more information.

  • batch_readahead (int, optional) – The number of batches to read ahead.

  • fragment_readahead (int, optional) – The number of fragments to read ahead.

  • scan_in_order (bool, default True) – Whether to read the fragments and batches in order. If false, throughput may be higher, but batches will be returned out of order and memory use might increase.

  • fragments (iterable of LanceFragment, default None) – If specified, only scan these fragments. If scan_in_order is True, then the fragments will be scanned in the order given.

  • prefilter (bool, default False) –

    If True then the filter will be applied before the vector query is run. This will generate more correct results but it may be a more costly query. It’s generally good when the filter is highly selective.

    If False then the filter will be applied after the vector query is run. This will perform well but the results may have fewer than the requested number of rows (or be empty) if the rows closest to the query do not match the filter. It’s generally good when the filter is not very selective.

  • use_scalar_index (bool, default True) – Lance will automatically use scalar indices to optimize a query. In some corner cases this can make query performance worse and this parameter can be used to disable scalar indices in these cases.

  • late_materialization (bool or List[str], default None) –

    Allows custom control over late materialization. Late materialization fetches non-query columns using a take operation after the filter. This is useful when there are few results or columns are very large.

    Early materialization can be better when there are many results or the columns are very narrow.

    If True, then all columns are late materialized. If False, then all columns are early materialized. If a list of strings, then only the columns in the list are late materialized.

    The default uses a heuristic that assumes filters will select about 0.1% of the rows. If your filter is more selective (e.g. find by id) you may want to set this to True. If your filter is not very selective (e.g. matches 20% of the rows) you may want to set this to False.

  • full_text_query (str or dict, optional) –

    The query string to search for; the results will be ranked by BM25. For example, “hello world” matches documents containing “hello” or “world”. Alternatively, a dictionary with the following keys:

    • columns: list[str]

      The columns to search, currently only supports a single column in the columns list.

    • query: str

      The query string to search for.

  • fast_search (bool, default False) – If True, then the search will only be performed on the indexed data, which yields faster search time.

  • scan_stats_callback (Callable[[ScanStatistics], None], default None) – A callback function that will be called with the scan statistics after the scan is complete. Errors raised by the callback will be logged but not re-raised.

  • include_deleted_rows (bool, default False) –

    If True, then rows that have been deleted, but are still present in the fragment, will be returned. These rows will have the _rowid column set to null. All other columns will reflect the value stored on disk and may not be null.

    Note: if this is a search operation, or a take operation (including scalar indexed scans) then deleted rows cannot be returned.
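The nearest and full_text_query arguments described above are plain dictionaries, so they can be assembled and inspected before being passed to scanner. The helpers below are hypothetical conveniences, not part of Lance; they use only the keys listed above, and a plain list of floats for q is assumed to be acceptable, as in the pre-filter example earlier in this section.

```python
def build_nearest(column, q, k=10, minimum_nprobes=None,
                  maximum_nprobes=None, refine_factor=None):
    """Assemble the `nearest` dict, leaving out options that are not set."""
    nearest = {"column": column, "q": q, "k": k}
    optional = {
        "minimum_nprobes": minimum_nprobes,
        "maximum_nprobes": maximum_nprobes,
        "refine_factor": refine_factor,
    }
    nearest.update({key: value for key, value in optional.items()
                    if value is not None})
    return nearest


def build_full_text_query(query, column):
    """Assemble the dict form of `full_text_query` for a single column."""
    return {"columns": [column], "query": query}


nearest = build_nearest("vector", [1.0, 2.0, 3.0, 4.0], k=10,
                        minimum_nprobes=20, maximum_nprobes=50)
fts = build_full_text_query("hello world", "text")
print(nearest)
print(fts)
```

The resulting dictionaries would then be passed as dataset.scanner(nearest=nearest) or dataset.scanner(full_text_query=fts); the column names "vector" and "text" are placeholders.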

Note

For now, if both filter and nearest are specified (and prefilter is not set to True), then:

  1. nearest is executed first.

  2. The results are filtered afterwards.
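The effect of this ordering can be seen in a toy brute-force search in plain Python. This is a conceptual sketch, not how Lance executes queries: the rows, the predicate, and the function names are made up for illustration. It shows why filtering after the vector search can return fewer than k rows, while filtering first cannot (as long as enough rows match).

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(rows, q, k):
    """Exact k nearest rows to q by L2 distance."""
    return sorted(rows, key=lambda row: l2(row["vector"], q))[:k]


def search(rows, q, k, predicate, prefilter):
    if prefilter:
        # Filter first, then take the k nearest among matching rows.
        return knn([row for row in rows if predicate(row)], q, k)
    # Take the k nearest first, then drop rows that fail the filter.
    return [row for row in knn(rows, q, k) if predicate(row)]


def not_seven(row):
    return row["my_col"] != 7


rows = [
    {"id": 0, "vector": [0.0, 0.0], "my_col": 7},
    {"id": 1, "vector": [0.1, 0.0], "my_col": 7},
    {"id": 2, "vector": [1.0, 1.0], "my_col": 3},
    {"id": 3, "vector": [2.0, 2.0], "my_col": 5},
]

post = [r["id"] for r in search(rows, [0.0, 0.0], 2, not_seven, prefilter=False)]
pre = [r["id"] for r in search(rows, [0.0, 0.0], 2, not_seven, prefilter=True)]
print(post)
print(pre)
```

Here the two nearest rows both fail the filter, so the post-filtered search comes back empty while the pre-filtered search returns two rows.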

For debugging ANN results, you can choose to not use the index even if present by specifying use_index=False. For example, the following will always return exact KNN results:

dataset.to_table(nearest={
    "column": "vector",
    "k": 10,
    "q": <query vector>,
    "use_index": False
})
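One common use of the exact results is to measure the recall of the indexed search. The sketch below assumes you have already collected row identifiers from both queries (for example by requesting them with with_row_id=True, which is an assumption about your setup); recall_at_k is a hypothetical helper, not a Lance function.

```python
def recall_at_k(ann_ids, exact_ids):
    """Fraction of the exact nearest neighbours that the ANN search found."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(set(ann_ids) & exact) / len(exact)


# Made-up ids: the indexed search missed one of the four true neighbours.
ann_ids = [1, 2, 3, 9]
exact_ids = [1, 2, 3, 4]
print(recall_at_k(ann_ids, exact_ids))
```

If recall is lower than desired, the nearest options described above (such as the nprobes settings and refine_factor) are the usual knobs to adjust.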