lance.LanceDataset.create_index - Lance documentation

lance.LanceDataset.create_index(column: str | list[str], index_type: str, name: str | None = None, metric: str = 'L2', replace: bool = False, num_partitions: int | None = None, ivf_centroids: np.ndarray | pa.FixedSizeListArray | pa.FixedShapeTensorArray | None = None, pq_codebook: np.ndarray | pa.FixedSizeListArray | pa.FixedShapeTensorArray | None = None, num_sub_vectors: int | None = None, accelerator: str | 'torch.Device' | None = None, index_cache_size: int | None = None, shuffle_partition_batches: int | None = None, shuffle_partition_concurrency: int | None = None, ivf_centroids_file: str | None = None, precomputed_partition_dataset: str | None = None, storage_options: dict[str, str] | None = None, filter_nan: bool = True, one_pass_ivfpq: bool = False, **kwargs) → LanceDataset

Create index on column.

Experimental API

Parameters:

column : str | list[str]

The column, or list of columns, to be indexed.

index_type : str

The type of the index. "IVF_PQ", "IVF_HNSW_PQ" and "IVF_HNSW_SQ" are currently supported.

name : str, optional

The index name. If not provided, it will be generated from the column name.

metric : str

The distance metric type: “L2” (alias of “euclidean”), “cosine”, or “dot” (dot product). Default is “L2”.
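As a rough illustration of what the three metric names refer to, the sketch below gives the textbook definitions in plain Python. The function names are hypothetical, and Lance's internal conventions (for example, whether L2 is reported squared, or how dot-product scores are turned into distances) are not stated here and may differ.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def dot_product(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))
```

Cosine distance ignores vector length, so it is usually the natural choice when only direction matters; L2 and dot product are both sensitive to magnitude.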

replace : bool

Replace the existing index if one exists.

num_partitions : int, optional

The number of partitions of IVF (Inverted File Index).

ivf_centroids : optional

A num_partitions x dimension array of existing k-means centroids for IVF clustering. It can be an np.ndarray, a pyarrow.FixedSizeListArray or a pyarrow.FixedShapeTensorArray. If not provided, a new k-means model will be trained.

pq_codebook : optional

A num_sub_vectors x (2 ^ nbits * dimensions // num_sub_vectors) array of k-means centroids for the PQ codebook. It can be an np.ndarray, a pyarrow.FixedSizeListArray, or a pyarrow.FixedShapeTensorArray.

Note: nbits is always 8 for now. If not provided, a new PQ model will be trained.
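The shape expression above can be worked through with a small helper. This is only a sketch of the arithmetic: pq_codebook_shape is a hypothetical name, and the check that the dimension divides evenly by num_sub_vectors is an assumption made here so that the integer division is exact.

```python
def pq_codebook_shape(dimension, num_sub_vectors, nbits=8):
    """Return the (rows, columns) shape described for pq_codebook.

    Assumption: dimension is an exact multiple of num_sub_vectors.
    """
    if dimension % num_sub_vectors != 0:
        raise ValueError("dimension must be divisible by num_sub_vectors")
    # Each sub-vector has this many components.
    sub_dim = dimension // num_sub_vectors
    # 2 ** nbits centroids per sub-vector, each of length sub_dim.
    return (num_sub_vectors, (2 ** nbits) * sub_dim)

# 128-dimensional vectors split into 16 sub-vectors, nbits = 8
shape = pq_codebook_shape(128, 16)
```

For this input each row holds 256 centroids of 8 components each, so the shape is (16, 2048).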

num_sub_vectors : int, optional

The number of sub-vectors for PQ (Product Quantization).

accelerator : str or torch.Device, optional

If set, use an accelerator to speed up the training process. Accepted accelerators: “cuda” (Nvidia GPU) and “mps” (Apple Silicon GPU). If not set, use the CPU.

index_cache_size : int, optional

The size of the index cache in number of entries. Default value is 256.

shuffle_partition_batches : int, optional

The number of batches, using the row group size of the dataset, to include in each shuffle partition. Default value is 10240.

Assuming the row group size is 1024, each shuffle partition will hold 10240 * 1024 = 10,485,760 rows. By making this value smaller, this shuffle will consume less memory but will take longer to complete, and vice versa.
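The row count quoted above follows directly from the two numbers involved. The helper below is a hypothetical convenience for estimating it, not part of the Lance API; the default row group size of 1024 is the assumption used in the text.

```python
def rows_per_shuffle_partition(shuffle_partition_batches=10240, row_group_size=1024):
    """Rows held by one shuffle partition: batches times rows per batch."""
    return shuffle_partition_batches * row_group_size

default_rows = rows_per_shuffle_partition()      # default setting
smaller_rows = rows_per_shuffle_partition(1024)  # ten times fewer batches
```

With the defaults this gives 10,485,760 rows per partition; dropping shuffle_partition_batches to 1024 gives 1,048,576 rows, which is the kind of reduction to consider when memory is tight.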

shuffle_partition_concurrency : int, optional

The number of shuffle partitions to process concurrently. Default value is 2.

By making this value smaller, this shuffle will consume less memory but will take longer to complete, and vice versa.

storage_options : dict, optional

Extra options that make sense for a particular storage connection. This is used to store connection parameters like credentials, endpoint, etc.

filter_nan : bool

Defaults to True. Setting this to False disables the null filter used for nullable columns, which gives a small speed boost. False is UNSAFE: it will cause a crash if any null/NaN values are present (and will not otherwise).
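Since disabling the filter is only safe when the column holds no null or NaN values, one cautious approach is to check the data first. The sketch below does this over plain Python lists; has_null_or_nan is a hypothetical helper, and how the vectors are read out of the dataset is left to the caller.

```python
import math

def has_null_or_nan(vectors):
    """Return True if any vector is None or contains a None/NaN component."""
    for vec in vectors:
        if vec is None:
            return True
        for value in vec:
            if value is None:
                return True
            if isinstance(value, float) and math.isnan(value):
                return True
    return False

clean = [[0.1, 0.2], [0.3, 0.4]]
dirty = [[0.1, float("nan")], None]

# Only consider filter_nan=False when the check comes back clean.
safe_to_disable_filter = not has_null_or_nan(clean)
```

A full scan like this may cost more than the speed boost saves, so it mostly makes sense when the data is already known to be clean from an earlier pipeline step.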

one_pass_ivfpq : bool

Defaults to False. If enabled, index type must be “IVF_PQ”. Reduces disk IO.

**kwargs

Parameters passed to the index building process.

SQ (Scalar Quantization) is available only for the IVF_HNSW_SQ index type. This quantization method reduces the memory usage of the index by mapping float vectors to integer vectors, where each integer has num_bits bits. Currently only 8 bits are supported.
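To give a feel for what mapping floats to 8-bit integers involves, here is a minimal min-max scalar quantizer in plain Python. It is an illustrative assumption, not Lance's implementation: the function names, the explicit lo/hi bounds and the rounding scheme are all hypothetical.

```python
def scalar_quantize(vector, lo, hi, num_bits=8):
    """Map floats in [lo, hi] to integer codes in [0, 2**num_bits - 1].

    Assumes hi > lo; values outside the range are clipped.
    """
    levels = (1 << num_bits) - 1
    scale = levels / (hi - lo)
    codes = []
    for v in vector:
        clipped = min(max(v, lo), hi)
        codes.append(round((clipped - lo) * scale))
    return codes

def scalar_dequantize(codes, lo, hi, num_bits=8):
    """Approximate inverse of scalar_quantize."""
    levels = (1 << num_bits) - 1
    step = (hi - lo) / levels
    return [lo + c * step for c in codes]

codes = scalar_quantize([-1.0, 0.25, 1.0], lo=-1.0, hi=1.0)
approx = scalar_dequantize(codes, lo=-1.0, hi=1.0)
```

Each component is stored in one byte instead of a 4-byte float, at the cost of a reconstruction error of at most half a quantization step per component.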

If index_type is “IVF_*”, then the following parameter is required: num_partitions
If index_type contains “PQ”, then the following parameter is required: num_sub_vectors
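The two rules above can be expressed as a small pre-flight check. The helpers below are hypothetical and only encode what is stated here (an “IVF_” prefix requires num_partitions, a “PQ” component requires num_sub_vectors); they do not reproduce any validation that create_index itself performs.

```python
def required_params(index_type):
    """List the parameters the text marks as required for an index type."""
    required = []
    if index_type.startswith("IVF_"):
        required.append("num_partitions")
    if "PQ" in index_type:
        required.append("num_sub_vectors")
    return required

def missing_params(index_type, **params):
    """Return required parameter names that were not supplied (or are None)."""
    return [name for name in required_params(index_type)
            if params.get(name) is None]

missing = missing_params("IVF_HNSW_PQ", num_partitions=256)
```

Here missing is ["num_sub_vectors"], because the index type contains “PQ” but no num_sub_vectors was given.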

Optional parameters for IVF_PQ:

ivf_centroids
Existing k-means centroids for IVF clustering.

num_bits
The number of bits for PQ (Product Quantization). Default is 8. Only 4 and 8 are supported.

index_file_version
The version of the index file. Default is “V3”.

Optional parameters for IVF_HNSW_*:

max_level: int, the maximum number of levels in the graph.
m: int, the number of edges per node in the graph.
ef_construction: int, the number of nodes to examine during construction.
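The HNSW options listed above are passed as extra keyword arguments. The wrapper below is a sketch of such a call: create_hnsw_sq_index is a hypothetical helper, and the values chosen for m and ef_construction are illustrative assumptions, not recommended or default settings.

```python
def create_hnsw_sq_index(dataset, column="vector", num_partitions=256,
                         m=20, ef_construction=150):
    """Build an IVF_HNSW_SQ index, forwarding HNSW options as keyword arguments.

    `dataset` is expected to be a lance.LanceDataset; the m and
    ef_construction values are placeholders for illustration only.
    """
    return dataset.create_index(
        column,
        "IVF_HNSW_SQ",
        num_partitions=num_partitions,  # required for IVF_* index types
        m=m,                            # edges per node in the graph
        ef_construction=ef_construction,  # nodes examined during construction
    )
```

It would be called as create_hnsw_sq_index(lance.dataset("/tmp/sift.lance")). Larger m and ef_construction values generally trade longer build time and a bigger graph for better recall.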

Examples

import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_PQ",
    num_partitions=256,
    num_sub_vectors=16
)

import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_HNSW_SQ",
    num_partitions=256,
)

Experimental Accelerator (GPU) support:

accelerator: use a GPU to train IVF partitions.
Currently only CUDA (Nvidia) and MPS (Apple) are supported. Requires PyTorch to be installed.

import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_PQ",
    num_partitions=256,
    num_sub_vectors=16,
    accelerator="cuda"
)
