lance.LanceDataset.create_index(column: str | list[str], index_type: str, name: str | None = None, metric: str = 'L2', replace: bool = False, num_partitions: int | None = None, ivf_centroids: np.ndarray | pa.FixedSizeListArray | pa.FixedShapeTensorArray | None = None, pq_codebook: np.ndarray | pa.FixedSizeListArray | pa.FixedShapeTensorArray | None = None, num_sub_vectors: int | None = None, accelerator: str | 'torch.Device' | None = None, index_cache_size: int | None = None, shuffle_partition_batches: int | None = None, shuffle_partition_concurrency: int | None = None, ivf_centroids_file: str | None = None, precomputed_partition_dataset: str | None = None, storage_options: dict[str, str] | None = None, filter_nan: bool = True, one_pass_ivfpq: bool = False, **kwargs) LanceDataset

Create index on column.

Experimental API

Parameters:
column : str

The column to be indexed.

index_type : str

The type of the index. "IVF_PQ, IVF_HNSW_PQ and IVF_HNSW_SQ" are supported now.

name : str, optional

The index name. If not provided, it will be generated from the column name.

metric : str

The distance metric type, i.e., “L2” (alias to “euclidean”), “cosine” or “dot” (dot product). Default is “L2”.

replace : bool

Replace the existing index if it exists.

num_partitions : int, optional

The number of partitions of IVF (Inverted File Index).

ivf_centroids : optional

It can be either np.ndarray, pyarrow.FixedSizeListArray or pyarrow.FixedShapeTensorArray. A num_partitions x dimension array of existing K-mean centroids for IVF clustering. If not provided, a new KMeans model will be trained.

pq_codebook : optional,

It can be np.ndarray, pyarrow.FixedSizeListArray, or pyarrow.FixedShapeTensorArray. A num_sub_vectors x (2 ^ nbits * dimensions // num_sub_vectors) array of K-mean centroids for PQ codebook.

Note: nbits is always 8 for now. If not provided, a new PQ model will be trained.

num_sub_vectors : int, optional

The number of sub-vectors for PQ (Product Quantization).

accelerator: str | 'torch.Device' | None = None

If set, use an accelerator to speed up the training process. Accepted accelerator: “cuda” (Nvidia GPU) and “mps” (Apple Silicon GPU). If not set, use the CPU.

index_cache_size : int, optional

The size of the index cache in number of entries. Default value is 256.

shuffle_partition_batches : int, optional

The number of batches, using the row group size of the dataset, to include in each shuffle partition. Default value is 10240.

Assuming the row group size is 1024, each shuffle partition will hold 10240 * 1024 = 10,485,760 rows. By making this value smaller, this shuffle will consume less memory but will take longer to complete, and vice versa.

shuffle_partition_concurrency : int, optional

The number of shuffle partitions to process concurrently. Default value is 2

By making this value smaller, this shuffle will consume less memory but will take longer to complete, and vice versa.

storage_options : optional, dict

Extra options that make sense for a particular storage connection. This is used to store connection parameters like credentials, endpoint, etc.

filter_nan : bool

Defaults to True. False is UNSAFE, and will cause a crash if any null/nan values are present (and otherwise will not). Disables the null filter used for nullable columns. Obtains a small speed boost.

one_pass_ivfpq : bool

Defaults to False. If enabled, index type must be “IVF_PQ”. Reduces disk IO.

**kwargs

Parameters passed to the index building process.

The SQ (Scalar Quantization) is available for only IVF_HNSW_SQ index type, this quantization method is used to reduce the memory usage of the index, it maps the float vectors to integer vectors, each integer is of num_bits, now only 8 bits are supported.

If index_type is “IVF_*”, then the following parameters are required:

num_partitions

If index_type is with “PQ”, then the following parameters are required:

num_sub_vectors

Optional parameters for IVF_PQ:

  • ivf_centroids

    Existing K-mean centroids for IVF clustering.

  • num_bits

    The number of bits for PQ (Product Quantization). Default is 8. Only 4, 8 are supported.

Optional parameters for IVF_HNSW_*:
max_level

Int, the maximum number of levels in the graph.

m

Int, the number of edges per node in the graph.

ef_construction

Int, the number of nodes to examine during the construction.

Examples

import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_PQ",
    num_partitions=256,
    num_sub_vectors=16
)
import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_HNSW_SQ",
    num_partitions=256,
)

Experimental Accelerator (GPU) support:

  • accelerate: use GPU to train IVF partitions.

    Only supports CUDA (Nvidia) or MPS (Apple) currently. Requires PyTorch being installed.

import lance

dataset = lance.dataset("/tmp/sift.lance")
dataset.create_index(
    "vector",
    "IVF_PQ",
    num_partitions=256,
    num_sub_vectors=16,
    accelerator="cuda"
)

References