lance.LanceDataset.create_index(
    column: str | list[str],
    index_type: str,
    name: str | None = None,
    metric: str = 'L2',
    replace: bool = False,
    num_partitions: int | None = None,
    ivf_centroids: np.ndarray | pa.FixedSizeListArray | pa.FixedShapeTensorArray | None = None,
    pq_codebook: np.ndarray | pa.FixedSizeListArray | pa.FixedShapeTensorArray | None = None,
    num_sub_vectors: int | None = None,
    accelerator: str | 'torch.Device' | None = None,
    index_cache_size: int | None = None,
    shuffle_partition_batches: int | None = None,
    shuffle_partition_concurrency: int | None = None,
    ivf_centroids_file: str | None = None,
    precomputed_partition_dataset: str | None = None,
    storage_options: dict[str, str] | None = None,
    filter_nan: bool = True,
    one_pass_ivfpq: bool = False,
    **kwargs,
) -> LanceDataset

Create index on column.
Experimental API
- Parameters:
- column : str
The column to be indexed.
- index_type : str
The type of the index. “IVF_PQ”, “IVF_HNSW_PQ” and “IVF_HNSW_SQ” are currently supported.
- name : str, optional
The index name. If not provided, it will be generated from the column name.
- metric : str
The distance metric type: “L2” (alias of “euclidean”), “cosine” or “dot” (dot product). Default is “L2”.
- replace : bool
Replace the existing index if it exists.
- num_partitions : int, optional
The number of partitions of IVF (Inverted File Index).
- ivf_centroids : optional
It can be np.ndarray, pyarrow.FixedSizeListArray, or pyarrow.FixedShapeTensorArray. A num_partitions x dimension array of existing K-means centroids for IVF clustering. If not provided, a new KMeans model will be trained.
- pq_codebook : optional
It can be np.ndarray, pyarrow.FixedSizeListArray, or pyarrow.FixedShapeTensorArray. A num_sub_vectors x (2 ^ nbits * dimensions // num_sub_vectors) array of K-means centroids for the PQ codebook. Note: nbits is always 8 for now. If not provided, a new PQ model will be trained.
- num_sub_vectors : int, optional
The number of sub-vectors for PQ (Product Quantization).
- accelerator : str | 'torch.Device' | None = None
If set, use an accelerator to speed up the training process. Accepted accelerators: “cuda” (Nvidia GPU) and “mps” (Apple Silicon GPU). If not set, use the CPU.
- index_cache_size : int, optional
The size of the index cache in number of entries. Default value is 256.
- shuffle_partition_batches : int, optional
The number of batches, using the row group size of the dataset, to include in each shuffle partition. Default value is 10240.
Assuming the row group size is 1024, each shuffle partition will hold 10240 * 1024 = 10,485,760 rows. Making this value smaller reduces the memory the shuffle consumes but makes it take longer to complete, and vice versa.
- shuffle_partition_concurrency : int, optional
The number of shuffle partitions to process concurrently. Default value is 2.
Making this value smaller reduces the memory the shuffle consumes but makes it take longer to complete, and vice versa.
- storage_options : dict, optional
Extra options that make sense for a particular storage connection. This is used to store connection parameters like credentials, endpoint, etc.
- filter_nan : bool
Defaults to True. Setting this to False is UNSAFE: it disables the null filter used for nullable columns in exchange for a small speed boost, and will cause a crash if any null/NaN values are present (and will not otherwise).
- one_pass_ivfpq : bool
Defaults to False. If enabled, index type must be “IVF_PQ”. Reduces disk IO.
- **kwargs
Parameters passed to the index building process.
SQ (Scalar Quantization) is available only for the IVF_HNSW_SQ index type. This quantization method reduces the memory usage of the index by mapping float vectors to integer vectors. Each integer has num_bits bits; only 8 bits are supported for now.
- If index_type is “IVF_*”, then the following parameters are required: num_partitions
- If index_type is with “PQ”, then the following parameters are required: num_sub_vectors
- Optional parameters for IVF_PQ:
  - ivf_centroids
  Existing K-means centroids for IVF clustering.
  - num_bits
  The number of bits for PQ (Product Quantization). Default is 8. Only 4 and 8 are supported.
- Optional parameters for IVF_HNSW_*:
  - max_level
  Int, the maximum number of levels in the graph.
  - m
  Int, the number of edges per node in the graph.
  - ef_construction
  Int, the number of nodes to examine during the construction.
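The sizes mentioned in the parameter list above can be worked out with a few lines of arithmetic. The sketch below is only an illustration: the helper names are hypothetical and not part of lance, and the check that the dimension divides evenly by num_sub_vectors is an assumption of this sketch rather than a documented rule.

```python
def ivf_centroids_shape(num_partitions: int, dimension: int) -> tuple:
    """Expected shape of ivf_centroids: num_partitions x dimension."""
    return (num_partitions, dimension)


def pq_codebook_shape(num_sub_vectors: int, dimension: int, nbits: int = 8) -> tuple:
    """Expected shape of pq_codebook:
    num_sub_vectors x (2 ^ nbits * dimension // num_sub_vectors)."""
    # Assumption of this sketch: the vector dimension splits evenly
    # into num_sub_vectors sub-vectors.
    if dimension % num_sub_vectors != 0:
        raise ValueError("dimension must be divisible by num_sub_vectors")
    return (num_sub_vectors, (2 ** nbits) * dimension // num_sub_vectors)


def rows_per_shuffle_partition(shuffle_partition_batches: int, row_group_size: int) -> int:
    """Rows held by one shuffle partition: batches * rows per batch."""
    return shuffle_partition_batches * row_group_size


# 128-dimensional vectors, 256 IVF partitions, 16 PQ sub-vectors
print(ivf_centroids_shape(256, 128))            # (256, 128)
print(pq_codebook_shape(16, 128))               # (16, 2048)
print(rows_per_shuffle_partition(10240, 1024))  # 10485760
```

An array passed as ivf_centroids or pq_codebook would be expected to match these shapes for the chosen num_partitions and num_sub_vectors.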
Examples
    import lance

    dataset = lance.dataset("/tmp/sift.lance")
    dataset.create_index(
        "vector",
        "IVF_PQ",
        num_partitions=256,
        num_sub_vectors=16
    )

    import lance

    dataset = lance.dataset("/tmp/sift.lance")
    dataset.create_index(
        "vector",
        "IVF_HNSW_SQ",
        num_partitions=256,
    )
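The rules for required and optional parameters can be checked before starting a long index build. The following is a minimal sketch, not part of lance: build_index_args is a hypothetical helper, and the values chosen for m and ef_construction are purely illustrative, not recommended defaults.

```python
from typing import Optional

# Optional IVF_HNSW_* parameters listed in the documentation above.
HNSW_OPTIONS = {"max_level", "m", "ef_construction"}


def build_index_args(
    index_type: str,
    num_partitions: Optional[int] = None,
    num_sub_vectors: Optional[int] = None,
    **hnsw_options: int,
) -> dict:
    """Hypothetical helper: collect keyword arguments for create_index,
    enforcing the required-parameter rules described above."""
    args = {"index_type": index_type}
    if index_type.startswith("IVF_"):
        if num_partitions is None:
            raise ValueError("IVF_* index types require num_partitions")
        args["num_partitions"] = num_partitions
    if "PQ" in index_type:
        if num_sub_vectors is None:
            raise ValueError("PQ index types require num_sub_vectors")
        args["num_sub_vectors"] = num_sub_vectors
    unknown = set(hnsw_options) - HNSW_OPTIONS
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    if hnsw_options and "HNSW" not in index_type:
        raise ValueError("HNSW options only apply to IVF_HNSW_* index types")
    args.update(hnsw_options)
    return args


args = build_index_args(
    "IVF_HNSW_PQ",
    num_partitions=256,
    num_sub_vectors=16,
    m=20,                 # illustrative value only
    ef_construction=150,  # illustrative value only
)
print(args)

# With a dataset opened as in the examples above, the arguments could
# then be passed through:
#     dataset.create_index("vector", **args)
```

Validating up front keeps a missing num_partitions or num_sub_vectors from surfacing only after the build has started.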
Experimental Accelerator (GPU) support:
- accelerator: use GPU to train IVF partitions.
Only CUDA (Nvidia) and MPS (Apple) are currently supported. Requires PyTorch to be installed.

    import lance

    dataset = lance.dataset("/tmp/sift.lance")
    dataset.create_index(
        "vector",
        "IVF_PQ",
        num_partitions=256,
        num_sub_vectors=16,
        accelerator="cuda"
    )
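Since the accelerator option depends on PyTorch and on the hardware present, a script that runs on several machines may want to choose the value at runtime. The sketch below shows one possible way to do that; pick_accelerator is a hypothetical helper, not part of lance, and it relies only on PyTorch's own availability checks.

```python
from typing import Optional


def pick_accelerator() -> Optional[str]:
    """Hypothetical helper: return "cuda", "mps", or None (CPU)
    depending on what PyTorch reports as available."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so fall back to CPU training.
        return None
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds may not ship the MPS backend module.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return None


accelerator = pick_accelerator()
print(accelerator)

# With a dataset opened as in the example above, the result could be
# passed straight through (None means the CPU is used):
#     dataset.create_index(
#         "vector", "IVF_PQ",
#         num_partitions=256, num_sub_vectors=16,
#         accelerator=accelerator,
#     )
```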
References