lance.torch package

Submodules

lance.torch.async_dataset module

class lance.torch.async_dataset.AsyncDataset(dataset_creator: Callable[[], IterableDataset], *, queue_size: int = 4)

Bases: IterableDataset

close()

lance.torch.async_dataset.async_dataset(dataset_creator: Callable[[], IterableDataset], *, queue_size: int = 4) → Iterable[AsyncDataset]
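The signatures suggest a wrapper that takes a dataset factory and buffers up to queue_size items, but this page does not describe the mechanism. The following is only a sketch of one common way to build such a wrapper: a background thread that reads ahead into a bounded queue. The name prefetch and the thread-based design are assumptions made for illustration, not the library's implementation.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the stream


def prefetch(creator: Callable[[], Iterable[T]], queue_size: int = 4) -> Iterator[T]:
    """Yield items from creator() while a background thread reads ahead.

    Hypothetical helper: the source is created inside the worker thread,
    which is why a factory is passed instead of an already built iterable.
    """
    buffer: queue.Queue = queue.Queue(maxsize=queue_size)

    def producer() -> None:
        try:
            for item in creator():
                buffer.put(item)  # blocks while the queue is full
        finally:
            buffer.put(_DONE)  # always signal the end, even after an error

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item


batches = list(prefetch(lambda: range(10), queue_size=4))
```

In this sketch an exception raised by the source ends the stream early instead of being re-raised in the consumer, and a consumer that stops early leaves the producer blocked; a production wrapper would need explicit cleanup for both cases.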

lance.torch.bench_utils module

Benchmark Utilities built on PyTorch

lance.torch.bench_utils.ground_truth(ds: LanceDataset, column: str, query: Tensor | ndarray, metric_type: str = 'L2', k: int = 100, batch_size: int = 10240, device: str | None = None) → Tensor

Find ground truth from dataset.

Parameters:

ds (LanceDataset) – The dataset to test.
column (str) – The name of the vector column.
query (Tensor or ndarray) – 2-D query vectors with shape [N, dimension].
k (int) – The number of the nearest vectors to collect for each query vector.
metric_type (str) – Metric type. How to compute distance, accepts L2 or cosine.
batch_size (int) – Batch size to read from the input dataset.

Returns:

A 2-D array of row_ids of the nearest vectors for each query vector.
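As an illustration of what a ground-truth computation involves, the sketch below performs an exact (brute-force) k-nearest-neighbor search under L2 distance, scanning the vectors in batches and keeping a running top-k per query. It uses NumPy on an in-memory array and returns array indices instead of Lance row ids; the function name brute_force_knn is hypothetical and this is not the library's implementation.

```python
import numpy as np


def brute_force_knn(vectors: np.ndarray, query: np.ndarray, k: int,
                    batch_size: int = 10240) -> np.ndarray:
    """Return, for each query, the indices of its k nearest vectors (L2)."""
    n_queries = query.shape[0]
    best_dist = np.full((n_queries, k), np.inf)
    best_idx = np.full((n_queries, k), -1, dtype=np.int64)
    for start in range(0, vectors.shape[0], batch_size):
        batch = vectors[start:start + batch_size]
        # Squared L2 distance between every query and every vector in the batch.
        diff = query[:, None, :] - batch[None, :, :]
        dist = np.einsum("qbd,qbd->qb", diff, diff)
        idx = np.arange(start, start + batch.shape[0])
        # Merge the batch with the running top-k, then keep the k smallest.
        all_dist = np.concatenate([best_dist, dist], axis=1)
        all_idx = np.concatenate([best_idx, np.broadcast_to(idx, dist.shape)], axis=1)
        order = np.argsort(all_dist, axis=1, kind="stable")[:, :k]
        best_dist = np.take_along_axis(all_dist, order, axis=1)
        best_idx = np.take_along_axis(all_idx, order, axis=1)
    return best_idx


vectors = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 2.0]])
query = np.array([[0.1, 0.0], [5.0, 4.0]])
neighbors = brute_force_knn(vectors, query, k=2, batch_size=3)
```

Processing in batches bounds memory use to roughly N × batch_size × dimension values at a time, which is the role a batch_size parameter typically plays in this kind of scan.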

lance.torch.data module

Read Lance dataset as torch DataPipe.

Bases: IterableDataset

A PyTorch torch.utils.data.IterableDataset over a Lance dataset.

lance.torch.dist module

PyTorch Distributed Utilities

lance.torch.dist.get_dist_rank() → int

Get the rank of the current process in the distributed training setup.

Returns:

int: The rank of the current process if distributed training is initialized, otherwise 0.

lance.torch.dist.get_dist_world_size() → int

Get the number of processes in the distributed training setup.

Returns:

int: The number of distributed processes if distributed training is initialized, otherwise 1.

lance.torch.dist.get_global_rank() → int

Get the global rank of the current process across distributed and multiprocessing contexts.

Returns:

int: The global rank of the current process.

lance.torch.dist.get_global_world_size() → int

Get the global world size across distributed and multiprocessing contexts.

Returns:

int: The global world size, defaulting to 1 if not set in the environment.

lance.torch.dist.get_mp_rank() → int

Get the rank of the current DataLoader worker process.

Returns:

int: The rank of the current DataLoader worker if running in a worker process, otherwise 0.

lance.torch.dist.get_mp_world_size() → int

Get the number of worker processes for the current DataLoader.

Returns:

int: The number of worker processes if running in a DataLoader worker, otherwise 1.
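The functions above describe two nested levels of parallelism: distributed training processes, and DataLoader workers inside each process. This page does not state how the global values are derived from the two levels, so the sketch below shows one plausible scheme, assumed here for illustration: flatten the pair (process rank, worker rank) into a single index and use it to shard work. All function names in the block are hypothetical.

```python
def global_rank(dist_rank: int, mp_rank: int, mp_world_size: int) -> int:
    """Flatten (process rank, worker rank) into a single index."""
    return dist_rank * mp_world_size + mp_rank


def global_world_size(dist_world_size: int, mp_world_size: int) -> int:
    """Total number of readers across all processes and workers."""
    return dist_world_size * mp_world_size


def shard(items: list, rank: int, world_size: int) -> list:
    """Give each global rank every world_size-th item."""
    return items[rank::world_size]


# 2 training processes, each with 3 DataLoader workers: 6 readers in total.
world = global_world_size(dist_world_size=2, mp_world_size=3)
ranks = [global_rank(d, w, 3) for d in range(2) for w in range(3)]
fragments = list(range(10))
assignment = [shard(fragments, r, world) for r in ranks]
```

With this scheme every reader receives a disjoint slice and the slices together cover all items, which is the property a sharded data loader needs to avoid duplicated or skipped rows.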

lance.torch.distance module

lance.torch.distance.cosine_distance(vectors: Tensor, centroids: Tensor) → Tuple[Tensor, Tensor]

Cosine pair-wise distances between two 2-D Tensors.

Cosine distance = 1 - (x · y) / (||x|| * ||y||)

Parameters:

vectors (torch.Tensor) – A 2-D [N, D] tensor
centroids (torch.Tensor) – A 2-D [M, D] tensor

Returns:

A tuple of Tensors: the centroid ids, and the distances to those centroids.
A 2-D [N, M] tensor of cosine distances between x and y

lance.torch.distance.l2_distance(vectors: Tensor, centroids: Tensor, y2: Tensor | None = None) → Tuple[Tensor, Tensor]

Pair-wise L2 / Euclidean distance between two 2-D Tensors.

Parameters:

vectors (torch.Tensor) – A 2-D [N, D] tensor
centroids (torch.Tensor) – A 2-D [M, D] tensor

Returns:

A tuple of Tensors: the centroid ids, and the distances to those centroids.
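A minimal NumPy sketch of this kind of computation follows. It expands ||x - y||² = ||x||² - 2 x·y + ||y||² so that the pair-wise distances reduce to one matrix product, then picks the closest centroid per vector. Two points are assumptions, since this page does not document them: that y2 holds precomputed squared norms of the centroids, and that the returned pair refers to the nearest centroid. The name l2_assign is hypothetical, and the sketch returns Euclidean (not squared) distances.

```python
import numpy as np


def l2_assign(vectors: np.ndarray, centroids: np.ndarray, y2=None):
    """Return (nearest centroid id, distance to it) for each vector."""
    if y2 is None:
        # Assumption: y2 is the squared norm of each centroid, shape [M].
        y2 = (centroids ** 2).sum(axis=1)
    x2 = (vectors ** 2).sum(axis=1, keepdims=True)  # [N, 1]
    # ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, evaluated for all pairs.
    sq = x2 - 2.0 * vectors @ centroids.T + y2[None, :]  # [N, M]
    sq = np.maximum(sq, 0.0)  # clamp tiny negatives from rounding
    ids = sq.argmin(axis=1)
    dists = np.sqrt(sq[np.arange(len(ids)), ids])
    return ids, dists


vectors = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]])
centroids = np.array([[0.0, 0.0], [9.0, 0.0]])
ids, dists = l2_assign(vectors, centroids)
```

Passing the centroid norms in separately is useful when the same centroids are compared against many batches of vectors, because the norms are then computed only once.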

lance.torch.distance.pairwise_cosine(x: Tensor, y: Tensor, *, y2: Tensor | None = None) → Tensor

Compute pair-wise cosine distance between x and y.

Parameters:

x (torch.Tensor) – A 2-D [M, D] tensor, containing M vectors.
y (torch.Tensor) – A 2-D [N, D] tensor, containing N vectors.

Returns:

A [M, N] tensor with pair-wise cosine distances between x and y.
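The shape contract can be illustrated with a small NumPy sketch: given x of shape [M, D] and y of shape [N, D], the result is an [M, N] matrix whose entry (i, j) is one minus the cosine similarity of x[i] and y[j]. This is an illustration written for this page, not the library's code; the name pairwise_cosine_sketch is hypothetical and the optional y2 argument is omitted.

```python
import numpy as np


def pairwise_cosine_sketch(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return the [M, N] matrix of cosine distances between rows of x and y."""
    x_norm = np.linalg.norm(x, axis=1, keepdims=True)  # [M, 1]
    y_norm = np.linalg.norm(y, axis=1, keepdims=True)  # [N, 1]
    similarity = (x @ y.T) / (x_norm * y_norm.T)        # [M, N]
    return 1.0 - similarity


x = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
dist = pairwise_cosine_sketch(x, y)
```

The distance is 0 for vectors pointing the same way, 1 for orthogonal vectors, and 2 for opposite vectors. The sketch does not guard against zero-length rows, which would cause a division by zero.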

lance.torch.kmeans module

class lance.torch.kmeans.KMeans(k: int, *, metric: Literal['l2', 'euclidean', 'cosine', 'dot'] = 'l2', init: Literal['random'] = 'random', max_iters: int = 50, tolerance: float = 0.0001, centroids: Tensor | None = None, seed: int | None = None, device: str | None = None)

Bases: object

K-Means trains over vectors and divides them into K clusters.

This implementation is built on PyTorch and supports CPU, GPU, and Apple Silicon GPU.

Parameters:

k (int) – The number of clusters
metric (str) – Metric type. Supports “l2” (or “euclidean”), “cosine”, or “dot”.
init (str) – Initialization method. Only “random” is currently supported.
max_iters (int) – Maximum number of iterations used to train the k-means model.
tolerance (float) – Relative tolerance, with regard to the Frobenius norm of the difference between the cluster centers of two consecutive iterations, used to declare convergence.
centroids (torch.Tensor, optional) – Existing centroids, if available.
seed (int, optional) – Random seed.
device (str, optional) – The device on which to run the PyTorch algorithms. By default, the most performant device on the host is selected. See lance.torch.preferred_device().

fit(data: IterableDataset | ndarray | Tensor | FixedSizeListArray) → None

Train the k-means model.

Parameters:

data (IterableDataset, pa.FixedSizeListArray, np.ndarray, or torch.Tensor) – 2-D vectors used to train k-means.

rebuild_index()

transform(data: Array | ndarray | Tensor) → Tensor

Transform the input data to cluster ids for each row.
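To make the roles of k, init, max_iters, tolerance, and seed concrete, here is a minimal NumPy sketch of Lloyd's algorithm with random initialization and L2 distance. It is a generic textbook version written for illustration, not this class's implementation. Two details are assumptions: the relative tolerance is taken as the Frobenius norm of the centroid change divided by the Frobenius norm of the previous centroids, and an empty cluster keeps its previous centroid. The names kmeans_sketch and assign are hypothetical; assign plays the role that transform is described as playing above.

```python
import numpy as np


def assign(data: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Return the index of the nearest centroid (squared L2) for each row."""
    diff = data[:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=2).argmin(axis=1)


def kmeans_sketch(data, k, max_iters=50, tolerance=1e-4, seed=None):
    rng = np.random.default_rng(seed)
    # "random" initialization: use k distinct input rows as starting centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iters):
        ids = assign(data, centroids)
        updated = centroids.copy()
        for c in range(k):
            members = data[ids == c]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                updated[c] = members.mean(axis=0)
        # Assumption: "relative" means the Frobenius norm of the change,
        # divided by the Frobenius norm of the previous centroids.
        shift = np.linalg.norm(updated - centroids)
        scale = max(np.linalg.norm(centroids), 1e-12)
        centroids = updated
        if shift / scale <= tolerance:
            break
    return centroids


data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids = kmeans_sketch(data, k=2, seed=0)
cluster_ids = assign(data, centroids)
```

On this toy input the two groups of three points end up in separate clusters, with centroids at the mean of each group.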

Module contents

lance.torch.preferred_device(device: str | None = None)

Get the preferred device for computation.

Parameters:

device (str, optional) – Device to use for computation. If None, the device will be detected automatically based on the platform.

Returns:

device – Device to use for computation.

Return type:

torch.device