lance.torch package

Submodules

lance.torch.async_dataset module

class lance.torch.async_dataset.AsyncDataset(dataset_creator: Callable[[], IterableDataset], *, queue_size: int = 4)

Bases: IterableDataset

close()

lance.torch.async_dataset.async_dataset(dataset_creator: Callable[[], IterableDataset], *, queue_size: int = 4) → Iterable[AsyncDataset]
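The signature above takes a factory (dataset_creator) and a queue_size, which suggests background read-ahead through a bounded queue. The following is a minimal standard-library sketch of that general pattern, not Lance's actual implementation; the names prefetch and _DONE are hypothetical.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator

_DONE = object()  # hypothetical sentinel marking the end of the stream


def prefetch(creator: Callable[[], Iterable], queue_size: int = 4) -> Iterator:
    """Yield items from creator() while a background thread reads ahead."""
    buf: queue.Queue = queue.Queue(maxsize=queue_size)

    def worker() -> None:
        # The iterable is created inside the worker thread.
        # put() blocks once queue_size items are waiting, which bounds memory.
        try:
            for item in creator():
                buf.put(item)
        finally:
            # Always signal the end, even if the producer raised.
            buf.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item


# Stand-in for a dataset: any zero-argument callable returning an iterable.
batches = list(prefetch(lambda: range(5), queue_size=2))
```

A larger queue_size lets the producer run further ahead of the consumer at the cost of holding more items in memory.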

lance.torch.bench_utils module

Benchmark Utilities built on PyTorch

lance.torch.bench_utils.ground_truth(ds: LanceDataset, column: str, query: Tensor | ndarray, metric_type: str = 'L2', k: int = 100, batch_size: int = 10240, device: str | None = None) → Tensor

Find ground truth from dataset.

Parameters:
  • ds (LanceDataset) – The dataset to test.

  • column (str) – The name of the vector column.

  • query (2-D vectors) – 2-D query vectors with shape [N, dimension].

  • k (int) – The number of the nearest vectors to collect for each query vector.

  • metric_type (str) – Metric used to compute distance; accepts L2 or cosine.

  • batch_size (int) – Batch size to read from the input dataset.

Returns:

A 2-D array of row_ids of the nearest vectors for each query vector.
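As an illustration of what "ground truth" means here, the sketch below does an exhaustive nearest-neighbour search in plain Python, reading rows in batches and keeping the k closest row IDs per query. It is an assumption-laden stand-in, not the library's code: brute_force_ground_truth, l2_squared, and cosine are hypothetical names, and rows are modelled as (row_id, vector) pairs.

```python
import heapq
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def l2_squared(a: Vector, b: Vector) -> float:
    # Squared L2 gives the same ranking as L2 and avoids the square root.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    # Assumes non-zero vectors; a zero vector would divide by zero.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_ground_truth(
    rows: List[Tuple[int, Vector]],
    queries: List[Vector],
    k: int,
    metric_type: str = "L2",
    batch_size: int = 2,
) -> List[List[int]]:
    dist = l2_squared if metric_type == "L2" else cosine
    best: List[List[Tuple[float, int]]] = [[] for _ in queries]
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        for qi, q in enumerate(queries):
            # Merge the running top-k with this batch, then keep the k smallest.
            candidates = best[qi] + [(dist(q, v), rid) for rid, v in batch]
            best[qi] = heapq.nsmallest(k, candidates)
    return [[rid for _, rid in top] for top in best]


rows = [(10, [0.0, 0.0]), (11, [1.0, 0.0]), (12, [0.0, 3.0]), (13, [5.0, 5.0])]
queries = [[0.9, 0.1], [0.0, 2.5]]
truth = brute_force_ground_truth(rows, queries, k=2)
```

Each inner list holds row IDs ordered from nearest to farthest, one list per query vector.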

lance.torch.data module

Read Lance dataset as torch DataPipe.

class lance.torch.data.LanceDataset(dataset: torch.utils.data.Dataset | str | Path, batch_size: int, *args, columns: List[str] | Dict[str, str] | None = None, filter: str | None = None, samples: int | None = 0, cache: str | bool | None = None, with_row_id: bool = False, rank: int | None = None, world_size: int | None = None, shard_granularity: Literal['fragment', 'batch'] | None = None, batch_readahead: int = 16, to_tensor_fn: callable[[pa.RecordBatch], dict[str, torch.Tensor] | torch.Tensor] | None = None, sampler: Sampler | None = None, **kwargs)

Bases: IterableDataset

A PyTorch torch.utils.data.IterableDataset over a Lance dataset.

lance.torch.dist module

PyTorch Distributed Utilities

lance.torch.dist.get_dist_rank() → int

Get the rank of the current process in the distributed training setup.

Returns:

int: The rank of the current process if distributed training is initialized, otherwise 0.

lance.torch.dist.get_dist_world_size() → int

Get the number of processes in the distributed training setup.

Returns:

int: The number of distributed processes if distributed training is initialized, otherwise 1.

lance.torch.dist.get_global_rank() → int

Get the global rank of the current process across distributed and multiprocessing contexts.

Returns:

int: The global rank of the current process.

lance.torch.dist.get_global_world_size() → int

Get the global world size across distributed and multiprocessing contexts.

Returns:

int: The global world size, defaulting to 1 if not set in the environment.

lance.torch.dist.get_mp_rank() → int

Get the rank of the current DataLoader worker process.

Returns:

int: The rank of the current DataLoader worker if running in a worker process, otherwise 0.

lance.torch.dist.get_mp_world_size() → int

Get the number of worker processes for the current DataLoader.

Returns:

int: The number of worker processes if running in a DataLoader worker, otherwise 1.
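The descriptions above do not say how the distributed rank and the DataLoader worker rank are combined into a global rank. One common convention, shown here purely as an assumption and not as Lance's confirmed formula, gives each training process a contiguous block of worker slots. The helper names combine_rank and combine_world_size are hypothetical.

```python
def combine_rank(dist_rank: int, mp_rank: int, mp_world_size: int) -> int:
    # Assumed layout: process 0 owns slots [0, mp_world_size),
    # process 1 owns the next mp_world_size slots, and so on.
    return dist_rank * mp_world_size + mp_rank


def combine_world_size(dist_world_size: int, mp_world_size: int) -> int:
    # Total number of readers across all processes and their workers.
    return dist_world_size * mp_world_size


# Example: 2 training processes, each with 3 DataLoader workers.
ranks = [combine_rank(d, w, mp_world_size=3) for d in range(2) for w in range(3)]
total = combine_world_size(dist_world_size=2, mp_world_size=3)
```

Under this convention every (process, worker) pair gets a unique rank in the range 0 to total - 1, which is what a sharded reader needs to avoid handing the same data to two workers.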

lance.torch.distance module

lance.torch.distance.cosine_distance(vectors: Tensor, centroids: Tensor) → Tuple[Tensor, Tensor]

Cosine pair-wise distances between two 2-D Tensors.

Cosine distance = 1 - |xy| / (||x|| * ||y||)

Parameters:
  • vectors (torch.Tensor) – A 2-D [N, D] tensor

  • centroids (torch.Tensor) – A 2-D [M, D] tensor

Returns:

  • A tuple of Tensors: the centroid IDs, and the distances to those centroids.

  • A 2-D [N, M] tensor of cosine distances between x and y
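To make the formula and the (centroid id, distance) return shape concrete, here is a small plain-Python sketch. It treats the xy term as the ordinary dot product, which is an assumption about the notation, and it is not the library's tensor implementation; cosine_dist and nearest_centroid are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_dist(x: Sequence[float], y: Sequence[float]) -> float:
    # 1 - (x . y) / (||x|| * ||y||); assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def nearest_centroid(
    vectors: List[Sequence[float]], centroids: List[Sequence[float]]
) -> Tuple[List[int], List[float]]:
    ids: List[int] = []
    dists: List[float] = []
    for v in vectors:
        row = [cosine_dist(v, c) for c in centroids]  # one row of the [N, M] matrix
        best = min(range(len(row)), key=row.__getitem__)
        ids.append(best)
        dists.append(row[best])
    return ids, dists


vectors = [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
ids, dists = nearest_centroid(vectors, centroids)
```

Note that [0.0, 2.0] has distance 0 to the centroid [0.0, 1.0]: cosine distance depends only on direction, not magnitude.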

lance.torch.distance.l2_distance(vectors: Tensor, centroids: Tensor, y2: Tensor | None = None) → Tuple[Tensor, Tensor]

Pair-wise L2 / Euclidean distance between two 2-D Tensors.

Parameters:
  • vectors (torch.Tensor) – A 2-D [N, D] tensor

  • centroids (torch.Tensor) – A 2-D [M, D] tensor

Returns:

A tuple of Tensors: the centroid IDs, and the distances to those centroids.
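The y2 parameter is not described above. A plausible reading, assumed here and not confirmed by the source, is that it carries precomputed squared norms of the centroids so they can be reused across calls, based on the expansion ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2. The sketch below shows that identity in plain Python and returns squared distances; all function names are hypothetical.

```python
from typing import List, Optional, Sequence


def squared_norms(rows: List[Sequence[float]]) -> List[float]:
    return [sum(v * v for v in row) for row in rows]


def pairwise_l2_squared(
    x: List[Sequence[float]],
    y: List[Sequence[float]],
    y2: Optional[List[float]] = None,
) -> List[List[float]]:
    # Reuse precomputed squared norms of y when the caller supplies them.
    if y2 is None:
        y2 = squared_norms(y)
    x2 = squared_norms(x)
    result = []
    for i, xi in enumerate(x):
        row = []
        for j, yj in enumerate(y):
            dot = sum(a * b for a, b in zip(xi, yj))
            row.append(x2[i] - 2 * dot + y2[j])
        result.append(row)
    return result


x = [[1, 2], [0, 0]]
y = [[1, 0], [4, 6]]
y_norms = squared_norms(y)  # computed once, reusable for every batch of x
d = pairwise_l2_squared(x, y, y2=y_norms)
```

When the same centroids are compared against many batches of vectors, computing their squared norms once avoids repeating that work per batch.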

lance.torch.distance.pairwise_cosine(x: Tensor, y: Tensor, *, y2: Tensor | None = None) → Tensor

Compute pair-wise cosine distance between x and y.

Parameters:
  • x (torch.Tensor) – A 2-D [M, D] tensor, containing M vectors.

  • y (torch.Tensor) – A 2-D [N, D] tensor, containing N vectors.

Returns:

A [M, N] tensor with pair-wise cosine distances between x and y.

lance.torch.kmeans module

class lance.torch.kmeans.KMeans(k: int, *, metric: Literal['l2', 'euclidean', 'cosine', 'dot'] = 'l2', init: Literal['random'] = 'random', max_iters: int = 50, tolerance: float = 0.0001, centroids: Tensor | None = None, seed: int | None = None, device: str | None = None)

Bases: object

K-Means trains over vectors and divides them into K clusters.

This implementation is built on PyTorch, supporting CPU, GPU, and Apple Silicon GPU.

Parameters:
  • k (int) – The number of clusters

  • metric (str) – Metric type; supports “l2”, “euclidean”, “cosine”, or “dot”.

  • init (str) – Initialization method. Only “random” is supported currently.

  • max_iters (int) – Maximum number of iterations to train the k-means model.

  • tolerance (float) – Relative tolerance, with regard to the Frobenius norm of the difference between the cluster centers of two consecutive iterations, used to declare convergence.

  • centroids (torch.Tensor, optional) – Provide existing centroids.

  • seed (int, optional) – Random seed

  • device (str, optional) – The device on which to run the PyTorch algorithms. By default, the most performant device on the host is picked. See lance.torch.preferred_device()
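The tolerance parameter can be read as a stopping test on how far the centroids moved between two iterations. The sketch below is one way to express that test; normalising by the Frobenius norm of the previous centroids is an assumption, since the exact scaling is not stated above, and frobenius and has_converged are hypothetical names.

```python
import math
from typing import List


def frobenius(matrix: List[List[float]]) -> float:
    # Square root of the sum of squared entries.
    return math.sqrt(sum(v * v for row in matrix for v in row))


def has_converged(
    old: List[List[float]], new: List[List[float]], tolerance: float = 1e-4
) -> bool:
    diff = [[a - b for a, b in zip(r_old, r_new)] for r_old, r_new in zip(old, new)]
    # Assumed scaling: movement is measured relative to the old centroids' norm.
    return frobenius(diff) <= tolerance * frobenius(old)


old_centroids = [[1.0, 0.0], [0.0, 1.0]]
barely_moved = [[1.00001, 0.0], [0.0, 1.0]]
moved_a_lot = [[1.5, 0.0], [0.0, 1.0]]
stop_now = has_converged(old_centroids, barely_moved)
keep_going = not has_converged(old_centroids, moved_a_lot)
```

Training would stop at whichever comes first: this test passing, or max_iters iterations being reached.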

fit(data: IterableDataset | ndarray | Tensor | FixedSizeListArray, column: str | None = None) → None

Fit - Train the kmeans model.

Parameters:

data (pa.FixedSizeListArray, np.ndarray, or torch.Tensor) – 2-D vectors to train kmeans.

rebuild_index()

transform(data: Array | ndarray | Tensor) → Tensor

Transform the input data to cluster ids for each row.
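For readers unfamiliar with the fit/transform split, the following plain-Python sketch runs a basic Lloyd-style k-means with random initialisation and L2 distance: fit learns the centroids, transform maps each row to the ID of its nearest centroid. It illustrates the general algorithm under those assumptions and is not the class's PyTorch implementation; the function names are hypothetical and the tolerance check is omitted for brevity.

```python
import random
from typing import List, Optional, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def transform(data: List[List[float]], centroids: List[List[float]]) -> List[int]:
    # Cluster ID for each row: index of the nearest centroid.
    return [
        min(range(len(centroids)), key=lambda j: sq_dist(v, centroids[j]))
        for v in data
    ]


def fit(
    data: List[List[float]], k: int, max_iters: int = 50, seed: Optional[int] = None
) -> List[List[float]]:
    rng = random.Random(seed)
    # "random" init: pick k distinct rows as the starting centroids.
    centroids = [list(v) for v in rng.sample(data, k)]
    for _ in range(max_iters):
        ids = transform(data, centroids)
        for j in range(k):
            members = [v for v, i in zip(data, ids) if i == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids = fit(data, k=2, max_iters=10, seed=0)
cluster_ids = transform(data, centroids)
```

Which cluster receives ID 0 depends on the random initialisation, so only the grouping, not the specific IDs, is meaningful.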

Module contents

lance.torch.preferred_device(device: str | None = None)

Get the preferred device for computation.

Parameters:

device (str, optional) – Device to use for computation. If None, the device will be detected automatically based on the platform.

Returns:

device – Device to use for computation.

Return type:

torch.device
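The exact detection order is not documented above. A typical order, assumed here, is CUDA first, then Apple's MPS backend, then CPU. The sketch below uses only torch calls that exist in recent PyTorch releases, returns a device string rather than a torch.device, and falls back to "cpu" if torch is not installed; pick_device is a hypothetical name, not the library function.

```python
from typing import Optional


def pick_device(device: Optional[str] = None) -> str:
    # An explicit choice always wins.
    if device is not None:
        return device
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent in older PyTorch builds, hence getattr.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


explicit = pick_device("cpu")
detected = pick_device()
```

The returned string can be passed to torch.device(...) or to the device parameter of KMeans shown earlier.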