lance.torch package¶
Submodules¶
lance.torch.async_dataset module¶
- class lance.torch.async_dataset.AsyncDataset(dataset_creator: Callable[[], IterableDataset], *, queue_size: int = 4)¶
Bases:
IterableDataset
- close()¶
- lance.torch.async_dataset.async_dataset(dataset_creator: Callable[[], IterableDataset], *, queue_size: int = 4) Iterable[AsyncDataset] ¶
lance.torch.bench_utils module¶
Benchmark Utilities built on PyTorch
- lance.torch.bench_utils.ground_truth(ds: LanceDataset, column: str, query: Tensor | ndarray, metric_type: str = 'L2', k: int = 100, batch_size: int = 10240, device: str | None = None) Tensor ¶
Find ground truth from dataset.
- Parameters:
ds (LanceDataset) – The dataset to test.
column (str) – The name of the vector column.
query (2-D vectors) – 2-D query vectors with shape [N, dimension].
k (int) – The number of the nearest vectors to collect for each query vector.
metric_type (str) – Metric type. How to compute distance, accepts L2 or cosine.
batch_size (int) – Batch size to read from the input dataset.
- Return type:
a 2-D array of row_ids for the nearest vectors from each query vector.
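To make the notion of "ground truth" concrete, the following is a minimal NumPy sketch of an exhaustive (brute-force) nearest-neighbor search that reads the vectors in batches and keeps a running top-k per query. It is not the library's implementation: `brute_force_knn` is a hypothetical helper, it only covers the L2 metric, and it returns positional indices into an in-memory array rather than Lance row ids.

```python
import numpy as np


def brute_force_knn(vectors, query, k, batch_size=10240):
    """For each query row, return the indices of its k nearest vectors (L2)."""
    n_query = query.shape[0]
    # Running top-k, empty before the first batch is processed.
    best_dist = np.full((n_query, 0), np.inf)
    best_idx = np.zeros((n_query, 0), dtype=np.int64)
    query_sq = (query ** 2).sum(axis=1, keepdims=True)

    for start in range(0, vectors.shape[0], batch_size):
        batch = vectors[start:start + batch_size]
        # Squared L2 distance via ||q||^2 - 2 q.b + ||b||^2, shape [N, batch].
        dist = query_sq - 2.0 * query @ batch.T + (batch ** 2).sum(axis=1)
        idx = np.broadcast_to(np.arange(start, start + batch.shape[0]), dist.shape)
        # Merge this batch with the running top-k and keep the k smallest.
        all_dist = np.concatenate([best_dist, dist], axis=1)
        all_idx = np.concatenate([best_idx, idx], axis=1)
        order = np.argsort(all_dist, axis=1)[:, :k]
        best_dist = np.take_along_axis(all_dist, order, axis=1)
        best_idx = np.take_along_axis(all_idx, order, axis=1)
    return best_idx


rng = np.random.default_rng(0)
vectors = rng.normal(size=(50, 4))
query = vectors[[3, 17]]  # two queries copied from the data itself
neighbors = brute_force_knn(vectors, query, k=5, batch_size=16)
```

Processing in batches keeps the intermediate distance matrix at `[N, batch_size]`, which is the role `batch_size` plays in the signature above.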
lance.torch.data module¶
Read Lance dataset as torch DataPipe.
- class lance.torch.data.LanceDataset(dataset: Dataset | str | Path, batch_size: int, *args, columns: List[str] | Dict[str, str] | None = None, filter: str | None = None, samples: int | None = 0, cache: str | bool | None = None, with_row_id: bool = False, rank: int | None = None, world_size: int | None = None, shard_granularity: Literal['fragment', 'batch'] | None = None, batch_readahead: int = 16, to_tensor_fn: Callable[[RecordBatch], dict[str, torch.Tensor] | Tensor] | None = None, sampler: Sampler | None = None, **kwargs)¶
Bases:
IterableDataset
PyTorch torch.utils.data.IterableDataset over a Lance dataset.
- property schema: Schema¶
lance.torch.dist module¶
PyTorch Distributed Utilities
- lance.torch.dist.get_dist_rank() int ¶
Get the rank of the current process in the distributed training setup.
- Returns:
- int: The rank of the current process if distributed training is initialized,
otherwise 0.
- lance.torch.dist.get_dist_world_size() int ¶
Get the number of processes in the distributed training setup.
- Returns:
- int: The number of distributed processes if distributed training is initialized,
otherwise 1.
- lance.torch.dist.get_global_rank() int ¶
Get the global rank of the current process across distributed and multiprocessing contexts.
- Returns:
int: The global rank of the current process.
- lance.torch.dist.get_global_world_size() int ¶
Get the global world size across distributed and multiprocessing contexts.
- Returns:
int: The global world size, defaulting to 1 if not set in the environment.
- lance.torch.dist.get_mp_rank() int ¶
Get the rank of the current DataLoader worker process.
- Returns:
- int: The rank of the current DataLoader worker if running in a worker process,
otherwise 0.
- lance.torch.dist.get_mp_world_size() int ¶
Get the number of worker processes for the current DataLoader.
- Returns:
- int: The number of worker processes if running in a DataLoader worker,
otherwise 1.
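The reference above does not state how the "global" values are derived from the distributed and DataLoader-worker values. One common convention, shown here purely as an assumption, is to give every (training process, worker) pair a unique index by treating the worker rank as the fast-changing coordinate. The helper names below are hypothetical and take plain integers, so the sketch runs without PyTorch.

```python
def global_rank(dist_rank: int, mp_rank: int, mp_world_size: int) -> int:
    """Assumed layout: workers of process 0 first, then workers of process 1, ..."""
    return dist_rank * mp_world_size + mp_rank


def global_world_size(dist_world_size: int, mp_world_size: int) -> int:
    """Total number of readers across all processes and their workers."""
    return dist_world_size * mp_world_size


# Two training processes, each running three DataLoader workers.
ranks = [global_rank(d, w, mp_world_size=3) for d in range(2) for w in range(3)]
total = global_world_size(2, 3)
```

Under this layout each of the six readers gets a distinct rank in `range(total)`, which is what a sharded reader needs to avoid handing out the same data twice.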
lance.torch.distance module¶
- lance.torch.distance.cosine_distance(vectors: Tensor, centroids: Tensor) Tuple[Tensor, Tensor] ¶
Cosine pair-wise distances between two 2-D Tensors.
Cosine distance =
1 - (x · y) / (||x|| * ||y||)
- Parameters:
vectors (torch.Tensor) – A 2-D [N, D] tensor
centroids (torch.Tensor) – A 2-D [M, D] tensor
- Returns:
A tuple of Tensors, for centroids id, and distance to the centroid.
A 2-D [N, M] tensor of cosine distances between x and y
- lance.torch.distance.l2_distance(vectors: Tensor, centroids: Tensor, y2: Tensor | None = None) Tuple[Tensor, Tensor] ¶
Pair-wise L2 / Euclidean distance between two 2-D Tensors.
- Parameters:
vectors (torch.Tensor) – A 2-D [N, D] tensor
centroids (torch.Tensor) – A 2-D [M, D] tensor
- Return type:
A tuple of Tensors, for centroids id, and distance to the centroids.
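As an illustration of the "centroid id plus distance" return shape, here is a small NumPy sketch that assigns each vector to its nearest centroid under the Euclidean metric. `nearest_centroid_l2` is a hypothetical name, and whether the library reports squared or non-squared distances is not stated above, so the non-squared form used here is an assumption.

```python
import numpy as np


def nearest_centroid_l2(vectors, centroids):
    """Return (ids, distances) of the closest centroid for every vector."""
    # Broadcast to [N, M, D], then reduce over D to get [N, M] distances.
    diff = vectors[:, None, :] - centroids[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    ids = dist.argmin(axis=1)
    return ids, dist[np.arange(len(vectors)), ids]


vectors = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]])
centroids = np.array([[0.0, 1.0], [9.0, 0.0]])
ids, dists = nearest_centroid_l2(vectors, centroids)
# ids is [0, 0, 1]; the first and last vectors are at distance 1.0
# from their nearest centroid.
```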
- lance.torch.distance.pairwise_cosine(x: Tensor, y: Tensor, *, y2: Tensor | None = None) Tensor ¶
Compute pair-wise cosine distance between x and y.
- Parameters:
x (torch.Tensor) – A 2-D [M, D] tensor, containing M vectors.
y (torch.Tensor) – A 2-D [N, D] tensor, containing N vectors.
- Return type:
An [M, N] tensor with pair-wise cosine distances between x and y.
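The shape contract (`[M, D]` and `[N, D]` in, `[M, N]` out) can be sketched in NumPy as follows. This uses the usual definition of cosine distance, one minus the dot product of the normalized vectors; `pairwise_cosine_np` is a hypothetical helper, not the library function, and it does not guard against zero-length vectors.

```python
import numpy as np


def pairwise_cosine_np(x, y):
    """Cosine distance between every row of x and every row of y."""
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    # [M, D] @ [D, N] gives the [M, N] matrix of cosine similarities.
    return 1.0 - x_unit @ y_unit.T


x = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
d = pairwise_cosine_np(x, y)
# d[0] is approximately [0.0, 1.0, 2.0]: identical, orthogonal, opposite.
```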
lance.torch.kmeans module¶
- class lance.torch.kmeans.KMeans(k: int, *, metric: Literal['l2', 'euclidean', 'dot', 'cosine'] = 'l2', init: Literal['random'] = 'random', max_iters: int = 50, tolerance: float = 0.0001, centroids: Tensor | None = None, seed: int | None = None, device: str | None = None)¶
Bases:
object
K-Means trains over vectors and divides them into K clusters.
This implementation is built on PyTorch, supporting CPU, GPU and Apple Silicon GPU.
- Parameters:
k (int) – The number of clusters
metric (str) – Metric type; supports “l2”, “cosine” or “dot”.
init (str) – Initialization method. Only “random” is supported for now.
max_iters (int) – Maximum number of iterations used to train the k-means model.
tolerance (float) – Relative tolerance in regard to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.
centroids (torch.Tensor, optional) – Provide existing centroids.
seed (int, optional) – Random seed
device (str, optional) – The device on which to run the PyTorch algorithms. By default, the most performant device on the host is picked. See lance.torch.preferred_device()
- fit(data: IterableDataset | ndarray | Tensor | FixedSizeListArray, column: str | None = None) None ¶
Fit - Train the kmeans model.
- Parameters:
data (pa.FixedSizeListArray, np.ndarray, or torch.Tensor) – 2-D vectors to train kmeans.
- rebuild_index()¶
- transform(data: Array | ndarray | Tensor) Tensor ¶
Transform the input data to cluster ids for each row.
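To show how `k`, `max_iters`, `tolerance` and `seed` interact, here is a compact NumPy sketch of Lloyd's algorithm with random initialization and a relative Frobenius-norm stopping rule. It is an illustration under assumptions, not the class above: `kmeans_l2` and `assign` are hypothetical names, only the L2 metric is covered, and the exact form of the relative tolerance check is one plausible reading of the parameter description.

```python
import numpy as np


def assign(data, centroids):
    """Cluster id of the nearest centroid (squared L2) for each row."""
    dist = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)


def kmeans_l2(data, k, max_iters=50, tolerance=1e-4, seed=None):
    rng = np.random.default_rng(seed)
    # "random" init: pick k distinct rows as the starting centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(max_iters):
        labels = assign(data, centroids)
        new_centroids = centroids.copy()
        for c in range(k):
            members = data[labels == c]
            if len(members) > 0:  # keep the old centroid for an empty cluster
                new_centroids[c] = members.mean(axis=0)
        # Frobenius norm of the change in centroids between iterations.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift <= tolerance * max(np.linalg.norm(centroids), 1e-12):
            break
    return centroids


rng = np.random.default_rng(42)
data = np.concatenate([
    rng.normal(0.0, 1.0, size=(20, 2)),    # blob around (0, 0)
    rng.normal(100.0, 1.0, size=(20, 2)),  # blob around (100, 100)
])
centroids = kmeans_l2(data, k=2, seed=7)
labels = assign(data, centroids)
```

In this sketch `kmeans_l2` corresponds loosely to `fit` and `assign` to `transform`: the first learns centroids, the second maps rows to cluster ids.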
Module contents¶
- lance.torch.preferred_device(device: str | None = None)¶
Get the preferred device for computation.
- Parameters:
device (str, optional) – Device to use for computation. If None, the device will be detected automatically based on the platform.
- Returns:
device – Device to use for computation.
- Return type:
torch.device