lance.tf package

Submodules

lance.tf.data module

TensorFlow Dataset (tf.data) implementation for Lance.

Warning

Experimental feature. API stability is not guaranteed.

lance.tf.data.arrow_data_type_to_tf(dt: DataType) DType

Convert a PyArrow DataType to a TensorFlow DType.
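
Examples

A minimal sketch, assuming lance is installed with TensorFlow support:

import pyarrow as pa

from lance.tf.data import arrow_data_type_to_tf

tf_dtype = arrow_data_type_to_tf(pa.float32())  # expected: tf.float32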

lance.tf.data.column_to_tensor(array: Array, tensor_spec: TensorSpec) Tensor

Convert a PyArrow array into a TensorFlow tensor.
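
Examples

A minimal sketch; the array values and spec are illustrative:

import pyarrow as pa
import tensorflow as tf

from lance.tf.data import column_to_tensor

arr = pa.array([1.0, 2.0, 3.0], type=pa.float32())
spec = tf.TensorSpec(shape=(None,), dtype=tf.float32)
tensor = column_to_tensor(arr, spec)  # a tf.Tensor with dtype float32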

lance.tf.data.data_type_to_tensor_spec(dt: DataType) TensorSpec

Convert a PyArrow DataType to a TensorFlow TensorSpec.
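
Examples

A minimal sketch; the exact shape of the returned spec depends on the input type:

import pyarrow as pa

from lance.tf.data import data_type_to_tensor_spec

spec = data_type_to_tensor_spec(pa.int64())  # dtype should be tf.int64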

lance.tf.data.from_lance(dataset: str | Path | LanceDataset, *, columns: List[str] | Dict[str, str] | None = None, batch_size: int = 256, filter: str | None = None, fragments: Iterable[int] | Iterable[LanceFragment] | tf.data.Dataset = None, output_signature: Dict[str, tf.TypeSpec] | None = None) tf.data.Dataset

Create a tf.data.Dataset from a Lance dataset.

Parameters:
  • dataset (Union[str, Path, LanceDataset]) – Lance dataset or dataset URI/path.

  • columns (Optional[Union[List[str], Dict[str, str]]], optional) – List of columns to include in the output dataset, or a mapping from output column name to a SQL expression. If not set, all columns will be read.

  • batch_size (int, optional) – Batch size, by default 256

  • filter (Optional[str], optional) – SQL filter expression, by default None.

  • fragments (Union[Iterable[int], Iterable[LanceFragment], tf.data.Dataset], optional) – If provided, only these fragments are read. This can be used to feed distributed training.

  • output_signature (Optional[Dict[str, tf.TypeSpec]], optional) – Override the output signature of the returned tensors. If not provided, the output signature is inferred from the projection schema.

Examples

import tensorflow as tf
import lance.tf.data

ds = lance.tf.data.from_lance(
    "s3://bucket/path",
    columns=["image", "id"],
    filter="catalog = 'train' AND split = 'train'",
    batch_size=100)

for batch in ds.repeat(10).shuffle(128).map(io_decode):
    print(batch["image"].shape)

from_lance can take an iterator or a tf.data.Dataset of fragments, so it can be used to feed distributed training.

import tensorflow as tf
import lance.tf.data

seed = 200  # seed for shuffling fragments consistently across distributed workers
fragments = (
    lance.tf.data.lance_fragments("s3://bucket/path")
    .repeat(10)
    .shuffle(4, seed=seed)
)
ds = lance.tf.data.from_lance(
    "s3://bucket/path",
    columns=["image", "id"],
    filter="catalog = 'train' AND split = 'train'",
    fragments=fragments,
    batch_size=100)
for batch in ds.shuffle(128).map(io_decode):
    print(batch["image"].shape)

lance.tf.data.from_lance_batches(dataset: str | Path | LanceDataset, *, shuffle: bool = False, seed: int | None = None, batch_size: int = 1024, skip: int = 0) tf.data.Dataset

Create a tf.data.Dataset of batch indices for a Lance dataset.

Parameters:
  • dataset (Union[str, Path, LanceDataset]) – A Lance Dataset or dataset URI/path.

  • shuffle (bool, optional) – Shuffle the batches, by default False

  • seed (Optional[int], optional) – Random seed for shuffling, by default None

  • batch_size (int, optional) – Batch size, by default 1024

  • skip (int, optional) – Number of batches to skip.

Returns:

A TensorFlow dataset of batch slice ranges. These can be passed to lance_take_batches() to create a TensorFlow dataset of batches.

Return type:

tf.data.Dataset
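
Examples

A minimal sketch, assuming each element of the returned dataset is a batch slice range that lance_take_batches() can consume:

from lance.tf.data import from_lance_batches

batches = from_lance_batches("s3://bucket/path", batch_size=1024, shuffle=True, seed=42)
for batch_range in batches.as_numpy_iterator():
    print(batch_range)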

lance.tf.data.lance_fragments(dataset: str | Path | LanceDataset) tf.data.Dataset

Create a tf.data.Dataset of Lance Fragments in the dataset.

Parameters:

dataset (Union[str, Path, LanceDataset]) – A Lance Dataset or dataset URI/path.
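
Examples

A minimal sketch; the shard count and seed are illustrative, e.g. for splitting fragments across distributed workers:

from lance.tf.data import lance_fragments

fragments = lance_fragments("s3://bucket/path")
# Deterministically shuffle, then take this worker's shard (worker 0 of 4).
fragments = fragments.shuffle(16, seed=42).shard(num_shards=4, index=0)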

lance.tf.data.lance_take_batches(dataset: str | Path | LanceDataset, batch_ranges: Iterable[Tuple[int, int]], *, columns: List[str] | Dict[str, str] | None = None, output_signature: Dict[str, tf.TypeSpec] | None = None, batch_readahead: int = 10) tf.data.Dataset

Create a tf.data.Dataset of batches from a Lance dataset.

Parameters:
  • dataset (Union[str, Path, LanceDataset]) – A Lance Dataset or dataset URI/path.

  • batch_ranges (Iterable[Tuple[int, int]]) – Iterable of batch indices.

  • columns (Optional[Union[List[str], Dict[str, str]]], optional) – List of columns to include in the output dataset, or a mapping from output column name to a SQL expression. If not set, all columns will be read.

  • output_signature (Optional[Dict[str, tf.TypeSpec]], optional) – Override the output signature of the returned tensors. If not provided, the output signature is inferred from the projection schema.

  • batch_readahead (int, default 10) – The number of batches to read ahead in parallel.

Examples

You can compose this with from_lance_batches() to create a randomized TensorFlow dataset. With from_lance_batches(), you can deterministically randomize the batches by setting the seed.

from lance.tf.data import from_lance_batches, lance_take_batches

batch_iter = from_lance_batches(dataset, batch_size=100, shuffle=True, seed=200)
batch_iter = batch_iter.as_numpy_iterator()
lance_ds = lance_take_batches(dataset, batch_iter)
lance_ds = lance_ds.unbatch().shuffle(500, seed=42).batch(100)

lance.tf.data.schema_to_spec(schema: Schema) TypeSpec

Convert a PyArrow Schema to a TensorFlow output signature.
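
Examples

A minimal sketch; the schema fields are illustrative, and the result corresponds to the output signature described above:

import pyarrow as pa

from lance.tf.data import schema_to_spec

schema = pa.schema([("id", pa.int64()), ("image", pa.binary())])
signature = schema_to_spec(schema)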

lance.tf.tfrecord module

lance.tf.tfrecord.infer_tfrecord_schema(uri, *, tensor_features=None, string_features=None, num_rows=None)

Infer an Arrow schema from a TFRecord file.

Parameters:
  • uri (str) – URI of the TFRecord file.

  • tensor_features (Optional[List[str]]) – Names of features that should be treated as tensors. Currently only fixed-shape tensors are supported.

  • string_features (Optional[List[str]]) – Names of features that should be treated as strings. Otherwise they will be treated as binary.

  • num_rows (Optional[int], default None) – Number of records to read to infer the schema. If None, the entire file will be read.

Returns:

An Arrow schema inferred from the TFRecord file. The schema is sorted alphabetically by field name, since TFRecord has no concept of field order.

Return type:

pyarrow.Schema
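
Examples

A minimal sketch; the file path and feature names are hypothetical:

from lance.tf.tfrecord import infer_tfrecord_schema

schema = infer_tfrecord_schema(
    "data/train.tfrecord",          # hypothetical path
    tensor_features=["embedding"],  # treat as a fixed-shape tensor
    string_features=["label"],      # treat as a string rather than binary
    num_rows=1000,                  # sample only the first 1000 records
)
print(schema)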

lance.tf.tfrecord.read_tfrecord(uri, schema, *, batch_size=10000)

Read a TFRecord file as an Arrow stream.

Parameters:
  • uri (str) – URI of the TFRecord file.

  • schema (pyarrow.Schema) – Arrow schema of the TFRecord file. Use infer_tfrecord_schema() to infer the schema. The schema may contain only a subset of fields; the reader will parse only the fields present in the schema.

  • batch_size (int, default 10000) – Number of records to read per batch.

Returns:

An Arrow reader, which can be passed directly to lance.write_dataset(). The output schema will match the schema provided, including field order.

Return type:

pyarrow.RecordBatchReader
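
Examples

A minimal end-to-end sketch with hypothetical paths, converting a TFRecord file into a Lance dataset as described above:

import lance

from lance.tf.tfrecord import infer_tfrecord_schema, read_tfrecord

schema = infer_tfrecord_schema("data/train.tfrecord")
reader = read_tfrecord("data/train.tfrecord", schema)
lance.write_dataset(reader, "data/train.lance")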
