lance.tf package

Submodules

lance.tf.data module

TensorFlow Dataset (tf.data) implementation for Lance.

Warning

Experimental feature. API stability is not guaranteed.

lance.tf.data.arrow_data_type_to_tf(dt: DataType) DType

Convert a PyArrow DataType to a TensorFlow DType.
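
Examples

A minimal sketch, assuming lance is installed with TensorFlow support:

import pyarrow as pa

from lance.tf.data import arrow_data_type_to_tf

tf_dtype = arrow_data_type_to_tf(pa.float32())  # expected: tf.float32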

lance.tf.data.column_to_tensor(array: Array, tensor_spec: TensorSpec) Tensor

Convert a PyArrow array into a TensorFlow tensor.
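
Examples

A minimal sketch; the array values and spec are illustrative:

import pyarrow as pa
import tensorflow as tf

from lance.tf.data import column_to_tensor

arr = pa.array([1.0, 2.0, 3.0], type=pa.float32())
spec = tf.TensorSpec(shape=(None,), dtype=tf.float32)
tensor = column_to_tensor(arr, spec)  # a tf.Tensor with dtype float32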

lance.tf.data.data_type_to_tensor_spec(dt: DataType) TensorSpec

Convert a PyArrow DataType to a TensorFlow TensorSpec.
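
Examples

A minimal sketch; the exact shape of the returned spec depends on the input type:

import pyarrow as pa

from lance.tf.data import data_type_to_tensor_spec

spec = data_type_to_tensor_spec(pa.int64())  # dtype should be tf.int64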

lance.tf.data.from_lance(dataset: str | Path | LanceDataset, *, columns: List[str] | Dict[str, str] | None = None, batch_size: int = 256, filter: str | None = None, fragments: Iterable[int] | Iterable[LanceFragment] | tf.data.Dataset = None, output_signature: Dict[str, tf.TypeSpec] | None = None) tf.data.Dataset

Create a tf.data.Dataset from a Lance dataset.

Parameters:
  • dataset (Union[str, Path, LanceDataset]) – Lance dataset or dataset URI/path.

  • columns (Optional[Union[List[str], Dict[str, str]]], optional) – List of columns to include in the output dataset, or a mapping from output column name to a SQL expression. If not set, all columns will be read.

  • batch_size (int, optional) – Batch size, by default 256

  • filter (Optional[str], optional) – SQL filter expression, by default None.

  • fragments (Union[Iterable[int], Iterable[LanceFragment], tf.data.Dataset], optional) – If provided, only these fragments are read. This can be used to feed distributed training.

  • output_signature (Optional[Dict[str, tf.TypeSpec]], optional) – Override the output signature of the returned tensors. If not provided, the output signature is inferred from the projection schema.

Examples

import tensorflow as tf
import lance.tf.data

ds = lance.tf.data.from_lance(
    "s3://bucket/path",
    columns=["image", "id"],
    filter="catalog = 'train' AND split = 'train'",
    batch_size=100)

for batch in ds.repeat(10).shuffle(128).map(io_decode):
    print(batch["image"].shape)

from_lance can take an iterator or a tf.data.Dataset of fragments, so it can be used to feed distributed training.

import tensorflow as tf
import lance.tf.data

seed = 200  # seed for shuffling fragments consistently across distributed workers
fragments = (
    lance.tf.data.lance_fragments("s3://bucket/path")
    .repeat(10)
    .shuffle(4, seed=seed)
)
ds = lance.tf.data.from_lance(
    "s3://bucket/path",
    columns=["image", "id"],
    filter="catalog = 'train' AND split = 'train'",
    fragments=fragments,
    batch_size=100)
for batch in ds.shuffle(128).map(io_decode):
    print(batch["image"].shape)

lance.tf.data.from_lance_batches(dataset: str | Path | LanceDataset, *, shuffle: bool = False, seed: int | None = None, batch_size: int = 1024, skip: int = 0) tf.data.Dataset

Create a tf.data.Dataset of batch indices for a Lance dataset.

Parameters:
  • dataset (Union[str, Path, LanceDataset]) – A Lance Dataset or dataset URI/path.

  • shuffle (bool, optional) – Shuffle the batches, by default False

  • seed (Optional[int], optional) – Random seed for shuffling, by default None

  • batch_size (int, optional) – Batch size, by default 1024

  • skip (int, optional) – Number of batches to skip.

Returns:

A TensorFlow dataset of batch slice ranges. These can be passed to lance_take_batches() to create a TensorFlow dataset of batches.

Return type:

tf.data.Dataset
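
Examples

A minimal sketch, assuming each element of the returned dataset is a batch slice range that lance_take_batches() can consume:

from lance.tf.data import from_lance_batches

batches = from_lance_batches("s3://bucket/path", batch_size=1024, shuffle=True, seed=42)
for batch_range in batches.as_numpy_iterator():
    print(batch_range)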

lance.tf.data.lance_fragments(dataset: str | Path | LanceDataset) tf.data.Dataset

Create a tf.data.Dataset of Lance Fragments in the dataset.

Parameters:

dataset (Union[str, Path, LanceDataset]) – A Lance Dataset or dataset URI/path.
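
Examples

A minimal sketch; the shard count and seed are illustrative, e.g. for splitting fragments across distributed workers:

from lance.tf.data import lance_fragments

fragments = lance_fragments("s3://bucket/path")
# Deterministically shuffle, then take this worker's shard (worker 0 of 4).
fragments = fragments.shuffle(16, seed=42).shard(num_shards=4, index=0)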

lance.tf.data.lance_take_batches(dataset: str | Path | LanceDataset, batch_ranges: Iterable[Tuple[int, int]], *, columns: List[str] | Dict[str, str] | None = None, output_signature: Dict[str, tf.TypeSpec] | None = None, batch_readahead: int = 10) tf.data.Dataset

Create a tf.data.Dataset of batches from a Lance dataset.

Parameters:
  • dataset (Union[str, Path, LanceDataset]) – A Lance Dataset or dataset URI/path.

  • batch_ranges (Iterable[Tuple[int, int]]) – Iterable of batch indices.

  • columns (Optional[Union[List[str], Dict[str, str]]], optional) – List of columns to include in the output dataset, or a mapping from output column name to a SQL expression. If not set, all columns will be read.

  • output_signature (Optional[Dict[str, tf.TypeSpec]], optional) – Override the output signature of the returned tensors. If not provided, the output signature is inferred from the projection schema.

  • batch_readahead (int, default 10) – The number of batches to read ahead in parallel.

Examples

You can compose this with from_lance_batches() to create a randomized TensorFlow dataset. With from_lance_batches(), you can deterministically randomize the batches by setting the seed.

from lance.tf.data import from_lance_batches, lance_take_batches

batch_iter = from_lance_batches(dataset, batch_size=100, shuffle=True, seed=200)
batch_iter = batch_iter.as_numpy_iterator()
lance_ds = lance_take_batches(dataset, batch_iter)
lance_ds = lance_ds.unbatch().shuffle(500, seed=42).batch(100)

lance.tf.data.schema_to_spec(schema: Schema) TypeSpec

Convert a PyArrow Schema to a TensorFlow output signature.
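
Examples

A minimal sketch; the schema fields are illustrative, and the result corresponds to the output signature described above:

import pyarrow as pa

from lance.tf.data import schema_to_spec

schema = pa.schema([("id", pa.int64()), ("image", pa.binary())])
signature = schema_to_spec(schema)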

lance.tf.tfrecord module

lance.tf.tfrecord.infer_tfrecord_schema(uri, *, tensor_features=None, string_features=None, num_rows=None)

Infer an Arrow schema from a TFRecord file.

Parameters:
  • uri (str) – URI of the TFRecord file.

  • tensor_features (Optional[List[str]]) – Names of features that should be treated as tensors. Currently only fixed-shape tensors are supported.

  • string_features (Optional[List[str]]) – Names of features that should be treated as strings. Otherwise they will be treated as binary.

  • num_rows (Optional[int], default None) – Number of records to read to infer the schema. If None, the entire file will be read.

Returns:

An Arrow schema inferred from the TFRecord file. The schema is sorted alphabetically by field name, since TFRecord has no concept of field order.

Return type:

pyarrow.Schema
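
Examples

A minimal sketch; the file path and feature names are hypothetical:

from lance.tf.tfrecord import infer_tfrecord_schema

schema = infer_tfrecord_schema(
    "data/train.tfrecord",          # hypothetical path
    tensor_features=["embedding"],  # treat as a fixed-shape tensor
    string_features=["label"],      # treat as a string rather than binary
    num_rows=1000,                  # sample only the first 1000 records
)
print(schema)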

lance.tf.tfrecord.read_tfrecord(uri, schema, *, batch_size=10000)

Read a TFRecord file as an Arrow stream.

Parameters:
  • uri (str) – URI of the TFRecord file.

  • schema (pyarrow.Schema) – Arrow schema of the TFRecord file. Use infer_tfrecord_schema() to infer the schema. The schema may contain only a subset of fields; the reader will parse only the fields present in the schema.

  • batch_size (int, default 10000) – Number of records to read per batch.

Returns:

An Arrow reader, which can be passed directly to lance.write_dataset(). The output schema will match the schema provided, including field order.

Return type:

pyarrow.RecordBatchReader
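
Examples

A minimal end-to-end sketch with hypothetical paths, converting a TFRecord file into a Lance dataset as described above:

import lance

from lance.tf.tfrecord import infer_tfrecord_schema, read_tfrecord

schema = infer_tfrecord_schema("data/train.tfrecord")
reader = read_tfrecord("data/train.tfrecord", schema)
lance.write_dataset(reader, "data/train.lance")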
