lance.tf package¶
Submodules¶
lance.tf.data module¶
Tensorflow Dataset (tf.data) implementation for Lance.
Warning
Experimental feature. API stability is not guaranteed.
- lance.tf.data.arrow_data_type_to_tf(dt: DataType) → DType ¶
Convert a PyArrow DataType to a TensorFlow DType.
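A minimal sketch of the conversion for a scalar type (the expected result in the comment is an assumption based on the usual PyArrow-to-TensorFlow type mapping):

import pyarrow as pa
from lance.tf.data import arrow_data_type_to_tf

dtype = arrow_data_type_to_tf(pa.float32())  # expected: tf.float32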
- lance.tf.data.column_to_tensor(array: Array, tensor_spec: TensorSpec) → Tensor ¶
Convert a PyArrow array into a TensorFlow tensor.
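A sketch of converting a single flat column; the TensorSpec shown here is an assumption for a variable-length float32 column:

import pyarrow as pa
import tensorflow as tf
from lance.tf.data import column_to_tensor

arr = pa.array([1.0, 2.0, 3.0], type=pa.float32())
spec = tf.TensorSpec(shape=(None,), dtype=tf.float32)
tensor = column_to_tensor(arr, spec)  # a rank-1 tf.Tensor of dtype float32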
- lance.tf.data.data_type_to_tensor_spec(dt: DataType) → TensorSpec ¶
Convert PyArrow DataType to Tensorflow TensorSpec.
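For example (a sketch; the exact shape in the resulting spec depends on the input type, e.g. for list or tensor types):

import pyarrow as pa
from lance.tf.data import data_type_to_tensor_spec

spec = data_type_to_tensor_spec(pa.int64())  # a TensorSpec with dtype tf.int64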
- lance.tf.data.from_lance(dataset: str | Path | LanceDataset, *, columns: List[str] | Dict[str, str] | None = None, batch_size: int = 256, filter: str | None = None, fragments: Iterable[int] | Iterable[LanceFragment] | tf.data.Dataset = None, output_signature: Dict[str, tf.TypeSpec] | None = None) → tf.data.Dataset ¶
Create a tf.data.Dataset from a Lance dataset.
- Parameters:
dataset (Union[str, Path, LanceDataset]) – Lance dataset or dataset URI/path.
columns (Optional[Union[List[str], Dict[str, str]]], optional) – List of columns to include in the output dataset, or a mapping from output column name to column expression. If not set, all columns will be read.
batch_size (int, optional) – Batch size, by default 256
filter (Optional[str], optional) – SQL filter expression, by default None.
fragments (Union[List[LanceFragment], tf.data.Dataset], optional) – If provided, only these fragments are read. This can be used to feed distributed training.
output_signature (Optional[Dict[str, tf.TypeSpec]], optional) – Override output signature of the returned tensors. If not provided, the output signature is inferred from the projection Schema.
Examples
import tensorflow as tf
import lance.tf.data

ds = lance.tf.data.from_lance(
    "s3://bucket/path",
    columns=["image", "id"],
    filter="catalog = 'train' AND split = 'train'",
    batch_size=100)
for batch in ds.repeat(10).shuffle(128).map(io_decode):
    print(batch["image"].shape)
from_lance can take an iterator or tf.data.Dataset of Fragments, so it can be used to feed distributed training.

import tensorflow as tf
import lance.tf.data

seed = 200  # seed to shuffle the fragments across distributed machines.
fragments = (
    lance.tf.data.lance_fragments("s3://bucket/path")
    .repeat(10)
    .shuffle(4, seed=seed)
)
ds = lance.tf.data.from_lance(
    "s3://bucket/path",
    columns=["image", "id"],
    filter="catalog = 'train' AND split = 'train'",
    fragments=fragments,
    batch_size=100)
for batch in ds.shuffle(128).map(io_decode):
    print(batch["image"].shape)
- lance.tf.data.from_lance_batches(dataset: str | Path | LanceDataset, *, shuffle: bool = False, seed: int | None = None, batch_size: int = 1024, skip: int = 0) → tf.data.Dataset ¶
Create a tf.data.Dataset of batch indices for a Lance dataset.
- Parameters:
dataset (Union[str, Path, LanceDataset]) – A Lance Dataset or dataset URI/path.
shuffle (bool, optional) – Shuffle the batches, by default False
seed (Optional[int], optional) – Random seed for shuffling, by default None
batch_size (int, optional) – Batch size, by default 1024
skip (int, optional) – Number of batches to skip.
- Returns:
A tensorflow dataset of batch slice ranges. These can be passed to lance_take_batches() to create a Tensorflow dataset of batches.
- Return type:
tf.data.Dataset
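A sketch of iterating the returned batch ranges (the dataset URI is hypothetical); each element is a (start, end) slice range, matching the batch_ranges argument of lance_take_batches():

import lance
from lance.tf.data import from_lance_batches

dataset = lance.dataset("s3://bucket/path")
batches = from_lance_batches(dataset, batch_size=1024, shuffle=True, seed=42)
for batch_range in batches.as_numpy_iterator():
    print(batch_range)  # a (start, end) slice range into the dataset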
- lance.tf.data.lance_fragments(dataset: str | Path | LanceDataset) → tf.data.Dataset ¶
Create a tf.data.Dataset of Lance Fragments in the dataset.
- Parameters:
dataset (Union[str, Path, LanceDataset]) – A Lance Dataset or dataset URI/path.
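For example, the fragments can be sharded across workers before being passed to from_lance (a sketch; the URI is hypothetical):

import lance.tf.data

fragments = lance.tf.data.lance_fragments("s3://bucket/path")
# Each of 4 workers reads a disjoint subset of fragments.
worker_fragments = fragments.shard(num_shards=4, index=0)
ds = lance.tf.data.from_lance("s3://bucket/path", fragments=worker_fragments)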
- lance.tf.data.lance_take_batches(dataset: str | Path | LanceDataset, batch_ranges: Iterable[Tuple[int, int]], *, columns: List[str] | Dict[str, str] | None = None, output_signature: Dict[str, tf.TypeSpec] | None = None, batch_readahead: int = 10) → tf.data.Dataset ¶
Create a tf.data.Dataset of batches from a Lance dataset.
- Parameters:
dataset (Union[str, Path, LanceDataset]) – A Lance Dataset or dataset URI/path.
batch_ranges (Iterable[Tuple[int, int]]) – Iterable of batch indices.
columns (Optional[Union[List[str], Dict[str, str]]], optional) – List of columns to include in the output dataset, or a mapping from output column name to column expression. If not set, all columns will be read.
output_signature (Optional[Dict[str, tf.TypeSpec]], optional) – Override output signature of the returned tensors. If not provided, the output signature is inferred from the projection Schema.
batch_readahead (int, default 10) – The number of batches to read ahead in parallel.
Examples
You can compose this with from_lance_batches to create a randomized Tensorflow dataset. With from_lance_batches, you can deterministically randomize the batches by setting seed.

batch_iter = from_lance_batches(dataset, batch_size=100, shuffle=True, seed=200)
batch_iter = batch_iter.as_numpy_iterator()
lance_ds = lance_take_batches(dataset, batch_iter)
lance_ds = lance_ds.unbatch().shuffle(500, seed=42).batch(100)
- lance.tf.data.schema_to_spec(schema: Schema) → TypeSpec ¶
Convert PyArrow Schema to Tensorflow output signature.
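A sketch of building an output signature from a dataset's schema (the URI is hypothetical), e.g. to pass as output_signature to from_lance:

import lance
from lance.tf.data import schema_to_spec

ds = lance.dataset("s3://bucket/path")
signature = schema_to_spec(ds.schema)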
lance.tf.tfrecord module¶
- lance.tf.tfrecord.infer_tfrecord_schema(uri, *, tensor_features=None, string_features=None, num_rows=None)¶
Infer an Arrow schema from a TFRecord file.
- Parameters:
uri (str) – URI of the tfrecord file
tensor_features (Optional[List[str]]) – Names of features that should be treated as tensors. Currently only fixed-shape tensors are supported.
string_features (Optional[List[str]]) – Names of features that should be treated as strings. Otherwise they will be treated as binary.
num_rows (Optional[int], default None) – Number of records to read to infer the schema. If None, the entire file will be read.
- Returns:
An Arrow schema inferred from the tfrecord file. The schema is alphabetically sorted by field names, since TFRecord doesn’t have a concept of field order.
- Return type:
pyarrow.Schema
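A sketch of inferring a schema from a sample of records (the file path and feature names are hypothetical):

from lance.tf.tfrecord import infer_tfrecord_schema

schema = infer_tfrecord_schema(
    "train.tfrecord",
    tensor_features=["embedding"],  # parse as fixed-shape tensors
    string_features=["caption"],    # parse as strings rather than binary
    num_rows=1000,                  # sample only the first 1000 records
)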
- lance.tf.tfrecord.read_tfrecord(uri, schema, *, batch_size=10000)¶
Read a TFRecord file as an Arrow stream.
- Parameters:
uri (str) – URI of the tfrecord file
schema (pyarrow.Schema) – Arrow schema of the tfrecord file. Use infer_tfrecord_schema() to infer the schema. The schema is allowed to be a subset of fields; the reader will only parse the fields that are present in the schema.
batch_size (int, default 10000) – Number of records to read per batch.
- Returns:
An Arrow reader, which can be passed directly to lance.write_dataset(). The output schema will match the schema provided, including field order.
- Return type:
pyarrow.RecordBatchReader
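Putting the two together, a sketch of converting a TFRecord file into a Lance dataset (both paths are hypothetical):

import lance
from lance.tf.tfrecord import infer_tfrecord_schema, read_tfrecord

uri = "train.tfrecord"
schema = infer_tfrecord_schema(uri)
reader = read_tfrecord(uri, schema, batch_size=10000)
lance.write_dataset(reader, "train.lance")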