class lance.LanceDataset(pyarrow._dataset.Dataset)

A dataset in Lance format, where the data is stored at the given uri.
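
For example, a minimal round trip (a sketch: the path ./example.lance and the columns id and name are illustrative, not part of the API):

>>> import lance
>>> import pyarrow as pa
>>> table = pa.table({"id": [1, 2, 3], "name": ["alice", "bob", "carol"]})
>>> _ = lance.write_dataset(table, "./example.lance")  # create the dataset on disk
>>> ds = lance.dataset("./example.lance")              # open it as a LanceDataset
>>> ds.count_rows()
3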

Public members

LanceDataset(uri: str | Path, version: int | str | None = None, ...)

Open the dataset at the given uri, optionally checking out the given version.

property uri : str

The location of the data

property tags : Tags

Tag management for the dataset.

list_indices() → list[Index]

List the indices that have been created on the dataset.

index_statistics(index_name: str) → dict[str, Any]

Return statistics about the index with the given name.

property has_index

Whether the dataset has any indices.

scanner(...) → LanceScanner

Return a Scanner that can support various pushdowns.
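
For example, pushing a projection, filter, and limit down into the scan (a sketch against the illustrative dataset above):

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> scan = ds.scanner(columns=["name"], filter="id > 1", limit=10)
>>> scan.to_table()  # only matching rows and requested columns are read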

property schema : Schema

The pyarrow Schema for this dataset

property lance_schema : LanceSchema

The LanceSchema for this dataset

property data_storage_version : str

The version of the data storage format this dataset is using

property max_field_id : int

The max field id in the manifest

to_table(...) → Table

Read the data into memory as a pyarrow.Table
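
to_table() forwards its arguments to scanner(), so the same pushdowns apply; for example, with the illustrative dataset above:

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.to_table(columns=["id", "name"], filter="id >= 2")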

property partition_expression

Not implemented (overrides the pyarrow Dataset method only to prevent a segfault)

replace_schema(schema: Schema)

Not implemented (overrides the pyarrow Dataset method only to prevent a segfault)

replace_schema_metadata(new_metadata: dict[str, str])

Replace the schema metadata of the dataset

replace_field_metadata(field_name: str, new_metadata)

Replace the metadata of a field in the schema
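
For example, attaching metadata at the schema and field level (a sketch; the keys and values here are arbitrary illustrations):

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.replace_schema_metadata({"owner": "data-team"})
>>> ds.replace_field_metadata("id", {"description": "primary key"})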

get_fragments(...) → list[LanceFragment]

Get all fragments from the dataset.

get_fragment(fragment_id: int) → LanceFragment | None

Get the fragment with the given fragment id.

to_batches(...) → Iterator[RecordBatch]

Read the dataset as materialized record batches.
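
to_batches() accepts the same options as scanner(); for example, streaming through the dataset without materializing it all at once:

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> for batch in ds.to_batches(batch_size=1024):
...     print(batch.num_rows)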

sample(num_rows: int, ...) → Table

Select a random sample of data

take(indices: list[int] | Array, ...) → Table

Select rows of data by index.
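
For example, fetching rows by offset with a column projection (a sketch):

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.take([0, 2], columns=["name"])  # rows at offsets 0 and 2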

take_blobs(row_ids: list[int] | Array, ...) → list[BlobFile]

Select blobs by row IDs.

head(num_rows, **kwargs)

Load the first N rows of the dataset.

count_rows(filter: str | Expression | None = None, **kwargs) → int

Count rows matching the scanner filter.
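
For example, counting all rows and then only the rows matching a filter, against the illustrative three-row dataset above:

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.count_rows()
3
>>> ds.count_rows(filter="id > 1")
2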

join(right_dataset, keys, right_keys=None, ...)

Not implemented (overrides the pyarrow Dataset method only to prevent a segfault)

alter_columns(*alterations: Iterable[AlterColumn])

Alter column name, data type, and nullability.
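
Each alteration is a dict whose "path" key names the column, plus the changes to apply; for example, renaming one column and relaxing nullability on another (a sketch):

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.alter_columns({"path": "name", "name": "full_name"},
...                  {"path": "id", "nullable": True})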

merge(data_obj: ReaderLike, left_on: str, ...)

Merge another dataset into this one.

add_columns(transforms: dict[str, str] | BatchUDF | ReaderLike, ...)

Add new columns with defined values.
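
When given a dict, the values are SQL expressions evaluated over the existing columns; for example (a sketch):

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.add_columns({"id_doubled": "id * 2"})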

drop_columns(columns: list[str])

Drop one or more columns from the dataset

delete(predicate: str | Expression)

Delete rows from the dataset.
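
The predicate is a SQL filter expression (or a pyarrow Expression); for example:

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.delete("id = 3")  # commits a new version without the matching rows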

insert(data: ReaderLike, *, mode='append', **kwargs)

Insert data into the dataset.

merge_insert(on: str | Iterable[str])

Returns a builder that can be used to create a “merge insert” operation.
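
For example, a typical upsert keyed on id (a sketch; new_data is illustrative):

>>> import lance
>>> import pyarrow as pa
>>> ds = lance.dataset("./example.lance")
>>> new_data = pa.table({"id": [2, 4], "name": ["bobby", "dave"]})
>>> (ds.merge_insert("id")
...    .when_matched_update_all()
...    .when_not_matched_insert_all()
...    .execute(new_data))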

update(updates: dict[str, str], ...) → UpdateResult

Update column values for rows matching the where predicate.
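
The values in updates are SQL expressions, so string literals need their own quotes; for example:

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.update({"name": "'unknown'"}, where="id = 2")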

versions()

Return all versions in this dataset.

property version : int

Returns the currently checked out version of the dataset

property latest_version : int

Returns the latest version of the dataset.

checkout_version(version: int | str) → LanceDataset

Load the given version of the dataset.

restore()

Restore the currently checked out version as the latest version of the dataset.
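
For example, a time-travel workflow over the illustrative dataset (a sketch; it assumes a version 1 exists):

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.version                             # currently checked out version
>>> [v["version"] for v in ds.versions()]  # all versions with timestamps
>>> old = ds.checkout_version(1)           # view of version 1
>>> old.restore()                          # commit version 1 as the new latest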

cleanup_old_versions(...) → CleanupStats

Cleans up old versions of the dataset.
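
For example, removing data reachable only from versions older than two weeks (a sketch; the returned CleanupStats reports what was removed):

>>> import lance
>>> from datetime import timedelta
>>> ds = lance.dataset("./example.lance")
>>> stats = ds.cleanup_old_versions(older_than=timedelta(days=14))
>>> stats.bytes_removed  # bytes reclaimed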

create_scalar_index(column: str, index_type, ...)

Create a scalar index on a column.
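
For example, a BTREE index on the illustrative id column, which filters like id = 2 can then use:

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.create_scalar_index("id", index_type="BTREE")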

create_index(column: str | list[str], ...) → LanceDataset

Create an index on a column.
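
For example, an IVF_PQ vector index (a sketch: it assumes the dataset has a fixed-size-list float column named vector, which the illustrative table above does not):

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.create_index("vector", index_type="IVF_PQ",
...                 num_partitions=256, num_sub_vectors=16)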

drop_index(name: str)

Drops an index from the dataset

session() → _Session

Return the dataset session, which holds the dataset’s state.

static commit(base_uri, ...) → LanceDataset

Create a new version of the dataset.

static commit_batch(dest, ...) → BulkCommitResult

Create a new version of dataset with multiple transactions.

validate()

Validate the dataset.

migrate_manifest_paths_v2()

Migrate the manifest paths to the new format.

property optimize : DatasetOptimizer

property stats : LanceStats

Experimental API

static drop(base_uri: str | Path, ...) → None

filter(expression)

Apply a row filter to the dataset.

sort_by(sorting, **kwargs)

Sort the Dataset by one or multiple columns.

join_asof(right_dataset, on, by, tolerance, right_on=None, ...)

Perform an asof join between this dataset and another one.