class lance.LanceDataset(pyarrow._dataset.Dataset)

A dataset in Lance format, where the data is stored at the given uri.
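
For example, a minimal round trip (a sketch: the path ./example.lance and the columns id and name are illustrative, not part of the API):

>>> import lance
>>> import pyarrow as pa
>>> table = pa.table({"id": [1, 2, 3], "name": ["alice", "bob", "carol"]})
>>> _ = lance.write_dataset(table, "./example.lance")  # create the dataset on disk
>>> ds = lance.dataset("./example.lance")              # open it as a LanceDataset
>>> ds.count_rows()
3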

Public members

LanceDataset(uri: str | Path, version: int | str | None = None, ...)

Open the dataset at the given uri, optionally checking out the given version.

property uri : str

The location of the data

property tags : Tags

Tag management for the dataset.

list_indices() → list[Index]

List the indices that have been created on the dataset.

index_statistics(index_name: str) → dict[str, Any]

Return statistics about the index with the given name.

property has_index

Whether the dataset has any indices.

scanner(...) → LanceScanner

Return a Scanner that can support various pushdowns.
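
For example, pushing a projection, filter, and limit down into the scan (a sketch against the illustrative dataset above):

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> scan = ds.scanner(columns=["name"], filter="id > 1", limit=10)
>>> scan.to_table()  # only matching rows and requested columns are read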

property schema : Schema

The pyarrow Schema for this dataset

property lance_schema : LanceSchema

The LanceSchema for this dataset

property data_storage_version : str

The version of the data storage format this dataset is using

property max_field_id : int

The max field id in the manifest

to_table(...) → Table

Read the data into memory as a pyarrow.Table
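
to_table() forwards its arguments to scanner(), so the same pushdowns apply; for example, with the illustrative dataset above:

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.to_table(columns=["id", "name"], filter="id >= 2")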

property partition_expression

Not implemented (overrides the pyarrow Dataset method only to prevent a segfault)

replace_schema(schema: Schema)

Not implemented (overrides the pyarrow Dataset method only to prevent a segfault)

replace_schema_metadata(new_metadata: dict[str, str])

Replace the schema metadata of the dataset

replace_field_metadata(field_name: str, new_metadata)

Replace the metadata of a field in the schema
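
For example, attaching metadata at the schema and field level (a sketch; the keys and values here are arbitrary illustrations):

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.replace_schema_metadata({"owner": "data-team"})
>>> ds.replace_field_metadata("id", {"description": "primary key"})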

get_fragments(...) → list[LanceFragment]

Get all fragments from the dataset.

get_fragment(fragment_id: int) → LanceFragment | None

Get the fragment with the given fragment id.

to_batches(...) → Iterator[RecordBatch]

Read the dataset as materialized record batches.
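
to_batches() accepts the same options as scanner(); for example, streaming through the dataset without materializing it all at once:

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> for batch in ds.to_batches(batch_size=1024):
...     print(batch.num_rows)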

sample(num_rows: int, ...) → Table

Select a random sample of data

take(indices: list[int] | Array, ...) → Table

Select rows of data by index.
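
For example, fetching rows by offset with a column projection (a sketch):

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.take([0, 2], columns=["name"])  # rows at offsets 0 and 2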

take_blobs(row_ids: list[int] | Array, ...) → list[BlobFile]

Select blobs by row IDs.

head(num_rows, **kwargs)

Load the first N rows of the dataset.

count_rows(filter: str | Expression | None = None, **kwargs) → int

Count rows matching the scanner filter.
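
For example, counting all rows and then only the rows matching a filter, against the illustrative three-row dataset above:

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.count_rows()
3
>>> ds.count_rows(filter="id > 1")
2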

join(right_dataset, keys, right_keys=None, ...)

Not implemented (overrides the pyarrow Dataset method only to prevent a segfault)

alter_columns(*alterations: Iterable[AlterColumn])

Alter column name, data type, and nullability.
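
Each alteration is a dict whose "path" key names the column, plus the changes to apply; for example, renaming one column and relaxing nullability on another (a sketch):

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.alter_columns({"path": "name", "name": "full_name"},
...                  {"path": "id", "nullable": True})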

merge(data_obj: ReaderLike, left_on: str, ...)

Merge another dataset into this one.

add_columns(transforms: dict[str, str] | BatchUDF | ReaderLike, ...)

Add new columns with defined values.
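
When given a dict, the values are SQL expressions evaluated over the existing columns; for example (a sketch):

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.add_columns({"id_doubled": "id * 2"})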

drop_columns(columns: list[str])

Drop one or more columns from the dataset

delete(predicate: str | Expression)

Delete rows from the dataset.
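
The predicate is a SQL filter expression (or a pyarrow Expression); for example:

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.delete("id = 3")  # commits a new version without the matching rows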

insert(data: ReaderLike, *, mode='append', **kwargs)

Insert data into the dataset.

merge_insert(on: str | Iterable[str])

Returns a builder that can be used to create a “merge insert” operation.
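
For example, a typical upsert keyed on id (a sketch; new_data is illustrative):

>>> import lance
>>> import pyarrow as pa
>>> ds = lance.dataset("./example.lance")
>>> new_data = pa.table({"id": [2, 4], "name": ["bobby", "dave"]})
>>> (ds.merge_insert("id")
...    .when_matched_update_all()
...    .when_not_matched_insert_all()
...    .execute(new_data))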

update(updates: dict[str, str], ...) → UpdateResult

Update column values for rows matching the where predicate.
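
The values in updates are SQL expressions, so string literals need their own quotes; for example:

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.update({"name": "'unknown'"}, where="id = 2")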

versions()

Return all versions in this dataset.

property version : int

Returns the currently checked out version of the dataset

property latest_version : int

Returns the latest version of the dataset.

checkout_version(version: int | str) → LanceDataset

Load the given version of the dataset.

restore()

Restore the currently checked out version as the latest version of the dataset.
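
For example, a time-travel workflow over the illustrative dataset (a sketch; it assumes a version 1 exists):

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.version                             # currently checked out version
>>> [v["version"] for v in ds.versions()]  # all versions with timestamps
>>> old = ds.checkout_version(1)           # view of version 1
>>> old.restore()                          # commit version 1 as the new latest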

cleanup_old_versions(...) → CleanupStats

Cleans up old versions of the dataset.
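
For example, removing data reachable only from versions older than two weeks (a sketch; the returned CleanupStats reports what was removed):

>>> import lance
>>> from datetime import timedelta
>>> ds = lance.dataset("./example.lance")
>>> stats = ds.cleanup_old_versions(older_than=timedelta(days=14))
>>> stats.bytes_removed  # bytes reclaimed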

create_scalar_index(column: str, index_type, ...)

Create a scalar index on a column.
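
For example, a BTREE index on the illustrative id column, which filters like id = 2 can then use:

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.create_scalar_index("id", index_type="BTREE")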

create_index(column: str | list[str], ...) → LanceDataset

Create an index on a column.
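
For example, an IVF_PQ vector index (a sketch: it assumes the dataset has a fixed-size-list float column named vector, which the illustrative table above does not):

>>> import lance
>>> ds = lance.dataset("./example.lance")
>>> ds.create_index("vector", index_type="IVF_PQ",
...                 num_partitions=256, num_sub_vectors=16)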

drop_index(name: str)

Drops an index from the dataset

session() → _Session

Return the dataset session, which holds the dataset’s state.

static commit(base_uri, ...) → LanceDataset

Create a new version of the dataset.

static commit_batch(dest, ...) → BulkCommitResult

Create a new version of dataset with multiple transactions.

validate()

Validate the dataset.

migrate_manifest_paths_v2()

Migrate the manifest paths to the new format.

property optimize : DatasetOptimizer

property stats : LanceStats

Experimental API

static drop(base_uri: str | Path, ...) → None

filter(expression)

Apply a row filter to the dataset.

sort_by(sorting, **kwargs)

Sort the Dataset by one or multiple columns.

join_asof(right_dataset, on, by, tolerance, right_on=None, ...)

Perform an asof join between this dataset and another one.