- class lance.LanceDataset(pyarrow._dataset.Dataset)
A dataset in Lance format, with its data stored at the given URI.
Public members
- LanceDataset(uri: str | Path, version: int | str | None = None, ...)
Initialize self. See help(type(self)) for accurate signature.
- __reduce__()
Dataset.__reduce_cython__(self)
- property uri : str
The location of the data
- property tags : Tags
- list_indices() list[Index]
- index_statistics(index_name: str) dict[str, Any]
- property has_index
- scanner(...) LanceScanner
Return a Scanner that can support various pushdowns.
- property lance_schema : LanceSchema
The LanceSchema for this dataset
- property data_storage_version : str
The version of the data storage format this dataset is using
- property max_field_id : int
The maximum field ID recorded in the manifest
- to_table(...) Table
Read the data into memory as a pyarrow.Table
- property partition_expression
Not implemented (just override pyarrow dataset to prevent segfault)
- replace_schema(schema: Schema)
Not implemented (just override pyarrow dataset to prevent segfault)
- replace_schema_metadata(new_metadata: dict[str, str])
Replace the schema metadata of the dataset
- replace_field_metadata(field_name: str, new_metadata)
Replace the metadata of a field in the schema
- get_fragments(...) list[LanceFragment]
Get all fragments from the dataset.
- get_fragment(fragment_id: int) LanceFragment | None
Get the fragment with the given fragment ID.
- to_batches(...) Iterator[RecordBatch]
Read the dataset as materialized record batches.
- take_blobs(row_ids: list[int] | Array, ...) list[BlobFile]
Select blobs by row IDs.
- count_rows(filter: str | Expression | None = None, **kwargs) int
Count rows matching the scanner filter.
- join(right_dataset, keys, right_keys=None, ...)
Not implemented (just override pyarrow dataset to prevent segfault)
- alter_columns(*alterations: Iterable[AlterColumn])
Alter column name, data type, and nullability.
- add_columns(transforms: dict[str, str] | BatchUDF | ReaderLike, ...)
Add new columns with defined values.
- drop_columns(columns: list[str])
Drop one or more columns from the dataset
- delete(predicate: str | Expression)
Delete rows from the dataset.
- merge_insert(on: str | Iterable[str])
Returns a builder that can be used to create a “merge insert” operation
- versions()
Return all versions in this dataset.
- property version : int
Returns the currently checked out version of the dataset
- property latest_version : int
Returns the latest version of the dataset.
- checkout_version(version: int | str) LanceDataset
Load the given version of the dataset.
- restore()
Restore the currently checked out version as the latest version of the dataset.
- cleanup_old_versions(...) CleanupStats
Cleans up old versions of the dataset.
- create_scalar_index(column: str, index_type, ...)
Create a scalar index on a column.
- create_index(column: str | list[str], ...) LanceDataset
Create an index on the given column or columns.
- drop_index(name: str)
Drops an index from the dataset
- session() _Session
Return the dataset session, which holds the dataset’s state.
- static commit(base_uri, ...) LanceDataset
Create a new version of the dataset
- static commit_batch(dest, ...) BulkCommitResult
Create a new version of the dataset with multiple transactions.
- validate()
Validate the dataset.
- migrate_manifest_paths_v2()
Migrate the manifest paths to the new format.
- property optimize : DatasetOptimizer
- property stats : LanceStats
Experimental API
- classmethod LanceDataset(*args, **kwargs)
Create and return a new object. See help(type) for accurate signature.
- filter(expression)
Apply a row filter to the dataset.
- join_asof(right_dataset, on, by, tolerance, right_on=None, ...)
Perform an asof join between this dataset and another one.
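As a rough illustration of how the version-related members in the summary above fit together, the sketch below combines `version`, `latest_version`, `count_rows`, and `checkout_version`. The helper names `version_summary` and `rows_at_version` are hypothetical and not part of Lance. The code only assumes that `ds` exposes the members as listed, and in particular that `checkout_version` returns a separate dataset handle, as its listed return type suggests.

```python
def version_summary(ds, predicate=None):
    """Summarize version state and row count for a dataset-like object.

    `ds` is assumed to expose the listed members `version`,
    `latest_version`, and `count_rows(filter=...)`.
    """
    current = ds.version
    latest = ds.latest_version
    return {
        "version": current,
        "latest_version": latest,
        # True when the checked-out version is also the newest one.
        "is_latest": current == latest,
        "rows": ds.count_rows(filter=predicate),
    }


def rows_at_version(ds, version, predicate=None):
    """Count rows as of `version`, leaving `ds` itself untouched.

    checkout_version is listed as returning a LanceDataset, so the
    result is treated here as a separate handle pinned to that version.
    """
    older = ds.checkout_version(version)
    return older.count_rows(filter=predicate)
```

With the real library, `ds` would typically be obtained from `lance.dataset(uri)`, and `predicate` would be a filter string or expression of the kind `count_rows` accepts.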
-
LanceDataset(uri: str | Path, version: int | str | None =