-
static lance.LanceDataset.commit(base_uri: str | Path | LanceDataset, operation: LanceOperation.BaseOperation | Transaction, blobs_op: LanceOperation.BaseOperation | None = None, read_version: int | None = None, commit_lock: CommitLock | None = None, storage_options: dict[str, str] | None = None, enable_v2_manifest_paths: bool | None = None, detached: bool | None = False, max_retries: int = 20) -> LanceDataset
Create a new version of the dataset.
This is an advanced method that lets users describe a change that has already been made to the data files. It is not needed when using Lance to apply changes (e.g. when using LanceDataset or write_dataset()).
Its current purpose is to support changes made in a distributed environment where no single process does all of the work, for example a distributed bulk update or a distributed bulk modify operation.
Once all of the changes have been made, this method can be called to make the changes visible by updating the dataset manifest.
Warning
This is an advanced API and doesn’t provide the same level of validation as the other APIs. For example, it’s the responsibility of the caller to ensure that the fragments are valid for the schema.
- Parameters:
- base_uri : str, Path, or LanceDataset
The base uri of the dataset, or the dataset object itself. Using the dataset object can be more efficient because it can re-use the file metadata cache.
- operation : BaseOperation
The operation to apply to the dataset. This describes what changes have been made. See available operations under LanceOperation.
- read_version : int, optional
The version of the dataset that was used as the base for the changes. This is not needed for overwrite or restore operations.
- commit_lock : CommitLock, optional
A custom commit lock. Only needed if your object store does not support atomic commits. See the user guide for more details.
- storage_options : dict, optional
Extra options that make sense for a particular storage connection. This is used to store connection parameters like credentials, endpoint, etc.
- enable_v2_manifest_paths : bool, optional
If True, and this is a new dataset, uses the new V2 manifest paths. These paths provide more efficient opening of datasets with many versions on object stores. This parameter has no effect if the dataset already exists. To migrate an existing dataset, instead use the migrate_manifest_paths_v2() method. Default is False. WARNING: turning this on will make the dataset unreadable for older versions of Lance (prior to 0.17.0).
- detached : bool, optional
If True, then the commit will not be part of the dataset lineage. It will never show up as the latest dataset and the only way to check it out in the future will be to specifically check it out by version. The version will be a random version that is only unique amongst detached commits. The caller should store this somewhere as there will be no other way to obtain it in the future.
- max_retries : int
The maximum number of retries to perform when committing the dataset.
- Returns:
A new version of the Lance dataset.
- Return type:
LanceDataset
Examples
Creating a new dataset with the LanceOperation.Overwrite operation:
>>> import lance
>>> import pyarrow as pa
>>> tab1 = pa.table({"a": [1, 2], "b": ["a", "b"]})
>>> tab2 = pa.table({"a": [3, 4], "b": ["c", "d"]})
>>> fragment1 = lance.fragment.LanceFragment.create("example", tab1)
>>> fragment2 = lance.fragment.LanceFragment.create("example", tab2)
>>> fragments = [fragment1, fragment2]
>>> operation = lance.LanceOperation.Overwrite(tab1.schema, fragments)
>>> dataset = lance.LanceDataset.commit("example", operation)
>>> dataset.to_table().to_pandas()
   a  b
0  1  a
1  2  b
2  3  c
3  4  d