lance.LanceDataset.commit - Lance documentation

static lance.LanceDataset.commit(base_uri: str | Path | LanceDataset, operation: LanceOperation.BaseOperation | Transaction, blobs_op: LanceOperation.BaseOperation | None = None, read_version: int | None = None, commit_lock: CommitLock | None = None, storage_options: dict[str, str] | None = None, enable_v2_manifest_paths: bool | None = None, detached: bool | None = False, max_retries: int = 20) → LanceDataset

Create a new version of the dataset.

This is an advanced method that lets users describe a change that has already been made to the data files. It is not needed when Lance itself applies the changes (e.g. when using LanceDataset or write_dataset()).

Its current purpose is to support changes made in a distributed environment where no single process does all of the work, for example a distributed bulk update or a distributed bulk modify operation.

Once all of the changes have been made, this method can be called to make the changes visible by updating the dataset manifest.

Warning

This is an advanced API and doesn’t provide the same level of validation as the other APIs. For example, it’s the responsibility of the caller to ensure that the fragments are valid for the schema.

Parameters:

base_uri : str, Path, or LanceDataset: The base uri of the dataset, or the dataset object itself. Using the dataset object can be more efficient because it can re-use the file metadata cache.
operation : BaseOperation: The operation to apply to the dataset. This describes what changes have been made. See available operations under LanceOperation.
read_version : int, optional: The version of the dataset that was used as the base for the changes. This is not needed for overwrite or restore operations.
commit_lock : CommitLock, optional: A custom commit lock. Only needed if your object store does not support atomic commits. See the user guide for more details.
storage_options : dict, optional: Extra options that make sense for a particular storage connection. This is used to store connection parameters like credentials, endpoint, etc.
enable_v2_manifest_paths : bool, optional: If True, and this is a new dataset, uses the new V2 manifest paths. These paths provide more efficient opening of datasets with many versions on object stores. This parameter has no effect if the dataset already exists. To migrate an existing dataset, instead use the migrate_manifest_paths_v2() method. Default is False. WARNING: turning this on will make the dataset unreadable for older versions of Lance (prior to 0.17.0).
detached : bool, optional: If True, then the commit will not be part of the dataset lineage. It will never show up as the latest dataset and the only way to check it out in the future will be to specifically check it out by version. The version will be a random version that is only unique amongst detached commits. The caller should store this somewhere as there will be no other way to obtain it in the future.
max_retries : int: The maximum number of retries to perform when committing the dataset.
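
One way to picture how read_version and max_retries relate is a toy in-memory model of an optimistic commit with a bounded retry loop. This is only a sketch under assumptions, not Lance's actual implementation: ToyDataset, CommitConflict, commit_append, and the interfere hook are hypothetical names invented for illustration, and the "re-base on the latest version and retry" step is an assumed simplification.

```python
class CommitConflict(Exception):
    """Raised when the retry budget is exhausted (hypothetical name)."""


class ToyDataset:
    """In-memory stand-in for a versioned dataset; not Lance's implementation."""

    def __init__(self):
        self.latest_version = 1
        self.fragments = []

    def commit_append(self, new_fragments, read_version, max_retries=20,
                      interfere=None):
        base = read_version
        retries = 0
        while True:
            if interfere is not None:
                interfere(self, retries)  # simulate a concurrent writer
            if base == self.latest_version:
                # Nobody committed since `base`: publish the change.
                self.fragments.extend(new_fragments)
                self.latest_version += 1
                return self.latest_version
            if retries >= max_retries:
                raise CommitConflict(f"gave up after {retries} retries")
            # Assumed behaviour: re-base the append on the latest version, retry.
            retries += 1
            base = self.latest_version


def one_concurrent_commit(ds, retries):
    # Another writer lands exactly one commit before our first attempt.
    if retries == 0:
        ds.latest_version += 1


ds = ToyDataset()
# No conflict: the base version is still the latest one.
v_clean = ds.commit_append(["frag-0"], read_version=1)
# One conflict: the first attempt is stale, the retry succeeds.
v_retry = ds.commit_append(["frag-1"], read_version=2,
                           interfere=one_concurrent_commit)
```

In this model a stale read_version costs one retry per intervening commit, and max_retries bounds how long a writer keeps trying before reporting a conflict.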

Returns:

A new version of the Lance dataset.

Return type:

LanceDataset

Examples

Creating a new dataset with the LanceOperation.Overwrite operation:

>>> import lance
>>> import pyarrow as pa
>>> tab1 = pa.table({"a": [1, 2], "b": ["a", "b"]})
>>> tab2 = pa.table({"a": [3, 4], "b": ["c", "d"]})
>>> fragment1 = lance.fragment.LanceFragment.create("example", tab1)
>>> fragment2 = lance.fragment.LanceFragment.create("example", tab2)
>>> fragments = [fragment1, fragment2]
>>> operation = lance.LanceOperation.Overwrite(tab1.schema, fragments)
>>> dataset = lance.LanceDataset.commit("example", operation)
>>> dataset.to_table().to_pandas()
   a  b
0  1  a
1  2  b
2  3  c
3  4  d