- class lance.LanceOperation.Merge(lance.LanceOperation.BaseOperation)
Operation that adds columns. Unlike Overwrite, this should not change the structure of the fragments, allowing existing indices to be kept.
- fragments
The fragments that make up the new dataset.
- Type:
iterable of FragmentMetadata
- schema
The schema of the new dataset. Passing a LanceSchema is preferred, and passing a pyarrow.Schema is deprecated.
- Type:
LanceSchema or pyarrow.Schema
Warning
This is an advanced API for distributed operations. To overwrite or create a new dataset on a single machine, use lance.write_dataset().
Examples
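To add new columns to a dataset, first define a function that creates the new columns from the existing columns. Then apply it to each fragment with lance.fragment.LanceFragment.add_columns() (the example below calls merge_columns()), collect the returned fragment metadata, and commit a Merge operation with the new schema.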
>>> import lance
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> table = pa.table({"a": [1, 2, 3, 4], "b": ["a", "b", "c", "d"]})
>>> dataset = lance.write_dataset(table, "example")
>>> dataset.to_table().to_pandas()
   a  b
0  1  a
1  2  b
2  3  c
3  4  d
>>> def double_a(batch: pa.RecordBatch) -> pa.RecordBatch:
...     doubled = pc.multiply(batch["a"], 2)
...     return pa.record_batch([doubled], ["a_doubled"])
>>> fragments = []
>>> for fragment in dataset.get_fragments():
...     new_fragment, new_schema = fragment.merge_columns(double_a,
...                                                       columns=['a'])
...     fragments.append(new_fragment)
>>> operation = lance.LanceOperation.Merge(fragments, new_schema)
>>> dataset = lance.LanceDataset.commit("example", operation,
...                                     read_version=dataset.version)
>>> dataset.to_table().to_pandas()
   a  b  a_doubled
0  1  a          2
1  2  b          4
2  3  c          6
3  4  d          8
Public members
- Merge(fragments: Iterable[FragmentMetadata], schema)
Initialize self. See help(type(self)) for accurate signature.
- __repr__()
Return repr(self).
- fragments : Iterable[FragmentMetadata]