lance.LanceDataset.merge_insert(on: str | Iterable[str])

Returns a builder that can be used to create a “merge insert” operation

This operation can add rows, update rows, and remove rows in a single transaction. It is a very generic tool that can be used to create behaviors like “insert if not exists”, “update or insert (i.e. upsert)”, or even replace a portion of existing data with new data (e.g. replace all data where month=”january”)

The merge insert operation works by combining new data from a source table with existing data in a target table by using a join. There are three categories of records.

“Matched” records are records that exist in both the source table and the target table. “Not matched” records exist only in the source table (e.g. these are new data). “Not matched by source” records exist only in the target table (this is old data).

The builder returned by this method can be used to customize what should happen for each category of data.

Please note that the data will be reordered as part of this operation. This is because updated rows will be deleted from the dataset and then reinserted at the end with the new values. The order of the newly inserted rows may fluctuate randomly because a hash-join operation is used internally.

Parameters:
on : Union[str, Iterable[str]]

A column (or columns) to join on. This is how records from the source table and target table are matched. Typically this is some kind of key or id column.

Examples

Use when_matched_update_all() and when_not_matched_insert_all() to perform an “upsert” operation. This will update rows that already exist in the dataset and insert rows that do not exist.

>>> import lance
>>> import pyarrow as pa
>>> table = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
>>> dataset = lance.write_dataset(table, "example")
>>> new_table = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
>>> # Perform a "upsert" operation
>>> dataset.merge_insert("a")     \
...             .when_matched_update_all()     \
...             .when_not_matched_insert_all() \
...             .execute(new_table)
{'num_inserted_rows': 1, 'num_updated_rows': 2, 'num_deleted_rows': 0}
>>> dataset.to_table().sort_by("a").to_pandas()
   a  b
0  1  b
1  2  x
2  3  y
3  4  z

Use when_not_matched_insert_all() to perform an “insert if not exists” operation. This will only insert rows that do not already exist in the dataset.

>>> import lance
>>> import pyarrow as pa
>>> table = pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]})
>>> dataset = lance.write_dataset(table, "example2")
>>> new_table = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
>>> # Perform an "insert if not exists" operation
>>> dataset.merge_insert("a")     \
...             .when_not_matched_insert_all() \
...             .execute(new_table)
{'num_inserted_rows': 1, 'num_updated_rows': 0, 'num_deleted_rows': 0}
>>> dataset.to_table().sort_by("a").to_pandas()
   a  b
0  1  a
1  2  b
2  3  c
3  4  z

You are not required to provide all the columns. If you only want to update a subset of columns, you can omit columns you don’t want to update. Omitted columns will keep their existing values if they are updated, or will be null if they are inserted.

>>> import lance
>>> import pyarrow as pa
>>> table = pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"], \
...                   "c": ["x", "y", "z"]})
>>> dataset = lance.write_dataset(table, "example3")
>>> new_table = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
>>> # Perform an "upsert" operation, only updating column "a"
>>> dataset.merge_insert("a")     \
...             .when_matched_update_all()     \
...             .when_not_matched_insert_all() \
...             .execute(new_table)
{'num_inserted_rows': 1, 'num_updated_rows': 2, 'num_deleted_rows': 0}
>>> dataset.to_table().sort_by("a").to_pandas()
   a  b     c
0  1  a     x
1  2  x     y
2  3  y     z
3  4  z  None