Data Evolution

Lance supports zero-copy data evolution, which means that you can cheaply add new columns to a dataset and backfill their data.

LanceDataset.add_columns(transforms: Dict[str, str] | BatchUDF | ReaderLike | pyarrow.Field | List[pyarrow.Field] | pyarrow.Schema, read_columns: List[str] | None = None, reader_schema: pa.Schema | None = None, batch_size: int | None = None)

Add new columns with defined values.

There are several ways to specify the new columns. First, you can provide SQL expressions for each new column. Second, you can provide a UDF that takes a batch of existing data and returns a new batch with the new columns. These new columns will be appended to the dataset.

You can also provide a RecordBatchReader, which will read the new column values from some external source. This is often useful when the new column values have already been staged to files (often by some distributed process).

See the lance.batch_udf() decorator for more information on writing UDFs.

Parameters:
  • transforms (dict or BatchUDF or ReaderLike or pyarrow.Field or list of pyarrow.Field or pyarrow.Schema) – If this is a dictionary, then the keys are the names of the new columns and the values are SQL expression strings. These strings can reference existing columns in the dataset. If this is a BatchUDF, then it is a UDF that takes a batch of existing data and returns a new batch with the new columns. If this is a ReaderLike, the new column values are read from it. If this is a pyarrow.Field, a list of pyarrow.Field, or a pyarrow.Schema, all-NULL columns with the given schema are added in a metadata-only operation.

  • read_columns (list of str, optional) –

    The names of the columns that the UDF will read. If None, then the UDF will read all columns. This is only used when transforms is a UDF. Otherwise, the read columns are inferred from the SQL expressions.

    This can include _rowid or _rowaddr to read the row id or row address from the dataset.

  • reader_schema (pa.Schema, optional) – Only valid if transforms is a ReaderLike object. This will be used to determine the schema of the reader.

  • batch_size (int, optional) – The number of rows to read at a time from the source dataset when applying the transform. This is ignored if the dataset is a v1 dataset.

Examples

>>> import lance
>>> import pandas as pd
>>> import pyarrow as pa
>>> table = pa.table({"a": [1, 2, 3]})
>>> dataset = lance.write_dataset(table, "my_dataset")
>>> @lance.batch_udf()
... def double_a(batch):
...     df = batch.to_pandas()
...     return pd.DataFrame({'double_a': 2 * df['a']})
>>> dataset.add_columns(double_a)
>>> dataset.to_table().to_pandas()
   a  double_a
0  1         2
1  2         4
2  3         6
>>> dataset.add_columns({"triple_a": "a * 3"})
>>> dataset.to_table().to_pandas()
   a  double_a  triple_a
0  1         2         3
1  2         4         6
2  3         6         9

See also

LanceDataset.merge

Merge a pre-computed set of columns into the dataset.

LanceDataset.drop_columns(columns: List[str])

Drop one or more columns from the dataset.

This is a metadata-only operation and does not remove the data from the underlying storage. In order to remove the data, you must subsequently call compact_files to rewrite the data without the removed columns and then call cleanup_old_versions to remove the old files.

Parameters:

columns (list of str) – The names of the columns to drop. These can be nested column references (e.g. “a.b.c”) or top-level column names (e.g. “a”).

Examples

>>> import lance
>>> import pyarrow as pa
>>> table = pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]})
>>> dataset = lance.write_dataset(table, "example")
>>> dataset.drop_columns(["a"])
>>> dataset.to_table().to_pandas()
   b
0  a
1  b
2  c