lance.LanceDataset.add_columns(transforms: dict[str, str] | BatchUDF | ReaderLike, read_columns: list[str] | None = None, reader_schema: pa.Schema | None = None, batch_size: int | None = None)
Add new columns with defined values.
There are several ways to specify the new columns. First, you can provide a SQL expression for each new column. Second, you can provide a UDF that takes a batch of existing data and returns a new batch with the new columns. These new columns will be appended to the dataset.
Third, you can provide a RecordBatchReader that reads the new column values from an external source. This is useful when the new column values have already been staged to files, often by a distributed process.
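Of the approaches above, the SQL-expression form is the most compact: each dictionary key names a new column, and each value is an expression evaluated row by row against the existing columns. As a rough illustration of that mapping only, the sketch below evaluates the same kind of expression with Python's built-in `sqlite3` module. SQLite is just a stand-in here and its dialect may differ from the one Lance accepts; the table name `t` and the helper `eval_expressions` are hypothetical and are not part of the Lance API.

```python
import sqlite3


def eval_expressions(rows, transforms):
    """Evaluate {new_column: sql_expression} against rows (a list of dicts).

    Hypothetical helper for illustration only; this is not how Lance
    evaluates expressions internally.
    """
    columns = list(rows[0])
    con = sqlite3.connect(":memory:")
    # Load the existing columns into a scratch table named `t`.
    con.execute(f"CREATE TABLE t ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    con.executemany(
        f"INSERT INTO t VALUES ({placeholders})",
        [tuple(row[c] for c in columns) for row in rows],
    )
    # Each dictionary entry becomes `<expression> AS "<new column name>"`.
    select_list = ", ".join(
        f'{expr} AS "{name}"' for name, expr in transforms.items()
    )
    cursor = con.execute(f"SELECT *, {select_list} FROM t ORDER BY rowid")
    names = [description[0] for description in cursor.description]
    result = [dict(zip(names, values)) for values in cursor]
    con.close()
    return result


rows = [{"a": 1}, {"a": 2}, {"a": 3}]
extended = eval_expressions(rows, {"triple_a": "a * 3"})
```

Each row in `extended` keeps its original `a` value and gains a `triple_a` value computed from it, which mirrors how the expression references an existing column by name.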
See the lance.add_columns_udf() decorator for more information on writing UDFs.
Parameters:
- transforms : dict or AddColumnsUDF or ReaderLike
If this is a dictionary, then the keys are the names of the new columns and the values are SQL expression strings. These strings can reference existing columns in the dataset. If this is an AddColumnsUDF, then it is a UDF that takes a batch of existing data and returns a new batch with the new columns. If this is a ReaderLike, then the new column values are read from it.
- read_columns : list of str, optional
The names of the columns that the UDF will read. If None, then the UDF will read all columns. This is only used when transforms is a UDF. Otherwise, the read columns are inferred from the SQL expressions.
- reader_schema : pa.Schema, optional
Only valid if transforms is a ReaderLike object. This will be used to determine the schema of the reader.
- batch_size : int, optional
The number of rows to read at a time from the source dataset when applying the transform. This is ignored if the dataset is a v1 dataset.
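The interaction between `read_columns`, `batch_size`, and a batch UDF can be pictured with a small plain-Python model. This is a conceptual sketch only, not Lance's implementation: the column-oriented `dict` standing in for a dataset and the helper `apply_batch_udf` are hypothetical names invented for illustration, and real UDFs receive Arrow record batches rather than dictionaries. The row-count check is an assumption of this toy model.

```python
from typing import Callable, Optional

# Hypothetical stand-in for a dataset: column name -> list of values.
Columns = dict


def apply_batch_udf(
    data: Columns,
    udf: Callable[[Columns], Columns],
    read_columns: Optional[list] = None,
    batch_size: Optional[int] = None,
) -> Columns:
    """Toy model: run `udf` over row batches and append the returned columns."""
    num_rows = len(next(iter(data.values()))) if data else 0
    # With no read_columns, the UDF sees every existing column.
    names = list(data) if read_columns is None else read_columns
    # With no batch_size, this toy model processes all rows as one batch.
    step = batch_size or max(num_rows, 1)

    new_columns: Columns = {}
    for start in range(0, num_rows, step):
        rows_in_batch = min(step, num_rows - start)
        # Project only the requested columns for this slice of rows.
        batch = {name: data[name][start:start + step] for name in names}
        for name, values in udf(batch).items():
            # Model assumption: the UDF returns one value per input row.
            if len(values) != rows_in_batch:
                raise ValueError("UDF must return one value per input row")
            new_columns.setdefault(name, []).extend(values)
    # Existing columns come first; the new columns are appended after them.
    return {**data, **new_columns}


def double_a(batch):
    # Only column "a" is present because read_columns=["a"] is passed below.
    return {"double_a": [2 * value for value in batch["a"]]}


data = {"a": [1, 2, 3], "b": ["x", "y", "z"]}
result = apply_batch_udf(data, double_a, read_columns=["a"], batch_size=2)
```

Here the three rows are processed as one batch of two rows and one batch of one row, and the UDF never sees column `b`. Restricting the columns a UDF reads in this way is the purpose `read_columns` serves in the parameter description above.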
Examples
>>> import lance
>>> import pandas as pd
>>> import pyarrow as pa
>>> table = pa.table({"a": [1, 2, 3]})
>>> dataset = lance.write_dataset(table, "my_dataset")
>>> @lance.batch_udf()
... def double_a(batch):
...     df = batch.to_pandas()
...     return pd.DataFrame({'double_a': 2 * df['a']})
>>> dataset.add_columns(double_a)
>>> dataset.to_table().to_pandas()
   a  double_a
0  1         2
1  2         4
2  3         6
>>> dataset.add_columns({"triple_a": "a * 3"})
>>> dataset.to_table().to_pandas()
   a  double_a  triple_a
0  1         2         3
1  2         4         6
2  3         6         9
See also
LanceDataset.merge
Merge a pre-computed set of columns into the dataset.