- lance.write_dataset(data_obj: ReaderLike, uri: str | Path | LanceDataset, schema: pa.Schema | None = None, mode: str = 'create', *, max_rows_per_file: int = 1048576, max_rows_per_group: int = 1024, max_bytes_per_file: int = 96636764160, commit_lock: CommitLock | None = None, progress: FragmentWriteProgress | None = None, storage_options: dict[str, str] | None = None, data_storage_version: str | None = None, use_legacy_format: bool | None = None, enable_v2_manifest_paths: bool = False, enable_move_stable_row_ids: bool = False) -> LanceDataset
Write a given data_obj to the given uri.
- Parameters:
- data_obj : Reader-like
The data to be written. Acceptable types are:
  - Pandas DataFrame, Pyarrow Table, Dataset, Scanner, or RecordBatchReader
  - Huggingface dataset
- uri : str, Path, or LanceDataset
Where to write the dataset to (directory). If a LanceDataset is passed, the session will be reused.
- schema : Schema, optional
If specified and the input is a pandas DataFrame, use this schema instead of the default pandas-to-Arrow table conversion.
- mode : str
  - create: create a new dataset (raises if uri already exists).
  - overwrite: create a new snapshot version.
  - append: create a new version that is the concatenation of the input and the latest version (raises if uri does not exist).
- max_rows_per_file : int, default 1024 * 1024
The max number of rows to write before starting a new file
- max_rows_per_group : int, default 1024
The max number of rows before starting a new group (in the same file)
- max_bytes_per_file : int, default 90 * 1024 * 1024 * 1024
The max number of bytes to write before starting a new file. This is a soft limit: it is checked only after each group is written, so larger groups can cause a file to overshoot it by a significant margin. This defaults to 90 GB, since we have a hard limit of 100 GB per file on object stores.
- commit_lock : CommitLock, optional
A custom commit lock. Only needed if your object store does not support atomic commits. See the user guide for more details.
- progress : FragmentWriteProgress, optional
Experimental API. Progress tracking for fragment writes. Pass a custom class that defines hooks to be called when each fragment starts and finishes writing.
- storage_options : optional, dict
Extra options that make sense for a particular storage connection. This is used to store connection parameters like credentials, endpoint, etc.
- data_storage_version : optional, str, default None
The version of the data storage format to use. Newer versions are more efficient but require newer versions of lance to read. The default (None) will use the latest stable version. See the user guide for more details.
- use_legacy_format : optional, bool, default None
Deprecated method for setting the data storage version. Use the data_storage_version parameter instead.
- enable_v2_manifest_paths : bool, optional
If True, and this is a new dataset, uses the new V2 manifest paths. These paths provide more efficient opening of datasets with many versions on object stores. This parameter has no effect if the dataset already exists. To migrate an existing dataset, instead use the LanceDataset.migrate_manifest_paths_v2() method. Default is False.
- enable_move_stable_row_ids : bool, optional
Experimental parameter: if set to true, the writer will use move-stable row ids. These row ids are stable after compaction operations, but not after updates. This makes compaction more efficient, since with stable row ids no secondary indices need to be updated to point to new row ids.