lance.write_dataset - Lance documentation

lance.write_dataset(data_obj: ReaderLike, uri: str | Path | LanceDataset, schema: pa.Schema | None = None, mode: str = 'create', *, max_rows_per_file: int = 1048576, max_rows_per_group: int = 1024, max_bytes_per_file: int = 96636764160, commit_lock: CommitLock | None = None, progress: FragmentWriteProgress | None = None, storage_options: dict[str, str] | None = None, data_storage_version: str | None = None, use_legacy_format: bool | None = None, enable_v2_manifest_paths: bool = False, enable_move_stable_row_ids: bool = False, auto_cleanup_options: AutoCleanupConfig | None = None) → LanceDataset

Write the given data_obj to the given uri.

Parameters:

data_obj : Reader-like: The data to be written. Acceptable types are:
- Pandas DataFrame, Pyarrow Table, Dataset, Scanner, or RecordBatchReader
- Huggingface dataset
uri : str, Path, or LanceDataset: Where to write the dataset to (directory). If a LanceDataset is passed, the session will be reused.
schema : Schema, optional: If specified and the input is a pandas DataFrame, use this schema instead of the default pandas to arrow table conversion.
mode : str: One of:
- create - create a new dataset (raises if uri already exists).
- overwrite - create a new snapshot version.
- append - create a new version that is the concatenation of the latest version and the input (raises if uri does not exist).
max_rows_per_file : int, default 1024 * 1024: The max number of rows to write before starting a new file.
max_rows_per_group : int, default 1024: The max number of rows before starting a new group (in the same file).
max_bytes_per_file : int, default 90 * 1024 * 1024 * 1024: The max number of bytes to write before starting a new file. This is a soft limit: it is checked only after each group is written, so large groups can push a file noticeably past it. The default is 90 GB, which leaves headroom below the hard limit of 100 GB per file on object stores.
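The soft-limit behavior can be illustrated with a toy model in plain Python. This is not Lance's writer; `split_groups_into_files` is a hypothetical helper that only mimics the "check after each group" rule described above, using made-up group sizes.

```python
# The documented default, written out: 90 * 1024^3 bytes.
DEFAULT_MAX_BYTES_PER_FILE = 90 * 1024 * 1024 * 1024


def split_groups_into_files(group_sizes, max_bytes_per_file):
    """Assign groups to files, checking the byte limit only after each group."""
    files, current, written = [], [], 0
    for size in group_sizes:
        current.append(size)
        written += size
        # Soft limit: the group is already in the file when the check runs.
        if written >= max_bytes_per_file:
            files.append(current)
            current, written = [], 0
    if current:
        files.append(current)
    return files


# Five groups of 40 bytes each against a 100-byte limit.
files = split_groups_into_files([40, 40, 40, 40, 40], max_bytes_per_file=100)
file_sizes = [sum(f) for f in files]
print(DEFAULT_MAX_BYTES_PER_FILE)  # 96636764160
print(file_sizes)  # [120, 80]
```

The first file ends at 120 bytes, above the 100-byte limit, because the limit is only noticed once the third group has been written.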
commit_lock : CommitLock, optional: A custom commit lock. Only needed if your object store does not support atomic commits. See the user guide for more details.
progress : FragmentWriteProgress, optional: Experimental API. Progress tracking for writing the fragment. Pass a custom class that defines hooks to be called when each fragment is starting to write and finishing writing.
storage_options : optional, dict: Extra options that make sense for a particular storage connection. This is used to store connection parameters like credentials, endpoint, etc.
data_storage_version : optional, str, default None: The version of the data storage format to use. Newer versions are more efficient but require newer versions of lance to read. The default (None) will use the latest stable version. See the user guide for more details.
use_legacy_format : optional, bool, default None: Deprecated method for setting the data storage version. Use the data_storage_version parameter instead.
enable_v2_manifest_paths : bool, optional: If True, and this is a new dataset, uses the new V2 manifest paths. These paths provide more efficient opening of datasets with many versions on object stores. This parameter has no effect if the dataset already exists. To migrate an existing dataset, instead use the LanceDataset.migrate_manifest_paths_v2() method. Default is False.
enable_move_stable_row_ids : bool, optional: Experimental parameter: if set to true, the writer will use move-stable row ids. These row ids are stable after compaction operations, but not after updates. This makes compaction more efficient, since with stable row ids no secondary indices need to be updated to point to new row ids.
auto_cleanup_options : optional, AutoCleanupConfig: Config options for automatic cleanup of the dataset. If set, and this is a new dataset, old dataset versions will be automatically cleaned up according to this parameter. To add autocleaning to an existing dataset, use Dataset::update_config to set lance.auto_cleanup.interval and lance.auto_cleanup.older_than. Both parameters must be set to invoke autocleaning. If you do not set this parameter (the default behavior), no autocleaning will be performed. Note: this option only takes effect when creating a new dataset; it has no effect on existing datasets.