lance.LanceFragment.to_batches - Lance documentation

lance.LanceFragment.to_batches(self, Schema schema=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, bool cache_metadata=True, MemoryPool memory_pool=None)

Read the fragment as materialized record batches.

Parameters:

schema : Schema, optional

Concrete schema to use for scanning.

columns : list of str, default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

batch_size : int, default 131_072

The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

use_threads : bool, default True

If enabled, then maximum parallelism will be used determined by the number of available CPU cores.

cache_metadata : bool, default True

If enabled, metadata may be cached when scanning to speed up repeated scans.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

Returns:

record_batches

Return type:

iterator of RecordBatch