File Format
File Structure
Each .lance file is the container for the actual data.
At the tail of the file, ColumnMetadata protobuf blocks describe the encoding of the columns in the file.
message ColumnMetadata {
  // This describes a page of column data.
  message Page {
    // The file offsets for each of the page buffers
    //
    // The number of buffers is variable and depends on the encoding. There
    // may be zero buffers (e.g. constant encoded data) in which case this
    // could be empty.
    repeated uint64 buffer_offsets = 1;
    // The size (in bytes) of each of the page buffers
    //
    // This field will have the same length as `buffer_offsets` and
    // may be empty.
    repeated uint64 buffer_sizes = 2;
    // Logical length (e.g. # rows) of the page
    uint64 length = 3;
    // The encoding used to encode the page
    Encoding encoding = 4;
    // The priority of the page
    //
    // For tabular data this will be the top-level row number of the first row
    // in the page (and top-level rows should not split across pages).
    uint64 priority = 5;
  }
  // Encoding information about the column itself. This typically describes
  // how to interpret the column metadata buffers. For example, it could
  // describe how statistics or dictionaries are stored in the column metadata.
  Encoding encoding = 1;
  // The pages in the column
  repeated Page pages = 2;
  // The file offsets of each of the column metadata buffers
  //
  // There may be zero buffers.
  repeated uint64 buffer_offsets = 3;
  // The size (in bytes) of each of the column metadata buffers
  //
  // This field will have the same length as `buffer_offsets` and
  // may be empty.
  repeated uint64 buffer_sizes = 4;
}
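As an illustration of how the paired buffer_offsets and buffer_sizes fields locate page data, here is a minimal Python sketch. The Page dataclass is a hypothetical stand-in for the class a protobuf compiler would generate, and page_byte_ranges and read_page_buffers are illustrative helper names, not part of any Lance API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Page:
    """Hypothetical stand-in for the generated ColumnMetadata.Page class."""
    buffer_offsets: List[int] = field(default_factory=list)
    buffer_sizes: List[int] = field(default_factory=list)
    length: int = 0    # logical length, e.g. number of rows
    priority: int = 0  # top-level row number of the first row in the page


def page_byte_ranges(page: Page) -> List[Tuple[int, int]]:
    """Return one (start, end) byte range per page buffer."""
    if len(page.buffer_offsets) != len(page.buffer_sizes):
        raise ValueError("buffer_offsets and buffer_sizes must have the same length")
    return [(off, off + size)
            for off, size in zip(page.buffer_offsets, page.buffer_sizes)]


def read_page_buffers(file_bytes: bytes, page: Page) -> List[bytes]:
    """Slice each page buffer out of an in-memory copy of the file."""
    return [file_bytes[start:end] for start, end in page_byte_ranges(page)]
```

A page with zero buffers (for example constant encoded data) simply yields an empty list.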
A Footer describes the overall layout of the file. The entire file layout is described here:
// Note: the number of buffers (BN) is independent of the number of columns (CN)
// and pages.
//
// Buffers often need to be aligned. 64-byte alignment is common when
// working with SIMD operations. 4096-byte alignment is common when
// working with direct I/O. In order to ensure these buffers are aligned
// writers may need to insert padding before the buffers.
//
// If direct I/O is required then most (but not all) fields described
// below must be sector aligned. We have marked these fields with an
// asterisk for clarity. Readers should assume there will be optional
// padding inserted before these fields.
//
// All footer fields are unsigned integers written with little endian
// byte order.
//
// ├──────────────────────────────────┤
// | Data Pages                       |
// |   Data Buffer 0*                 |
// |   ...                            |
// |   Data Buffer BN*                |
// ├──────────────────────────────────┤
// | Column Metadatas                 |
// | |A| Column 0 Metadata*           |
// |     Column 1 Metadata*           |
// |     ...                          |
// |     Column CN Metadata*          |
// ├──────────────────────────────────┤
// | Column Metadata Offset Table     |
// | |B| Column 0 Metadata Position*  |
// |     Column 0 Metadata Size       |
// |     ...                          |
// |     Column CN Metadata Position  |
// |     Column CN Metadata Size      |
// ├──────────────────────────────────┤
// | Global Buffers Offset Table      |
// | |C| Global Buffer 0 Position*    |
// |     Global Buffer 0 Size         |
// |     ...                          |
// |     Global Buffer GN Position    |
// |     Global Buffer GN Size        |
// ├──────────────────────────────────┤
// | Footer                           |
// | A u64: Offset to column meta 0   |
// | B u64: Offset to CMO table       |
// | C u64: Offset to GBO table       |
// |   u32: Number of global bufs     |
// |   u32: Number of columns         |
// |   u16: Major version             |
// |   u16: Minor version             |
// |   "LANC"                         |
// ├──────────────────────────────────┤
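The footer described above has a fixed size, so a reader can locate it by seeking from the end of the file. The following Python sketch decodes it with the standard struct module. It assumes the fields are packed back to back with no padding (3 x u64, 2 x u32, 2 x u16, then the 4 magic bytes, 40 bytes in total) and that the magic is the ASCII string "LANC". The Footer tuple and parse_footer function are illustrative names, not part of the Lance API.

```python
import struct
from typing import NamedTuple

# Three u64, two u32, two u16, then the 4-byte magic, all little endian.
# "<" disables native alignment, so the struct is exactly 40 bytes.
FOOTER = struct.Struct("<QQQIIHH4s")
MAGIC = b"LANC"  # assumption: the magic is stored as plain ASCII bytes


class Footer(NamedTuple):
    column_meta_start: int   # A: offset to column 0 metadata
    cmo_table_offset: int    # B: offset to the column metadata offset table
    gbo_table_offset: int    # C: offset to the global buffers offset table
    num_global_buffers: int
    num_columns: int
    major_version: int
    minor_version: int


def parse_footer(file_bytes: bytes) -> Footer:
    """Decode the footer from the last FOOTER.size bytes of a file."""
    if len(file_bytes) < FOOTER.size:
        raise ValueError("file is too small to contain a footer")
    *fields, magic = FOOTER.unpack(file_bytes[-FOOTER.size:])
    if magic != MAGIC:
        raise ValueError(f"bad magic bytes: {magic!r}")
    return Footer(*fields)
```

With the footer decoded, the B and C offsets point at the two offset tables, whose entries give the position and size of each column metadata block and global buffer.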
File Version
The Lance file format has gone through a number of changes including a breaking change from version 1 to version 2. There are a number of APIs that allow the file version to be specified. Using a newer version of the file format will lead to better compression and/or performance. However, older software versions may not be able to read newer files.
In addition, the latest version of the file format (next) is unstable and should not be used for production use cases.
Breaking changes could be made to unstable encodings, which would mean that files written with these encodings are no longer readable by any newer versions of Lance. The next version should only be used for experimentation and benchmarking upcoming features.
The following values are supported:
Version | Minimum Lance Version | Maximum Lance Version | Description |
---|---|---|---|
0.1 | Any | Any | This is the initial Lance format. |
2.0 | 0.16.0 | Any | Rework of the Lance file format that removed row groups and introduced null support for lists, fixed size lists, and primitives |
2.1 (unstable) | None | Any | Enhances integer and string compression, adds support for nulls in struct fields, and improves random access performance with nested fields. |
legacy | N/A | N/A | Alias for 0.1 |
stable | N/A | N/A | Alias for the latest stable version (currently 2.0) |
next | N/A | N/A | Alias for the latest unstable version (currently 2.1) |
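The alias rows in the table can be read as a simple lookup. The sketch below mirrors the table as it stands at the time of writing; the names ALIASES, resolve_file_version and is_unstable are hypothetical and only illustrate the mapping, they are not Lance functions.

```python
# Alias targets copied from the table above; they will change as new
# versions are stabilised.
ALIASES = {"legacy": "0.1", "stable": "2.0", "next": "2.1"}
KNOWN_VERSIONS = {"0.1", "2.0", "2.1"}
UNSTABLE_VERSIONS = {"2.1"}


def resolve_file_version(name: str) -> str:
    """Map an alias or explicit version string to a concrete version."""
    version = ALIASES.get(name, name)
    if version not in KNOWN_VERSIONS:
        raise ValueError(f"unsupported file version: {name!r}")
    return version


def is_unstable(name: str) -> bool:
    """True if the resolved version should not be used in production."""
    return resolve_file_version(name) in UNSTABLE_VERSIONS
```

For example, resolve_file_version("stable") yields "2.0", while is_unstable("next") is true because next currently points at 2.1.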
File Encodings
Lance supports a variety of encodings for different data types. The encodings are chosen to give both random access and scan performance. Encodings are added over time and may be extended in the future. The manifest records a max format version which controls which encodings will be used. This allows for a gradual migration to a new data format so that old readers can still read new data while a migration is in progress.
Encodings are divided into "field encodings" and "array encodings".
Field encodings are consistent across an entire field of data, while array encodings are used for individual pages of data within a field.
Array encodings can nest other array encodings (e.g. a dictionary encoding can bitpack the indices); however, array encodings cannot nest field encodings.
For this reason, data types such as Dictionary<UInt8, List<String>> are not yet supported (since there is no dictionary field encoding).
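To make the nesting of array encodings concrete, the following Python sketch dictionary-encodes a list of strings and then bitpacks the resulting indices. It only illustrates the idea: the bit layout used here (least significant bits first, little-endian bytes) is an assumption for the example and is not Lance's on-disk layout, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def dictionary_encode(values: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Split values into a list of unique entries and per-value indices."""
    dictionary: List[str] = []
    positions: Dict[str, int] = {}
    indices: List[int] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def bitpack(values: Sequence[int], bit_width: int) -> bytes:
    """Pack non-negative integers using bit_width bits each."""
    acc = 0
    for i, value in enumerate(values):
        if value < 0 or value >> bit_width:
            raise ValueError(f"{value} does not fit in {bit_width} bits")
        acc |= value << (i * bit_width)
    n_bytes = (len(values) * bit_width + 7) // 8
    return acc.to_bytes(n_bytes, "little")


def bitunpack(data: bytes, bit_width: int, count: int) -> List[int]:
    """Inverse of bitpack."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [(acc >> (i * bit_width)) & mask for i in range(count)]


values = ["red", "green", "red", "blue", "green", "red"]
dictionary, indices = dictionary_encode(values)
# Three unique values need only 2 bits per index.
width = max(1, (len(dictionary) - 1).bit_length())
packed = bitpack(indices, width)
```

Here six indices occupy 12 bits, so the packed buffer is 2 bytes instead of one full-width integer per value.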
Encodings Available
Encoding Name | Encoding Type | What it does | Supported Versions | When it is applied |
---|---|---|---|---|
Basic struct | Field encoding | Encodes non-nullable struct data | >= 2.0 | Default encoding for structs |
List | Field encoding | Encodes lists (nullable or non-nullable) | >= 2.0 | Default encoding for lists |
Basic Primitive | Field encoding | Encodes primitive data types using separate validity array | >= 2.0 | Default encoding for primitive data types |
Value | Array encoding | Encodes a single vector of fixed-width values | >= 2.0 | Fallback encoding for fixed-width types |
Binary | Array encoding | Encodes a single vector of variable-width data | >= 2.0 | Fallback encoding for variable-width types |
Dictionary | Array encoding | Encodes data using a dictionary array and an indices array which is useful for large data types with few unique values | >= 2.0 | Used on string pages with fewer than 100 unique elements |
Packed struct | Array encoding | Encodes a struct with fixed-width fields in a row-major format making random access more efficient | >= 2.0 | Only used on struct types if the field metadata attribute "packed" is set to "true" |
Fsst | Array encoding | Compresses binary data by identifying common substrings (of 8 bytes or less) and encoding them as symbols | >= 2.1 | Used on string pages that are not dictionary encoded |
Bitpacking | Array encoding | Encodes a single vector of fixed-width values using bitpacking which is useful for integral types that do not span the full range of values | >= 2.1 | Used on integral types |
Statistics
Statistics are stored within Lance files. The statistics can be used to determine which pages can be skipped within a query. The null count, lower bound (min), and upper bound (max) are stored.
Statistics themselves are stored in Lance's columnar format, which allows for selectively reading only relevant stats columns.
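As a rough illustration of page skipping, the sketch below keeps only the pages whose [min, max] range can overlap a range predicate. PageStats and pages_to_scan are hypothetical names, integer values are used for simplicity, and a bound of None models an unknown (null) bound, which can never justify skipping a page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageStats:
    null_count: int
    min_value: Optional[int]  # None models an unknown lower bound
    max_value: Optional[int]  # None models an unknown upper bound


def pages_to_scan(stats: List[PageStats], lower: int, upper: int) -> List[int]:
    """Indices of pages that may contain a value in [lower, upper]."""
    keep = []
    for i, page in enumerate(stats):
        if page.min_value is None or page.max_value is None:
            keep.append(i)  # unknown bounds: the page cannot be skipped
        elif page.max_value >= lower and page.min_value <= upper:
            keep.append(i)  # ranges overlap: the page may hold matches
    return keep
```

Because the bounds are only bounds, a kept page is not guaranteed to contain a matching row; a skipped page is guaranteed not to.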
Statistic Values
Three types of statistics are stored per column: null count, min value, max value. The min and max values are stored as their native data types in arrays.
There are special behaviors for different data types to account for nulls:
- For integer-based data types (including signed and unsigned integers, dates, and timestamps), if the min and max are unknown (all values are null), then the minimum/maximum representable values should be used instead.
- For float data types, if the min and max are unknown, then use -Inf and +Inf, respectively. (-Inf and +Inf may also be used for min and max if those values are present in the arrays.) NaN values should be ignored for the purpose of min and max statistics. If the max value is zero (negative or positive), the max value should be recorded as +0.0. Likewise, if the min value is zero (positive or negative), it should be recorded as -0.0.
- For binary data types, if the min or max are unknown or unrepresentable, then use null value.
Binary data type bounds can also be truncated. For example, an array containing just the value "abcd" could have a truncated min of "abc" and max of "abd". If there is no truncated value greater than the maximum value, then instead use null for the maximum.
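The float and binary rules above can be sketched in Python as follows. This is a minimal illustration rather than Lance's implementation: float_min_max and truncated_bounds are hypothetical names, None stands for a null value, an input containing only NaN values is treated as unknown (an assumption), and the truncation strategy (take a prefix, then increment its last byte that is not 0xFF) is one possible way to produce a valid upper bound.

```python
import math
from typing import Iterable, Optional, Tuple


def float_min_max(values: Iterable[Optional[float]]) -> Tuple[float, float]:
    """Min/max statistics for floats, ignoring nulls and NaN values."""
    lo, hi = math.inf, -math.inf
    seen = False
    for value in values:
        if value is None or math.isnan(value):
            continue
        seen = True
        lo = min(lo, value)
        hi = max(hi, value)
    if not seen:
        return -math.inf, math.inf  # unknown bounds
    if lo == 0.0:
        lo = -0.0  # a zero minimum is always recorded as -0.0
    if hi == 0.0:
        hi = 0.0   # a zero maximum is always recorded as +0.0
    return lo, hi


def truncated_bounds(
    min_value: bytes, max_value: bytes, prefix_len: int
) -> Tuple[bytes, Optional[bytes]]:
    """Truncate binary bounds to at most prefix_len bytes."""
    lower = min_value[:prefix_len]  # a prefix never sorts after the full value
    if len(max_value) <= prefix_len:
        return lower, max_value     # short enough to store exactly
    prefix = bytearray(max_value[:prefix_len])
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()                # 0xFF cannot be incremented
    if not prefix:
        return lower, None          # no truncated value exceeds the maximum
    prefix[-1] += 1
    return lower, bytes(prefix)
```

With a prefix length of 3, truncated_bounds(b"abcd", b"abcd", 3) reproduces the example from the text: a min of b"abc" and a max of b"abd".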
Warning
The min and max values are not guaranteed to be within the array; they are simply upper and lower bounds. Two common cases where they are not contained in the array are when the original min or max value was deleted and when binary data is truncated. Therefore, statistics should not be used to answer queries such as SELECT max(col) FROM table.
Page-level Statistics Format
Page-level statistics are stored as arrays within the Lance file. Each array contains one entry per page and is therefore num_pages long. The page offsets are stored in an array just like the data page table. The offset to the statistics page table is stored in the metadata.
The schema for the statistics is:
<field_id_1>: struct
    null_count: i64
    min_value: <field_1_data_type>
    max_value: <field_1_data_type>
...
<field_id_N>: struct
    null_count: i64
    min_value: <field_N_data_type>
    max_value: <field_N_data_type>
Any number of fields may be missing, as statistics for some fields or of some kind may be skipped. In addition, readers should expect there may be extra fields that are not in this schema. These should be ignored. Future changes to the format may add additional fields, but these changes will be backwards compatible.
However, writers should not write extra fields that aren't described in this document. Until they are defined in the specification, there is no guarantee that readers will be able to safely interpret new forms of statistics.
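A reader that follows these rules can be sketched as below. The example models one row of the statistics table as a plain dictionary keyed by field id; that representation, along with the names KNOWN_STATS and read_field_stats, is an assumption made for illustration and not how Lance exposes statistics.

```python
from typing import Any, Dict

# The statistics defined by this document. Anything else is ignored.
KNOWN_STATS = ("null_count", "min_value", "max_value")


def read_field_stats(stats_row: Dict[str, Dict[str, Any]], field_id: int) -> Dict[str, Any]:
    """Return the known statistics for a field, tolerating gaps and extras."""
    raw = stats_row.get(str(field_id)) or {}  # the whole field may be missing
    return {name: raw[name] for name in KNOWN_STATS if name in raw}
```

Missing fields or missing statistics produce a smaller result rather than an error, and unknown statistics written by a future version are dropped silently.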