File Format

File Structure

Each .lance file is the container for the actual data.

[Figure: Format Overview]

At the tail of the file, ColumnMetadata protobuf blocks are used to describe the encoding of the columns in the file.

message ColumnMetadata {

  // This describes a page of column data.
  message Page {
    // The file offsets for each of the page buffers
    //
    // The number of buffers is variable and depends on the encoding.  There
    // may be zero buffers (e.g. constant encoded data) in which case this
    // could be empty.
    repeated uint64 buffer_offsets = 1;
    // The size (in bytes) of each of the page buffers
    //
    // This field will have the same length as `buffer_offsets` and
    // may be empty.
    repeated uint64 buffer_sizes = 2;
    // Logical length (e.g. # rows) of the page
    uint64 length = 3;
    // The encoding used to encode the page
    Encoding encoding = 4;
    // The priority of the page
    //
    // For tabular data this will be the top-level row number of the first row
    // in the page (and top-level rows should not split across pages).
    uint64 priority = 5;
  }
  // Encoding information about the column itself.  This typically describes
  // how to interpret the column metadata buffers.  For example, it could
  // describe how statistics or dictionaries are stored in the column metadata.
  Encoding encoding = 1;
  // The pages in the column
  repeated Page pages = 2;
  // The file offsets of each of the column metadata buffers
  //
  // There may be zero buffers.
  repeated uint64 buffer_offsets = 3;
  // The size (in bytes) of each of the column metadata buffers
  //
  // This field will have the same length as `buffer_offsets` and
  // may be empty.
  repeated uint64 buffer_sizes = 4;

}
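As a rough illustration of how a reader might use the page fields above, the sketch below turns `buffer_offsets` and `buffer_sizes` into byte ranges to fetch. The `Page` dataclass and `page_byte_ranges` helper are hypothetical plain-Python stand-ins, not the generated protobuf classes or any Lance API, and the `encoding` field is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Page:
    """Hypothetical plain-Python mirror of ColumnMetadata.Page (encoding omitted)."""
    buffer_offsets: List[int] = field(default_factory=list)
    buffer_sizes: List[int] = field(default_factory=list)
    length: int = 0    # logical length, e.g. number of rows
    priority: int = 0  # top-level row number of the first row in the page


def page_byte_ranges(page: Page) -> List[Tuple[int, int]]:
    """Return (start, end) byte ranges, end exclusive, one per page buffer."""
    if len(page.buffer_offsets) != len(page.buffer_sizes):
        raise ValueError("buffer_offsets and buffer_sizes must have the same length")
    return [(off, off + size)
            for off, size in zip(page.buffer_offsets, page.buffer_sizes)]


# A page with two buffers, and a page with no buffers at all
# (e.g. constant encoded data).
two_buffers = Page(buffer_offsets=[0, 4096], buffer_sizes=[1000, 512], length=250)
no_buffers = Page(length=250, priority=250)

print(page_byte_ranges(two_buffers))  # [(0, 1000), (4096, 4608)]
print(page_byte_ranges(no_buffers))   # []
```

The length check reflects the comment that `buffer_sizes` has the same length as `buffer_offsets`; a page with no buffers simply yields no ranges.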

A Footer describes the overall layout of the file. The entire file layout is described here:

// Note: the number of buffers (BN) is independent of the number of columns (CN)
//       and pages.
//
//       Buffers often need to be aligned.  64-byte alignment is common when
//       working with SIMD operations.  4096-byte alignment is common when
//       working with direct I/O.  In order to ensure these buffers are aligned
//       writers may need to insert padding before the buffers.
//       
//       If direct I/O is required then most (but not all) fields described
//       below must be sector aligned.  We have marked these fields with an
//       asterisk for clarity.  Readers should assume there will be optional
//       padding inserted before these fields.
//
//       All footer fields are unsigned integers written with little endian
//       byte order.
//
// ├──────────────────────────────────┤
// | Data Pages                       |
// |   Data Buffer 0*                 |
// |   ...                            |
// |   Data Buffer BN*                |
// ├──────────────────────────────────┤
// | Column Metadatas                 |
// | |A| Column 0 Metadata*           |
// |     Column 1 Metadata*           |
// |     ...                          |
// |     Column CN Metadata*          |
// ├──────────────────────────────────┤
// | Column Metadata Offset Table     |
// | |B| Column 0 Metadata Position*  |
// |     Column 0 Metadata Size       |
// |     ...                          |
// |     Column CN Metadata Position  |
// |     Column CN Metadata Size      |
// ├──────────────────────────────────┤
// | Global Buffers Offset Table      |
// | |C| Global Buffer 0 Position*    |
// |     Global Buffer 0 Size         |
// |     ...                          |
// |     Global Buffer GN Position    |
// |     Global Buffer GN Size        |
// ├──────────────────────────────────┤
// | Footer                           |
// | A u64: Offset to column meta 0   |
// | B u64: Offset to CMO table       |
// | C u64: Offset to GBO table       |
// |   u32: Number of global bufs     |
// |   u32: Number of columns         |
// |   u16: Major version             |
// |   u16: Minor version             |
// |   "LANC"                         |
// ├──────────────────────────────────┤
//
// File Layout-End
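The footer in the layout above has a fixed size: three u64 values, two u32 values, two u16 values, and the 4-byte magic, for 40 bytes in total. The sketch below decodes it with Python's standard `struct` module. The `Footer` tuple, its field names, and `parse_footer` are illustrative names chosen here, not part of Lance, and the values in the usage example are made up.

```python
import struct
from typing import NamedTuple

# "<" selects little endian with no padding: Q = u64, I = u32, H = u16,
# 4s = four raw bytes. Total size: 3*8 + 2*4 + 2*2 + 4 = 40 bytes.
FOOTER_STRUCT = struct.Struct("<QQQIIHH4s")
MAGIC = b"LANC"


class Footer(NamedTuple):
    column_meta_offset: int  # A: offset to column metadata 0
    cmo_table_offset: int    # B: offset to the column metadata offset table
    gbo_table_offset: int    # C: offset to the global buffers offset table
    num_global_buffers: int
    num_columns: int
    major_version: int
    minor_version: int


def parse_footer(data: bytes) -> Footer:
    """Decode the footer from the last 40 bytes of `data`."""
    if len(data) < FOOTER_STRUCT.size:
        raise ValueError("too few bytes to contain a footer")
    *fields, magic = FOOTER_STRUCT.unpack(data[-FOOTER_STRUCT.size:])
    if magic != MAGIC:
        raise ValueError(f"bad magic bytes: {magic!r}")
    return Footer(*fields)


# Usage with a synthetic tail; every number here is invented for illustration.
tail = b"\x00" * 16 + FOOTER_STRUCT.pack(1024, 2048, 2080, 1, 2, 2, 0, MAGIC)
footer = parse_footer(tail)
print(footer.num_columns, footer.cmo_table_offset)  # 2 2048
```

A reader would typically fetch only the end of the file first, decode the footer, and then follow offsets B and C to the two offset tables.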

File Version

The Lance file format has gone through a number of changes, including a breaking change from version 1 to version 2. Several APIs allow the file version to be specified. A newer version of the file format generally gives better compression and/or performance. However, older software versions may not be able to read newer files.

In addition, the latest version of the file format (next) is unstable and should not be used for production use cases. Breaking changes could be made to unstable encodings and that would mean that files written with these encodings are no longer readable by any newer versions of Lance. The next version should only be used for experimentation and benchmarking upcoming features.

The following values are supported:

Version	Minimum Lance Version	Maximum Lance Version	Description
0.1	Any	Any	This is the initial Lance format.
2.0	0.16.0	Any	Rework of the Lance file format that removed row groups and introduced null support for lists, fixed size lists, and primitives.
2.1 (unstable)	None	Any	Enhances integer and string compression, adds support for nulls in struct fields, and improves random access performance with nested fields.
legacy	N/A	N/A	Alias for 0.1
stable	N/A	N/A	Alias for the latest stable version (currently 2.0)
next	N/A	N/A	Alias for the latest unstable version (currently 2.1)
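The alias rows in the table can be read as a simple lookup. The sketch below is a hypothetical helper, not a Lance API, that resolves an alias to a concrete version and refuses the unstable version unless the caller opts in. The tables are copied from the values listed above and would go stale whenever the stable or unstable version changes.

```python
# Copied from the version table above; not read from any Lance library.
ALIASES = {"legacy": "0.1", "stable": "2.0", "next": "2.1"}
KNOWN_VERSIONS = {"0.1", "2.0", "2.1"}
UNSTABLE_VERSIONS = {"2.1"}


def resolve_file_version(requested: str, allow_unstable: bool = False) -> str:
    """Map an alias or explicit version string to a concrete file version."""
    version = ALIASES.get(requested, requested)
    if version not in KNOWN_VERSIONS:
        raise ValueError(f"unknown file version: {requested!r}")
    if version in UNSTABLE_VERSIONS and not allow_unstable:
        raise ValueError(f"file version {version} is unstable; opt in explicitly")
    return version


print(resolve_file_version("stable"))                     # 2.0
print(resolve_file_version("legacy"))                     # 0.1
print(resolve_file_version("next", allow_unstable=True))  # 2.1
```

Requiring an explicit opt-in mirrors the guidance that the unstable version is meant for experimentation and benchmarking only.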

File Encodings

Lance supports a variety of encodings for different data types. The encodings are chosen to give good performance for both random access and scans. Encodings are added over time and may be extended in the future. The manifest records a maximum format version, which controls which encodings will be used. This allows for a gradual migration to a new data format so that old readers can still read new data while a migration is in progress.

Encodings are divided into "field encodings" and "array encodings". Field encodings are consistent across an entire field of data, while array encodings are used for individual pages of data within a field. Array encodings can nest other array encodings (e.g. a dictionary encoding can bitpack its indices). However, array encodings cannot nest field encodings. For this reason, data types such as Dictionary<UInt8, List<String>> are not yet supported (since there is no dictionary field encoding).
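To make the idea of nested array encodings concrete, the following sketch dictionary-encodes a small page of strings and then bitpacks the resulting indices. It is a simplified illustration only: the function names are invented here, and the bit layout (least significant bits first, little endian) is an assumption of this example rather than Lance's actual on-disk layout.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (dictionary, indices), with dictionary entries in first-seen order."""
    positions: Dict[str, int] = {}
    dictionary: List[str] = []
    indices: List[int] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def bitpack(indices: List[int], bit_width: int) -> bytes:
    """Pack each index into `bit_width` bits, least significant bits first."""
    acc = 0
    for i, idx in enumerate(indices):
        if idx < 0 or idx >> bit_width:
            raise ValueError(f"{idx} does not fit in {bit_width} bits")
        acc |= idx << (i * bit_width)
    return acc.to_bytes((len(indices) * bit_width + 7) // 8, "little")


def bitunpack(data: bytes, bit_width: int, count: int) -> List[int]:
    """Inverse of bitpack."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [(acc >> (i * bit_width)) & mask for i in range(count)]


page = ["red", "green", "red", "blue", "green", "red"]
dictionary, indices = dictionary_encode(page)
# Smallest width that can hold the largest index (at least one bit).
bit_width = max(1, (len(dictionary) - 1).bit_length())
packed = bitpack(indices, bit_width)

print(dictionary, indices, bit_width, len(packed))
decoded = [dictionary[i] for i in bitunpack(packed, bit_width, len(indices))]
assert decoded == page
```

Here the outer encoding (dictionary) hands its indices to an inner encoding (bitpacking), which is the kind of nesting the paragraph describes.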

Encodings Available

Encoding Name	Encoding Type	What it does	Supported Versions	When it is applied
Basic struct	Field encoding	Encodes non-nullable struct data	>= 2.0	Default encoding for structs
List	Field encoding	Encodes lists (nullable or non-nullable)	>= 2.0	Default encoding for lists
Basic Primitive	Field encoding	Encodes primitive data types using separate validity array	>= 2.0	Default encoding for primitive data types
Value	Array encoding	Encodes a single vector of fixed-width values	>= 2.0	Fallback encoding for fixed-width types
Binary	Array encoding	Encodes a single vector of variable-width data	>= 2.0	Fallback encoding for variable-width types
Dictionary	Array encoding	Encodes data using a dictionary array and an indices array which is useful for large data types with few unique values	>= 2.0	Used on string pages with fewer than 100 unique elements
Packed struct	Array encoding	Encodes a struct with fixed-width fields in a row-major format making random access more efficient	>= 2.0	Only used on struct types if the field metadata attribute `"packed"` is set to `"true"`
Fsst	Array encoding	Compresses binary data by identifying common substrings (of 8 bytes or less) and encoding them as symbols	>= 2.1	Used on string pages that are not dictionary encoded
Bitpacking	Array encoding	Encodes a single vector of fixed-width values using bitpacking which is useful for integral types that do not span the full range of values	>= 2.1	Used on integral types
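The "When it is applied" column for string pages can be read as a small decision rule. The sketch below encodes only what the table states (dictionary below 100 unique values, FSST otherwise when the file version is at least 2.1, binary as the fallback). The function name is hypothetical, and a real writer may use other heuristics, such as estimated rather than exact cardinality.

```python
from typing import Sequence, Tuple

DICTIONARY_MAX_UNIQUE = 100  # "fewer than 100 unique elements" in the table


def _version_tuple(version: str) -> Tuple[int, int]:
    major, minor = version.split(".")
    return int(major), int(minor)


def pick_string_page_encoding(page: Sequence[str], file_version: str) -> str:
    """Pick an array encoding name for one page of strings (simplified)."""
    if len(set(page)) < DICTIONARY_MAX_UNIQUE:
        return "dictionary"
    if _version_tuple(file_version) >= (2, 1):
        return "fsst"    # only available from file version 2.1
    return "binary"      # fallback for variable-width types


low_cardinality = ["a", "b", "c"] * 50
high_cardinality = [f"row-{i}" for i in range(500)]

print(pick_string_page_encoding(low_cardinality, "2.0"))   # dictionary
print(pick_string_page_encoding(high_cardinality, "2.0"))  # binary
print(pick_string_page_encoding(high_cardinality, "2.1"))  # fsst
```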

Statistics

Statistics are stored within Lance files. The statistics can be used to determine which pages can be skipped within a query. The null count, lower bound (min), and upper bound (max) are stored.

Statistics themselves are stored in Lance's columnar format, which allows for selectively reading only relevant stats columns.
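The page-skipping use of statistics can be sketched as follows for a predicate of the form `col > threshold` on an integer column. `PageStats` and `pages_to_read_greater_than` are hypothetical names for this example and do not correspond to a Lance API. A page is skipped only when its upper bound proves that no value can match; a page with missing bounds is always read.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageStats:
    null_count: int
    min_value: Optional[int]  # None means the bound is not available
    max_value: Optional[int]


def pages_to_read_greater_than(stats: List[PageStats], threshold: int) -> List[int]:
    """Return indices of pages that may contain a value > threshold."""
    keep = []
    for page_index, page in enumerate(stats):
        # max_value is an upper bound: if it is <= threshold, nothing matches.
        if page.max_value is not None and page.max_value <= threshold:
            continue
        keep.append(page_index)
    return keep


stats = [
    PageStats(null_count=0, min_value=1, max_value=10),
    PageStats(null_count=2, min_value=11, max_value=20),
    PageStats(null_count=0, min_value=21, max_value=30),
    PageStats(null_count=5, min_value=None, max_value=None),
]

print(pages_to_read_greater_than(stats, 15))  # [1, 2, 3]
```

Because the bounds are only bounds, this kind of check can rule pages out but cannot confirm that a kept page actually contains a match.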

Statistic Values

Three types of statistics are stored per column: null count, min value, max value. The min and max values are stored as their native data types in arrays.

There are special behaviors for different data types to account for nulls:

For integer-based data types (including signed and unsigned integers, dates, and timestamps), if the min and max are unknown (all values are null), then the minimum/maximum representable values should be used instead.
For float data types, if the min and max are unknown, then use -Inf and +Inf, respectively. (-Inf and +Inf may also be used for min and max if those values are present in the arrays.) NaN values should be ignored for the purpose of min and max statistics. If the max value is zero (negative or positive), the max value should be recorded as +0.0. Likewise, if the min value is zero (positive or negative), it should be recorded as -0.0.
For binary data types, if the min or max is unknown or unrepresentable, then use a null value. Binary data type bounds can also be truncated. For example, an array containing just the value "abcd" could have a truncated min of "abc" and max of "abd". If there is no truncated value greater than the maximum value, then instead use null for the maximum.
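The float rules above can be sketched in a few lines. This is an illustrative helper, not Lance's implementation. It represents nulls as `None`, and it treats a page containing only nulls and NaNs as "unknown", which is an assumption of this example.

```python
import math
from typing import Iterable, Optional, Tuple


def float_min_max(values: Iterable[Optional[float]]) -> Tuple[float, float]:
    """Compute (min, max) statistics for floats following the rules above."""
    lo, hi = math.inf, -math.inf
    seen = False
    for value in values:
        if value is None or math.isnan(value):
            continue  # nulls and NaNs do not contribute to the bounds
        seen = True
        lo = min(lo, value)
        hi = max(hi, value)
    if not seen:
        return -math.inf, math.inf  # unknown bounds
    # -0.0 == 0.0 in Python, so these comparisons catch both signs of zero.
    if lo == 0.0:
        lo = -0.0  # a zero minimum is recorded as -0.0
    if hi == 0.0:
        hi = 0.0   # a zero maximum is recorded as +0.0
    return lo, hi


print(float_min_max([1.5, float("nan"), -2.0, None]))  # (-2.0, 1.5)
print(float_min_max([None, None]))                     # (-inf, inf)
print(float_min_max([0.0, -0.0]))                      # (-0.0, 0.0)
```

Normalizing the sign of zero keeps the bounds valid regardless of whether a reader compares -0.0 and +0.0 as equal or as ordered.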

Warning

The min and max values are not guaranteed to be within the array; they are simply lower and upper bounds. Two common cases where they are not contained in the array are when the original min or max value was deleted and when binary data is truncated. Therefore, statistics should not be used to compute queries such as SELECT max(col) FROM table.
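The binary truncation case can be illustrated with a short sketch. It assumes bytewise lexicographic ordering, and the helper name and `limit` parameter are hypothetical. The lower bound is a plain prefix; the upper bound is a prefix with its last byte incremented, and it falls back to null (`None`) when no such value exists.

```python
from typing import Optional, Tuple


def truncated_bounds(min_value: bytes, max_value: bytes,
                     limit: int) -> Tuple[bytes, Optional[bytes]]:
    """Truncate binary bounds to at most `limit` bytes while keeping them valid."""
    lower = min_value[:limit]  # a prefix always sorts <= the original value
    if len(max_value) <= limit:
        return lower, max_value  # short enough, no truncation needed
    prefix = bytearray(max_value[:limit])
    # Drop trailing 0xFF bytes, which cannot be incremented.
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()
    if not prefix:
        return lower, None  # no truncated value is greater than the maximum
    prefix[-1] += 1
    return lower, bytes(prefix)


print(truncated_bounds(b"abcd", b"abcd", 3))          # (b'abc', b'abd')
print(truncated_bounds(b"\xff\xff\xff", b"\xff\xff\xff", 2))  # (b'\xff\xff', None)
```

In the first call neither bound equals the stored value "abcd", which is exactly why the bounds cannot answer a max(col) query.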

Page-level Statistics Format

Page-level statistics are stored as arrays within the Lance file. Each array is one page long and has num_pages entries, one per data page. The page offsets are stored in an array just like the data page table. The offset to the statistics page table is stored in the metadata.

The schema for the statistics is:

<field_id_1>: struct
    null_count: i64
    min_value: <field_1_data_type>
    max_value: <field_1_data_type>
...
<field_id_N>: struct
    null_count: i64
    min_value: <field_N_data_type>
    max_value: <field_N_data_type>

Any number of fields may be missing, because statistics for some fields, or some kinds of statistics, may be skipped. In addition, readers should expect that there may be extra fields that are not in this schema. These should be ignored. Future changes to the format may add additional fields, but these changes will be backwards compatible.
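A reader that tolerates both missing and extra fields might look like the sketch below. It models one row of the statistics schema as nested dictionaries keyed by field id, which is a simplification for illustration. `read_field_stats` is a hypothetical helper, and `distinct_count` is an invented extra field used only to show that unknown statistics are dropped.

```python
from typing import Any, Dict, Optional

KNOWN_STATS = ("null_count", "min_value", "max_value")


def read_field_stats(stats_row: Dict[int, Dict[str, Any]],
                     field_id: int) -> Optional[Dict[str, Any]]:
    """Return the known statistics for one field, or None if the field is absent."""
    entry = stats_row.get(field_id)
    if entry is None:
        return None  # statistics for this field were skipped
    # Keep only statistics defined in the schema; absent ones become None.
    return {name: entry.get(name) for name in KNOWN_STATS}


stats_row = {
    1: {"null_count": 0, "min_value": 3, "max_value": 9, "distinct_count": 5},
    2: {"null_count": 4},
}

print(read_field_stats(stats_row, 1))  # extra "distinct_count" is ignored
print(read_field_stats(stats_row, 2))  # min_value and max_value are None
print(read_field_stats(stats_row, 3))  # None
```

Callers can then treat a `None` bound as "unknown" and fall back to reading the page.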

However, writers should not write extra fields that aren't described in this document. Until they are defined in the specification, there is no guarantee that readers will be able to safely interpret new forms of statistics.