Lance Formats

The Lance project includes both a table format and a file format. Lance typically refers to tables as “datasets”. A Lance dataset is designed to efficiently support secondary indices, fast ingestion and modification of data, and a rich set of schema evolution features.

Dataset Directory

A Lance Dataset is organized in a directory.

/path/to/dataset:
    data/*.lance  -- Data directory
    _versions/*.manifest -- Manifest file for each dataset version.
    _indices/{UUID-*}/index.idx -- Secondary indices, one directory per index.
    _deletions/*.{arrow,bin} -- Deletion files, which contain ids of rows
      that have been deleted.

A Manifest file contains the metadata that describes a version of the dataset.

// Manifest is a global section shared between all the files.
message Manifest {
  // All fields of the dataset, including the nested fields.
  repeated lance.file.Field fields = 1;

  // Fragments of the dataset.
  repeated DataFragment fragments = 2;

  // Snapshot version number.
  uint64 version = 3;

  // The file position of the version auxiliary data.
  //  * It is not inheritable between versions.
  //  * It is not loaded by default during query.
  uint64 version_aux_data = 4;

  // Schema metadata.
  map<string, bytes> metadata = 5;

  message WriterVersion {
    // The name of the library that created this file.
    string library = 1;
    // The version of the library that created this file. Because we cannot assume
    // that the library is semantically versioned, this is a string. However, if it
    // is semantically versioned, it should be a valid semver string without any 'v'
    // prefix. For example: `2.0.0`, `2.0.0-rc.1`.
    string version = 2;
  }

  // The version of the writer that created this file.
  //
  // This information may be used to detect whether the file may have known bugs
  // associated with that writer.
  WriterVersion writer_version = 13;

  // If present, the file position of the index metadata.
  optional uint64 index_section = 6;

  // Version creation timestamp, UTC timezone
  google.protobuf.Timestamp timestamp = 7;

  // Optional version tag
  string tag = 8;

  // Feature flags for readers.
  //
  // A bitmap of flags that indicate which features are required to be able to
  // read the table. If a reader does not recognize a flag that is set, it
  // should not attempt to read the dataset.
  //
  // Known flags:
  // * 1: deletion files are present
  // * 2: move_stable_row_ids: row IDs are tracked and stable after move operations
  //       (such as compaction), but not updates.
  // * 4: use v2 format (deprecated)
  // * 8: table config is present
  uint64 reader_feature_flags = 9;

  // Feature flags for writers.
  //
  // A bitmap of flags that indicate which features are required to be able to
  // write to the dataset. If a writer does not recognize a flag that is set, it
  // should not attempt to write to the dataset.
  //
  // The flags are the same as for reader_feature_flags, although they will not
  // always apply to both.
  uint64 writer_feature_flags = 10;

  // The highest fragment ID that has been used so far.
  //
  // This ID is not guaranteed to be present in the current version, but it may
  // have been used in previous versions.
  //
  // For a single file, will be zero.
  uint32 max_fragment_id = 11;

  // Path to the transaction file, relative to `{root}/_transactions`
  //
  // This contains a serialized Transaction message representing the transaction
  // that created this version.
  //
  // May be empty if no transaction file was written.
  //
  // The path format is "{read_version}-{uuid}.txn" where {read_version} is the
  // version of the table the transaction read from, and {uuid} is a
  // hyphen-separated UUID.
  string transaction_file = 12;

  // The next unused row id. If zero, then the table does not have any rows.
  //
  // This is only used if the "move_stable_row_ids" feature flag is set.
  uint64 next_row_id = 14;

  message DataStorageFormat {
    // The format of the data files (e.g. "lance")
    string file_format = 1;
    // The max format version of the data files.
    //
    // This is the maximum version of the file format that the dataset will create.
    // This may be lower than the maximum version that can be written in order to allow
    // older readers to read the dataset.
    string version = 2;
  }

  // The data storage format
  //
  // This specifies what format is used to store the data files.
  DataStorageFormat data_format = 15;

  // Table config.
  //
  // Keys with the prefix "lance." are reserved for the Lance library. Other
  // libraries may wish to similarly prefix their configuration keys
  // appropriately.
  map<string, string> config = 16;

  // The version of the blob dataset associated with this table.  Changes to
  // blob fields will modify the blob dataset and update this version in the parent
  // table.
  //
  // If this value is 0 then there are no blob fields.
  uint64 blob_dataset_version = 17;

} // Manifest

Fragments

A DataFragment represents a chunk of data in the dataset. It includes one or more DataFiles, each of which can contain several of the columns in the chunk of data. It may also include a DeletionFile, which is explained in a later section.

// Data fragment. A fragment is a set of files which represent the
// different columns of the same rows.
// If a column exists in the schema, but the related file does not exist,
// treat this column as nulls.
message DataFragment {
  // Unique ID of each DataFragment
  uint64 id = 1;

  repeated DataFile files = 2;

  // File that indicates which rows, if any, should be considered deleted.
  DeletionFile deletion_file = 3;

  // TODO: What's the simplest way we can allow an inline tombstone bitmap?

  // A serialized RowIdSequence message (see rowids.proto).
  //
  // These are the row ids for the fragment, in order of the rows as they appear.
  // That is, if a fragment has 3 rows, and the row ids are [1, 42, 3], then the
  // first row is row 1, the second row is row 42, and the third row is row 3.
  oneof row_id_sequence {
    // If small (< 200KB), the row ids are stored inline.
    bytes inline_row_ids = 5;
    // Otherwise, stored as part of a file.
    ExternalFile external_row_ids = 6;
  } // row_id_sequence

  // Number of original rows in the fragment; this includes rows that are
  // now marked with deletion tombstones. To compute the current number of rows,
  // subtract `deletion_file.num_deleted_rows` from this value.
  uint64 physical_rows = 4;
}

// Lance Data File
message DataFile {
  // Relative path to the root.
  string path = 1;
  // The ids of the fields/columns in this file.
  //
  // -1 is used for "unassigned" while in memory. It is not meant to be written
  // to disk. -2 is used for "tombstoned", meaning a field that is no longer
  // in use. This is often because the original field id was reassigned to a
  // different data file.
  //
  // In Lance v1 IDs are assigned based on position in the file, offset by the max
  // existing field id in the table (if any already). So when a fragment is first
  // created with one file of N columns, the field ids will be 1, 2, ..., N. If a
  // second fragment is created with M columns, the field ids will be N+1, N+2,
  // ..., N+M.
  //
  // In Lance v1 there is one field for each field in the input schema, this includes
  // nested fields (both struct and list).  Fixed size list fields have only a single
  // field id (these are not considered nested fields in Lance v1).
  //
  // This allows column indices to be calculated from field IDs and the input schema.
  //
  // In Lance v2 the field IDs generally follow the same pattern but there is no
  // way to calculate the column index from the field ID.  This is because a given
  // field could be encoded in many different ways, some of which occupy a different
  // number of columns.  For example, a struct field could be encoded into N + 1 columns
  // or it could be encoded into a single packed column.  To determine column indices
  // the column_indices property should be used instead.
  //
  // In Lance v1 these ids must be sorted but might not always be contiguous.
  repeated int32 fields = 2;
  // The top-level column indices for each field in the file.
  //
  // If the data file is version 1 then this property will be empty.
  //
  // Otherwise there must be one entry for each field in `fields`.
  //
  // Some fields may not correspond to a top-level column in the file.  In these cases
  // the index will be -1.
  //
  // For example, consider the schema:
  //
  // - dimension: packed-struct (0):
  //   - x: u32 (1)
  //   - y: u32 (2)
  // - path: list<u32> (3)
  // - embedding: fsl<768> (4)
  //   - fp64
  // - borders: fsl<4> (5)
  //   - simple-struct (6)
  //     - margin: fp64 (7)
  //     - padding: fp64 (8)
  //
  // One possible column indices array could be:
  // [0, -1, -1, 1, 3, 4, 5, 6, 7]
  //
  // This reflects quite a few phenomena:
  // - The packed struct is encoded into a single column and there is no top-level column
  //   for the x or y fields
  // - The variable sized list is encoded into two columns
  // - The embedding is encoded into a single column (common for FSL of primitive) and there
  //   is no "FSL column"
  // - The borders field actually does have an "FSL column"
  //
  // The column indices table may not have duplicates (other than -1)
  repeated int32 column_indices = 3;
  // The major file version used to create the file
  uint32 file_major_version = 4;
  // The minor file version used to create the file
  //
  // If both `file_major_version` and `file_minor_version` are set to 0,
  // then this is a version 0.1 or version 0.2 file.
  uint32 file_minor_version = 5;
} // DataFile

The overall structure of a fragment is shown below. One or more data files store the columns of a fragment. New columns can be added to a fragment by adding new data files. The deletion file (if present) stores the rows that have been deleted from the fragment.

_images/fragment_structure.png

Every row has a unique id, which is a u64 that is composed of two u32s: the fragment id and the local row id. The local row id is just the index of the row in the data files.
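
For illustration, such an id can be packed and unpacked as in the following minimal sketch (the function names are hypothetical and not part of the Lance API):

# Pack the fragment id into the upper 32 bits and the local row id into the
# lower 32 bits of a u64, and recover them again.
def make_row_id(fragment_id: int, local_row_id: int) -> int:
    assert 0 <= fragment_id < 2**32 and 0 <= local_row_id < 2**32
    return (fragment_id << 32) | local_row_id

def split_row_id(row_id: int) -> tuple[int, int]:
    return row_id >> 32, row_id & 0xFFFFFFFF

assert split_row_id(make_row_id(42, 9)) == (42, 9)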

File Structure

Each .lance file is the container for the actual data.

_images/format_overview.png

At the tail of the file, ColumnMetadata protobuf blocks are used to describe the encoding of the columns in the file.

// ## Metadata

// Each column has a metadata block that is placed at the end of the file.
// These may be read individually to allow for column projection.
message ColumnMetadata {

  // This describes a page of column data.
  message Page {
    // The file offsets for each of the page buffers
    //
    // The number of buffers is variable and depends on the encoding.  There
    // may be zero buffers (e.g. constant encoded data) in which case this
    // could be empty.
    repeated uint64 buffer_offsets = 1;
    // The size (in bytes) of each of the page buffers
    //
    // This field will have the same length as `buffer_offsets` and
    // may be empty.
    repeated uint64 buffer_sizes = 2;
    // Logical length (e.g. # rows) of the page
    uint64 length = 3;
    // The encoding used to encode the page
    Encoding encoding = 4;
    // The priority of the page
    //
    // For tabular data this will be the top-level row number of the first row
    // in the page (and top-level rows should not split across pages).
    uint64 priority = 5;
  }
  // Encoding information about the column itself.  This typically describes
  // how to interpret the column metadata buffers.  For example, it could
  // describe how statistics or dictionaries are stored in the column metadata.
  Encoding encoding = 1;
  // The pages in the column
  repeated Page pages = 2;
  // The file offsets of each of the column metadata buffers
  //
  // There may be zero buffers.
  repeated uint64 buffer_offsets = 3;
  // The size (in bytes) of each of the column metadata buffers
  //
  // This field will have the same length as `buffer_offsets` and
  // may be empty.
  repeated uint64 buffer_sizes = 4;
} // Metadata-End

A Footer describes the overall layout of the file. The entire file layout is described here:

// ## File Layout
//
// Note: the number of buffers (BN) is independent of the number of columns (CN)
//       and pages.
//
//       Buffers often need to be aligned.  64-byte alignment is common when
//       working with SIMD operations.  4096-byte alignment is common when
//       working with direct I/O.  In order to ensure these buffers are aligned
//       writers may need to insert padding before the buffers.
//
//       If direct I/O is required then most (but not all) fields described
//       below must be sector aligned.  We have marked these fields with an
//       asterisk for clarity.  Readers should assume there will be optional
//       padding inserted before these fields.
//
//       All footer fields are unsigned integers written with little endian
//       byte order.
//
// ├──────────────────────────────────┤
// | Data Pages                       |
// |   Data Buffer 0*                 |
// |   ...                            |
// |   Data Buffer BN*                |
// ├──────────────────────────────────┤
// | Column Metadatas                 |
// | |A| Column 0 Metadata*           |
// |     Column 1 Metadata*           |
// |     ...                          |
// |     Column CN Metadata*          |
// ├──────────────────────────────────┤
// | Column Metadata Offset Table     |
// | |B| Column 0 Metadata Position*  |
// |     Column 0 Metadata Size       |
// |     ...                          |
// |     Column CN Metadata Position  |
// |     Column CN Metadata Size      |
// ├──────────────────────────────────┤
// | Global Buffers Offset Table      |
// | |C| Global Buffer 0 Position*    |
// |     Global Buffer 0 Size         |
// |     ...                          |
// |     Global Buffer GN Position    |
// |     Global Buffer GN Size        |
// ├──────────────────────────────────┤
// | Footer                           |
// | A u64: Offset to column meta 0   |
// | B u64: Offset to CMO table       |
// | C u64: Offset to GBO table       |
// |   u32: Number of global bufs     |
// |   u32: Number of columns         |
// |   u16: Major version             |
// |   u16: Minor version             |
// |   "LANC"                         |
// ├──────────────────────────────────┤
//
// File Layout-End
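
As an illustration of the layout above, the following sketch reads the fixed-size footer from the end of a file, assuming it occupies exactly the trailing 40 bytes (three u64 offsets, two u32 counts, two u16 versions, and the 4-byte "LANC" magic), all little endian. This is not the Lance implementation, only a reading of the layout diagram:

import struct

FOOTER_SIZE = 40  # 3*8 + 2*4 + 2*2 + 4 bytes

def read_footer(path: str) -> dict:
    with open(path, "rb") as f:
        f.seek(-FOOTER_SIZE, 2)          # seek relative to end of file
        raw = f.read(FOOTER_SIZE)
    (col_meta_0, cmo_table, gbo_table,
     num_global_bufs, num_columns,
     major, minor, magic) = struct.unpack("<QQQIIHH4s", raw)
    if magic != b"LANC":
        raise ValueError("not a Lance file")
    return {
        "column_meta_0_offset": col_meta_0,
        "column_metadata_offset_table": cmo_table,
        "global_buffer_offset_table": gbo_table,
        "num_global_buffers": num_global_bufs,
        "num_columns": num_columns,
        "version": (major, minor),
    }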

File Version

The Lance file format has gone through a number of changes including a breaking change from version 1 to version 2. There are a number of APIs that allow the file version to be specified. Using a newer version of the file format will lead to better compression and/or performance. However, older software versions may not be able to read newer files.

In addition, the latest version of the file format (next) is unstable and should not be used for production use cases. Breaking changes could be made to unstable encodings, which would mean that files written with those encodings are no longer readable by newer versions of Lance. The next version should only be used for experimentation and benchmarking of upcoming features.

The following values are supported:

File Versions

  0.1
    Minimal Lance version: Any
    Maximum Lance version: Any
    This is the initial Lance format.

  2.0
    Minimal Lance version: 0.16.0
    Maximum Lance version: Any
    Rework of the Lance file format that removed row groups and introduced
    null support for lists, fixed size lists, and primitives.

  2.1 (unstable)
    Minimal Lance version: None
    Maximum Lance version: Any
    Enhances integer and string compression, adds support for nulls in struct
    fields, and improves random access performance with nested fields.

  legacy
    Minimal Lance version: N/A
    Maximum Lance version: N/A
    Alias for 0.1.

  stable
    Minimal Lance version: N/A
    Maximum Lance version: N/A
    Alias for the latest stable version (currently 2.0).

  next
    Minimal Lance version: N/A
    Maximum Lance version: N/A
    Alias for the latest unstable version (currently 2.1).

File Encodings

Lance supports a variety of encodings for different data types. The encodings are chosen to give both random access and scan performance. Encodings are added over time and may be extended in the future. The manifest records a max format version which controls which encodings will be used. This allows for a gradual migration to a new data format so that old readers can still read new data while a migration is in progress.

Encodings are divided into “field encodings” and “array encodings”. Field encodings are consistent across an entire field of data, while array encodings are used for individual pages of data within a field. Array encodings can nest other array encodings (e.g. a dictionary encoding can bitpack the indices); however, array encodings cannot nest field encodings. For this reason, data types such as Dictionary<UInt8, List<String>> are not yet supported (since there is no dictionary field encoding).

Encodings Available

  Basic struct
    Encoding type: Field encoding
    What it does: Encodes non-nullable struct data
    Supported versions: >= 2.0
    When it is applied: Default encoding for structs

  List
    Encoding type: Field encoding
    What it does: Encodes lists (nullable or non-nullable)
    Supported versions: >= 2.0
    When it is applied: Default encoding for lists

  Basic Primitive
    Encoding type: Field encoding
    What it does: Encodes primitive data types using separate validity array
    Supported versions: >= 2.0
    When it is applied: Default encoding for primitive data types

  Value
    Encoding type: Array encoding
    What it does: Encodes a single vector of fixed-width values
    Supported versions: >= 2.0
    When it is applied: Fallback encoding for fixed-width types

  Binary
    Encoding type: Array encoding
    What it does: Encodes a single vector of variable-width data
    Supported versions: >= 2.0
    When it is applied: Fallback encoding for variable-width types

  Dictionary
    Encoding type: Array encoding
    What it does: Encodes data using a dictionary array and an indices array, which is useful for large data types with few unique values
    Supported versions: >= 2.0
    When it is applied: Used on string pages with fewer than 100 unique elements

  Packed struct
    Encoding type: Array encoding
    What it does: Encodes a struct with fixed-width fields in a row-major format, making random access more efficient
    Supported versions: >= 2.0
    When it is applied: Only used on struct types if the field metadata attribute "packed" is set to "true"

  Fsst
    Encoding type: Array encoding
    What it does: Compresses binary data by identifying common substrings (of 8 bytes or less) and encoding them as symbols
    Supported versions: >= 2.1
    When it is applied: Used on string pages that are not dictionary encoded

  Bitpacking
    Encoding type: Array encoding
    What it does: Encodes a single vector of fixed-width values using bitpacking, which is useful for integral types that do not span the full range of values
    Supported versions: >= 2.1
    When it is applied: Used on integral types

Feature Flags

As the file format and dataset evolve, new feature flags are added to the format. There are two separate fields for checking feature flags, depending on whether you are trying to read or write the table. Readers should check reader_feature_flags to see if there are any flags they are not aware of. Writers should check writer_feature_flags. If either sees a flag it does not recognize, it should return an “unsupported” error on any read or write operation.
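
For example, a reader-side check might look like the following sketch. The flag values are taken from the manifest comments above; the constant and function names are hypothetical, not part of the Lance API:

# Union of all reader feature flags this (hypothetical) reader understands:
# 1 (deletion files), 2 (move-stable row ids), 4 (v2 format), 8 (table config).
KNOWN_READER_FLAGS = 0b1111

def check_can_read(reader_feature_flags: int) -> None:
    unknown = reader_feature_flags & ~KNOWN_READER_FLAGS
    if unknown:
        raise NotImplementedError(
            f"unsupported: manifest requires unknown reader feature flags {unknown:#x}"
        )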

Fields

Fields represent the metadata for a column. This includes the name, data type, id, nullability, and encoding.

Fields are listed in depth first order, and can be one of (1) parent (struct), (2) repeated (list/array), or (3) leaf (primitive). For example, the schema:

a: i32
b: struct {
    c: list<i32>
    d: i32
}

Would be represented as the following field list:

name   id   type       parent_id   logical_type
----   --   --------   ---------   ------------
a      1    LEAF       0           "int32"
b      2    PARENT     0           "struct"
b.c    3    REPEATED   2           "list"
b.c    4    LEAF       3           "int32"
b.d    5    LEAF       2           "int32"

Dataset Update and Schema Evolution

Lance supports fast dataset updates and schema evolution by manipulating the Manifest metadata.

Appending is done by adding new Fragments to the dataset. Adding columns is done by adding new DataFiles containing the new columns to each Fragment. Finally, overwriting a dataset is done by resetting the Fragment list of the Manifest.

_images/schema_evolution.png

Deletion

Rows can be marked deleted by adding a deletion file next to the data in the _deletions folder. These files contain the indices of rows that have been deleted for some fragment. For a given version of the dataset, each fragment can have up to one deletion file. Fragments that have no deleted rows have no deletion file.

Readers should filter out row ids contained in these deletion files during a scan or ANN search.

Deletion files come in two flavors:

  1. Arrow files, which store a column with a flat vector of indices.

  2. Roaring bitmaps, which store the indices as compressed bitmaps.

Roaring Bitmaps are used for larger deletion sets, while Arrow files are used for small ones. This is because Roaring Bitmaps are known to be inefficient for small sets.

The filenames of deletion files are structured like:

_deletions/{fragment_id}-{read_version}-{random_id}.{arrow|bin}

Where fragment_id is the fragment the file corresponds to, read_version is the version of the dataset it was created from (usually one less than the version it was committed to), and random_id is a random u64 used to avoid collisions. The suffix is determined by the file type (.arrow for an Arrow file, .bin for a roaring bitmap).
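
A minimal sketch of constructing such a path (the helper name and the use of Python's random module are illustrative, not part of the Lance API):

import random

def deletion_file_path(fragment_id: int, read_version: int, use_bitmap: bool) -> str:
    # Draw a random u64 to avoid collisions between concurrent writers.
    random_id = random.getrandbits(64)
    suffix = "bin" if use_bitmap else "arrow"
    return f"_deletions/{fragment_id}-{read_version}-{random_id}.{suffix}"

# e.g. deletion_file_path(12, 7, use_bitmap=False)
#      -> "_deletions/12-7-<random_id>.arrow"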

// Deletion File
//
// The path of the deletion file is constructed as:
//   {root}/_deletions/{fragment_id}-{read_version}-{id}.{extension}
// where {extension} is `.arrow` or `.bin` depending on the type of deletion.
message DeletionFile {
  // Type of deletion file, which varies depending on what is the most efficient
  // way to store the deleted row offsets. If none, then will be unspecified. If there are
  // sparsely deleted rows, then ARROW_ARRAY is the most efficient. If there are
  // densely deleted rows, then BITMAP is the most efficient.
  enum DeletionFileType {
    // Deletion file is a single Int32Array of deleted row offsets. This is stored as
    // an Arrow IPC file with one batch and one column. Has a .arrow extension.
    ARROW_ARRAY = 0;
    // Deletion file is a Roaring Bitmap of deleted row offsets. Has a .bin extension.
    BITMAP = 1;
  }

  // Type of deletion file. If it is unspecified, then the remaining fields will be missing.
  DeletionFileType file_type = 1;
  // The version of the dataset this deletion file was built from.
  uint64 read_version = 2;
  // An opaque id used to differentiate this file from others written by concurrent
  // writers.
  uint64 id = 3;
  // The number of rows that are marked as deleted.
  uint64 num_deleted_rows = 4;
} // DeletionFile

Deletes can be materialized by re-writing data files with the deleted rows removed. However, this invalidates row indices and thus the ANN indices, which can be expensive to recompute.

Committing Datasets

A new version of a dataset is committed by writing a new manifest file to the _versions directory.

To prevent concurrent writers from overwriting each other, the commit process must be atomic and consistent for all writers. If two writers try to commit using different mechanisms, they may overwrite each other’s changes. For any storage system that natively supports atomic rename-if-not-exists or put-if-not-exists, these operations should be used. This is true of local file systems and cloud object stores, with the notable exception of AWS S3. For storage systems that lack this functionality, an external locking mechanism can be configured by the user.

Manifest Naming Schemes

Manifest files must use a consistent naming scheme. The names correspond to the versions. That way we can open the right version of the dataset without having to read all the manifests. It also makes it clear which file path is the next one to be written.

There are two naming schemes that can be used:

  1. V1: _versions/{version}.manifest. This is the legacy naming scheme.

  2. V2: _versions/{u64::MAX - version:020}.manifest. This is the new naming scheme. The version is zero-padded (to 20 digits) and subtracted from u64::MAX. This allows the versions to be sorted in descending order, making it possible to find the latest manifest on object storage using a single list call.

It is an error for there to be a mixture of these two naming schemes.
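
For example, the V2 name for a version can be computed as in the following sketch (the function name is hypothetical):

# The version is subtracted from u64::MAX and zero-padded to 20 digits so that
# an ascending lexicographic listing returns the newest manifest first.
U64_MAX = 2**64 - 1

def v2_manifest_path(version: int) -> str:
    return f"_versions/{U64_MAX - version:020d}.manifest"

# v2_manifest_path(1) -> "_versions/18446744073709551614.manifest"
# v2_manifest_path(2) -> "_versions/18446744073709551613.manifest"  (sorts first)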

Conflict resolution

If two writers try to commit at the same time, one will succeed and the other will fail. The failed writer should attempt to retry the commit, but only if its changes are compatible with the changes made by the successful writer.

The changes for a given commit are recorded as a transaction file, under the _transactions prefix in the dataset directory. The transaction file is a serialized Transaction protobuf message. See the transaction.proto file for its definition.

_images/conflict_resolution_flow.png

The commit process is as follows:

  1. The writer finishes writing all data files.

  2. The writer creates a transaction file in the _transactions directory. This file describes the operations that were performed, which is used for two purposes: (1) to detect conflicts, and (2) to re-build the manifest during retries.

  3. Look for any new commits since the writer started writing. If there are any, read their transaction files and check for conflicts. If there are any conflicts, abort the commit. Otherwise, continue.

  4. Build a manifest and attempt to commit it to the next version. If the commit fails because another writer has already committed, go back to step 3.

When checking whether two transactions conflict, be conservative. If the transaction file is missing, assume it conflicts. If the transaction file has an unknown operation, assume it conflicts.
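
The following sketch models this loop with an in-memory dict standing in for object storage (membership checks play the role of put-if-not-exists). The conflict rule and every name here are illustrative placeholders, not the Lance API:

import uuid

class CommitConflictError(Exception):
    pass

def conflicts(txn: dict, other: dict) -> bool:
    # Be conservative: unknown operations always conflict. The fragment-overlap
    # rule below is an illustrative stand-in, not Lance's actual semantics.
    known_ops = {"append", "delete", "overwrite"}
    if txn["op"] not in known_ops or other["op"] not in known_ops:
        return True
    if "overwrite" in (txn["op"], other["op"]):
        return True
    return bool(set(txn["fragments"]) & set(other["fragments"]))

def commit(store: dict, read_version: int, txn: dict) -> int:
    # Step 2: record the transaction file.
    store[f"_transactions/{read_version}-{uuid.uuid4()}.txn"] = txn
    version = read_version + 1
    while True:
        # Step 3: check every transaction committed since we started reading.
        for v in range(read_version + 1, version):
            other = store.get(f"_txn/{v}")
            if other is None or conflicts(txn, other):
                raise CommitConflictError(f"conflict with version {v}")
        # Step 4: put-if-not-exists on the next manifest path.
        manifest_path = f"_versions/{version}.manifest"
        if manifest_path not in store:
            store[manifest_path] = {"version": version}
            store[f"_txn/{version}"] = txn
            return version
        version += 1  # another writer won the race; retry the conflict check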

External Manifest Store

If the backing object store does not support *-if-not-exists operations, an external manifest store can be used to allow concurrent writers. An external manifest store is a KV store that supports a put-if-not-exists operation. The external manifest store supplements, but does not replace, the manifests in object storage. A reader unaware of the external manifest store could read a table that uses it, but it might be up to one version behind the true latest version of the table.

_images/external_store_commit.gif

The commit process is as follows:

  1. PUT_OBJECT_STORE mydataset.lance/_versions/{version}.manifest-{uuid}: stage a new manifest in the object store under a unique path determined by a new UUID.

  2. PUT_EXTERNAL_STORE base_uri, version, mydataset.lance/_versions/{version}.manifest-{uuid}: commit the path of the staged manifest to the external store.

  3. COPY_OBJECT_STORE mydataset.lance/_versions/{version}.manifest-{uuid} mydataset.lance/_versions/{version}.manifest: copy the staged manifest to the final path.

  4. PUT_EXTERNAL_STORE base_uri, version, mydataset.lance/_versions/{version}.manifest: update the external store to point to the final manifest.

Note that the commit is effectively complete after step 2. If the writer fails after step 2, a reader will be able to detect that the external store and object store are out of sync, and will try to synchronize the two stores. If the re-attempt at synchronization fails, the reader will refuse to load. This ensures that the dataset remains portable by simply copying the dataset directory, without any special tools.
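
The following sketch models the four commit steps with plain dicts standing in for the object store and the external KV store; all names are hypothetical, not the Lance API:

import uuid

def external_commit(object_store: dict, external_store: dict,
                    base_uri: str, version: int, manifest: bytes) -> None:
    staged = f"{base_uri}/_versions/{version}.manifest-{uuid.uuid4()}"
    final = f"{base_uri}/_versions/{version}.manifest"

    object_store[staged] = manifest                  # 1. stage the manifest
    key = (base_uri, version)
    if key in external_store:                        # 2. put-if-not-exists
        raise RuntimeError("another writer already committed this version")
    external_store[key] = staged                     #    commit point
    object_store[final] = manifest                   # 3. copy to the final path
    external_store[key] = final                      # 4. point to the final manifest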

_images/external_store_reader.gif

The reader load process is as follows:

  1. GET_EXTERNAL_STORE base_uri, version, path: if path does not end in a UUID, return the path.

  2. COPY_OBJECT_STORE mydataset.lance/_versions/{version}.manifest-{uuid} mydataset.lance/_versions/{version}.manifest: re-attempt synchronization.

  3. PUT_EXTERNAL_STORE base_uri, version, mydataset.lance/_versions/{version}.manifest: update the external store to point to the final manifest.

  4. RETURN mydataset.lance/_versions/{version}.manifest: always return the finalized path; return an error if synchronization fails.
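
A matching sketch of the reader-side synchronization, using the same dict-based stores as the commit sketch (all names hypothetical):

def external_load(object_store: dict, external_store: dict,
                  base_uri: str, version: int) -> str:
    final = f"{base_uri}/_versions/{version}.manifest"
    path = external_store[(base_uri, version)]       # 1. look up the manifest path
    if path == final:
        return path                                  #    already finalized
    # Out of sync: re-attempt synchronization.
    if path not in object_store:
        raise RuntimeError("staged manifest missing; refusing to load")
    object_store[final] = object_store[path]         # 2. copy to the final path
    external_store[(base_uri, version)] = final      # 3. update the external store
    return final                                     # 4. return the finalized path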

Statistics

Statistics are stored within Lance files. The statistics can be used to determine which pages can be skipped within a query. The null count, lower bound (min), and upper bound (max) are stored.

Statistics themselves are stored in Lance’s columnar format, which allows for selectively reading only relevant stats columns.

Statistic values

Three types of statistics are stored per column: null count, min value, max value. The min and max values are stored as their native data types in arrays.

There is special behavior for different data types to account for nulls:

For integer-based data types (including signed and unsigned integers, dates, and timestamps), if the min and max are unknown (all values are null), then the minimum/maximum representable values should be used instead.

For float data types, if the min and max are unknown, then use -Inf and +Inf, respectively. (-Inf and +Inf may also be used for min and max if those values are present in the arrays.) NaN values should be ignored for the purpose of min and max statistics. If the max value is zero (negative or positive), the max value should be recorded as +0.0. Likewise, if the min value is zero (positive or negative), it should be recorded as -0.0.

For binary data types, if the min or max are unknown or unrepresentable, then use a null value. Binary data type bounds can also be truncated. For example, an array containing just the value "abcd" could have a truncated min of "abc" and max of "abd". If there is no truncated value greater than the maximum value, then instead use null for the maximum.
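
As an illustration of the floating-point rules above, the following sketch computes a min/max pair for a list that may contain nulls and NaNs (the function name is hypothetical, not part of the Lance API):

import math

def float_min_max(values: list) -> tuple[float, float]:
    present = [v for v in values if v is not None and not math.isnan(v)]
    if not present:
        return float("-inf"), float("inf")  # all null or NaN: unknown bounds
    lo, hi = min(present), max(present)
    if lo == 0.0:
        lo = -0.0                           # a zero min is recorded as -0.0
    if hi == 0.0:
        hi = 0.0                            # a zero max is recorded as +0.0
    return lo, hi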

Warning

The min and max values are not guaranteed to be within the array; they are simply upper and lower bounds. Two common cases where they are not contained in the array are when the original min or max value was deleted and when binary data is truncated. Therefore, statistics should not be used to compute queries such as SELECT max(col) FROM table.

Page-level statistics format

Page-level statistics are stored as arrays within the Lance file. Each array contains one entry per page and is num_pages long. The page offsets are stored in an array just like the data page table. The offset to the statistics page table is stored in the metadata.

The schema for the statistics is:

<field_id_1>: struct
    null_count: i64
    min_value: <field_1_data_type>
    max_value: <field_1_data_type>
...
<field_id_N>: struct
    null_count: i64
    min_value: <field_N_data_type>
    max_value: <field_N_data_type>

Any number of fields may be missing, as statistics for some fields or of some kind may be skipped. In addition, readers should expect there may be extra fields that are not in this schema. These should be ignored. Future changes to the format may add additional fields, but these changes will be backwards compatible.

However, writers should not write extra fields that aren’t described in this document. Until they are defined in the specification, there is no guarantee that readers will be able to safely interpret new forms of statistics.

Feature: Move-Stable Row IDs

The move-stable row ids feature assigns a unique u64 id to each row in the table. This id is stable after being moved (such as during compaction), but is not necessarily stable after a row is updated. (A future feature may make them stable after updates.) To make access fast, a secondary index is created that maps row ids to their locations in the table. The respective parts of these indices are stored in the respective fragment’s metadata.

row id

A unique auto-incrementing u64 id assigned to each row in the table.

row address

The current location of a row in the table. This is a u64 that can be thought of as a pair of two u32 values: the fragment id and the local row offset. For example, if the row address is (42, 9), then the row is in the 42nd fragment and is the 10th row in that fragment.

row id sequence

The sequence of row ids in a fragment.

row id index

A secondary index that maps row ids to row addresses. This index is constructed by reading all the row id sequences.

Assigning row ids

Row ids are assigned in a monotonically increasing sequence. The next row id is stored in the manifest as the field next_row_id. This starts at zero. When making a commit, the writer uses that field to assign row ids to new fragments. If the commit fails, the writer will re-read the new next_row_id, update the new row ids, and then try again. This is similar to how the max_fragment_id is used to assign new fragment ids.
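
A minimal sketch of this assignment (the function name is hypothetical):

def assign_row_ids(next_row_id: int, num_new_rows: int) -> tuple[range, int]:
    """Return the row id sequence for the new rows and the updated next_row_id."""
    return range(next_row_id, next_row_id + num_new_rows), next_row_id + num_new_rows

# On a failed commit, the writer re-reads the committed manifest's next_row_id
# and calls assign_row_ids again before rebuilding its own manifest.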

When a row is updated, it is typically assigned a new row id rather than reusing the old one. This is because this feature doesn’t have a mechanism to update secondary indices that may reference the old values for the row id. By deleting the old row id and creating a new one, the secondary indices will avoid referencing stale data.

Row ID sequences

The row id values for a fragment are stored in a RowIdSequence protobuf message. This is described in the protos/rowids.proto file. Row id sequences are just arrays of u64 values, which have representations optimized for the common case where they are sorted and possibly contiguous. For example, a new fragment will have a row id sequence that is just a simple range, so it is stored as a start and end value.

These sequence messages are either stored inline in the fragment metadata, or are written to a separate file and referenced from the fragment metadata. This choice is typically made based on the size of the sequence. If the sequence is small, it is stored inline. If it is large, it is written to a separate file. By keeping the small sequences inline, we can avoid the overhead of additional IO operations.

  oneof row_id_sequence {
    // If small (< 200KB), the row ids are stored inline.
    bytes inline_row_ids = 5;
    // Otherwise, stored as part of a file.
    ExternalFile external_row_ids = 6;
  } // row_id_sequence

Row ID index

To ensure fast access to rows by their row id, a secondary index is created that maps row ids to their locations in the table. This index is built when a table is loaded, based on the row id sequences in the fragments. For example, if fragment 42 has a row id sequence of [0, 63, 10], then the index will have entries for 0 -> (42, 0), 63 -> (42, 1), 10 -> (42, 2). The exact form of this index is left up to the implementation, but it should be optimized for fast lookups.
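
A minimal sketch of building such an index from per-fragment row id sequences, using the example above; a plain dict stands in for whatever optimized structure an implementation chooses:

def build_row_id_index(sequences: dict[int, list[int]]) -> dict[int, tuple[int, int]]:
    index = {}
    for fragment_id, row_ids in sequences.items():
        for offset, row_id in enumerate(row_ids):
            index[row_id] = (fragment_id, offset)  # row id -> row address
    return index

assert build_row_id_index({42: [0, 63, 10]}) == {0: (42, 0), 63: (42, 1), 10: (42, 2)}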