Consistency in LanceDB
In LanceDB OSS, users can set the `read_consistency_interval` parameter on connections to achieve different levels of read consistency. This parameter determines how frequently the database synchronizes with the underlying storage system to check for updates made by other processes. If another process updates a table, the database will not see the changes until the next synchronization.
There are three possible settings for `read_consistency_interval`:
- Unset (default): The database does not check for updates to tables made by other processes. This provides the best query performance, but means that clients may not see the most up-to-date data. This setting is suitable for applications where the data does not change during the lifetime of the table reference.
- Zero seconds (Strong consistency): The database checks for updates on every read. This provides the strongest consistency guarantees, ensuring that all clients see the latest committed data. However, it has the most overhead. This setting is suitable when consistency matters more than having high QPS.
- Custom interval (Eventual consistency): The database checks for updates at a custom interval, such as every 5 seconds. This provides eventual consistency, allowing for some lag between write and read operations. Performance-wise, this is a middle ground between strong consistency and no consistency checks. This setting is suitable for applications where immediate consistency is not critical, but clients should see updated data eventually.
Consistency in LanceDB Cloud
The `read_consistency_interval` parameter is only tunable in LanceDB OSS. In LanceDB Cloud, readers are always eventually consistent.
Configuring Consistency Parameters
To set strong consistency, use `timedelta(0)`:
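A minimal sketch of a strongly consistent connection (the local URI `./.lancedb` and the table name `my_table` are placeholders):

```python
import lancedb
from datetime import timedelta

# Check the underlying storage for updates on every read (strong consistency).
db = lancedb.connect("./.lancedb", read_consistency_interval=timedelta(0))
tbl = db.open_table("my_table")
```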
For eventual consistency, use a custom `timedelta`:
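For example, a sketch that allows reads to lag writes by up to five seconds (same placeholder URI and table name as above):

```python
import lancedb
from datetime import timedelta

# Check for updates at most once every 5 seconds (eventual consistency).
db = lancedb.connect("./.lancedb", read_consistency_interval=timedelta(seconds=5))
tbl = db.open_table("my_table")
```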
By default, a `Table` will never check for updates from other writers. To manually check for updates, you can use `checkout_latest`:
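A sketch of manually refreshing a table reference (placeholder URI and table name):

```python
import lancedb

# Default: no read_consistency_interval, so this table reference is never
# refreshed automatically.
db = lancedb.connect("./.lancedb")
tbl = db.open_table("my_table")

# ... another process commits new data to "my_table" ...

# Manually sync this reference to the latest committed version.
tbl.checkout_latest()
```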
Handling bad vectors
In LanceDB Python, you can use the `on_bad_vectors` parameter to choose how invalid vector values are handled. A vector is considered invalid if:
- They are the wrong dimension
- They contain NaN values
- They are null in a non-nullable field
By default, LanceDB will raise an error if it encounters a bad vector. You can also choose one of the following options:
- `drop`: Ignore rows with bad vectors.
- `fill`: Replace bad values (NaNs) or missing values (too few dimensions) with the fill value specified in the `fill_value` parameter. An input like `[1.0, NaN, 3.0]` will be replaced with `[1.0, 0.0, 3.0]` if `fill_value=0.0`.
- `null`: Replace bad vectors with null (only works if the column is nullable). A bad vector `[1.0, NaN, 3.0]` will be replaced with `null` if the column is nullable. If the vector column is non-nullable, then bad vectors will cause an error.