Versioning & Reproducibility¶
Reproducibility is critical for AI. For code, it's easy to track changes using GitHub or GitLab. For data, it's much harder: most of the time we end up hand-writing complicated data-tracking code, wrestling with an external tool, or paying for expensive, coarse-grained duplicate snapshots.
With most other vector databases, if we load in the wrong data (or make any other such mistake), we have to blow away the index, correct the mistake, and then completely rebuild it. It's really difficult to roll back to an earlier state, and any such corrective action destroys historical data and evidence, which may be useful down the line to debug and diagnose issues.
To our knowledge, LanceDB is the first and only vector database that supports full reproducibility and rollbacks natively. Taking advantage of the Lance columnar data format, LanceDB supports:
- Automatic versioning
- Instant rollback
- Appends, updates, deletions
- Schema evolution
This makes auditing, tracking, and reproducibility a breeze!
Let's see how this all works.
Pickle Rick!¶
We'll start with a local LanceDB connection:
import lancedb
db = lancedb.connect("~/.lancedb")
We've got a CSV file with a bunch of quotes from Rick and Morty
!head rick_and_morty_quotes.csv
id,quote,author
1,"Nobody exists on purpose. Nobody belongs anywhere.",Morty
2,"We're all going to die. Come watch TV.",Morty
3,"Losers look stuff up while the rest of us are carpin' all them diems.",Summer
4,"He's not a hot girl. He can't just bail on his life and set up shop in someone else's.",Beth
5,"When you are an a—hole, it doesn't matter how right you are. Nobody wants to give you the satisfaction.",Morty
6,"God's turning people into insect monsters, Beth. I'm the one beating them to death. Thank me.",Jerry
7,"Camping is just being homeless without the change.",Summer
8,"This seems like a good time for a drink and a cold, calculated speech with sinister overtones. A speech about politics, about order, brotherhood, power ... but speeches are for campaigning. Now is the time for action.",Morty
9,"Having a family doesn't mean that you stop being an individual. You know the best thing you can do for the people that depend on you? Be honest with them, even if it means setting them free.",Mr. Meeseeks
Let's load this into a pandas dataframe. It has three columns: a quote id, the quote string, and the first name of the quote's author:
import pandas as pd
df = pd.read_csv("rick_and_morty_quotes.csv")
df.head()
|    | id | quote | author |
|---|---|---|---|
| 0 | 1 | Nobody exists on purpose. Nobody belongs anywh... | Morty |
| 1 | 2 | We're all going to die. Come watch TV. | Morty |
| 2 | 3 | Losers look stuff up while the rest of us are ... | Summer |
| 3 | 4 | He's not a hot girl. He can't just bail on his... | Beth |
| 4 | 5 | When you are an a—hole, it doesn't matter how ... | Morty |
Creating a LanceDB table from a pandas dataframe is straightforward using `create_table`:
db.drop_table("rick_and_morty", ignore_missing=True)
table = db.create_table("rick_and_morty", df)
table.head().to_pandas()
|    | id | quote | author |
|---|---|---|---|
| 0 | 1 | Nobody exists on purpose. Nobody belongs anywh... | Morty |
| 1 | 2 | We're all going to die. Come watch TV. | Morty |
| 2 | 3 | Losers look stuff up while the rest of us are ... | Summer |
| 3 | 4 | He's not a hot girl. He can't just bail on his... | Beth |
| 4 | 5 | When you are an a—hole, it doesn't matter how ... | Morty |
Updates¶
Now, since Rick is the smartest man in the multiverse, he deserves to have his quotes attributed to his full name: Richard Daniel Sanchez.
This can be done via `LanceTable.update`. It needs two arguments:

- A `where` string filter (SQL syntax) to determine the rows to update
- A dict of `values` where the keys are the column names to update and the values are the new values
table.update(where="author='Rick'", values={"author": "Richard Daniel Sanchez"})
table.to_pandas()
|    | id | quote | author |
|---|---|---|---|
| 0 | 1 | Nobody exists on purpose. Nobody belongs anywh... | Morty |
| 1 | 2 | We're all going to die. Come watch TV. | Morty |
| 2 | 3 | Losers look stuff up while the rest of us are ... | Summer |
| 3 | 4 | He's not a hot girl. He can't just bail on his... | Beth |
| 4 | 5 | When you are an a—hole, it doesn't matter how ... | Morty |
| ... | ... | ... | ... |
| 56 | 57 | If I let you make me nervous, then we can't ge... | Richard Daniel Sanchez |
| 57 | 58 | Oh, boy, so you actually learned something tod... | Richard Daniel Sanchez |
| 58 | 59 | I can't abide bureaucracy. I don't like being ... | Richard Daniel Sanchez |
| 59 | 60 | I think you have to think ahead and live in th... | Richard Daniel Sanchez |
| 60 | 61 | I know that new situations can be intimidating... | Richard Daniel Sanchez |

61 rows × 3 columns
Schema evolution¶
OK, so this is a vector database, which means we need actual vectors. We'll use Sentence Transformers here to avoid having to deal with API keys and all that.
Let's create a basic model using the `all-MiniLM-L6-v2` checkpoint and embed the quotes:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
vectors = model.encode(df.quote.values.tolist(),
                       convert_to_numpy=True,
                       normalize_embeddings=True).tolist()
We can then convert the vectors into a pyarrow Table and merge it into the LanceDB Table.
For the merge to succeed, we need an overlapping column; here the natural choice is the `id` column:
from lance.vector import vec_to_table
import numpy as np
import pyarrow as pa
embeddings = vec_to_table(vectors)
embeddings = embeddings.append_column("id", pa.array(np.arange(len(table))+1))
embeddings.to_pandas().head()
|    | vector | id |
|---|---|---|
| 0 | [0.044295236, -0.0831885, -0.03597761, -0.0396... | 1 |
| 1 | [0.057405394, -0.09669633, 0.00515391, -0.0213... | 2 |
| 2 | [0.057896998, -0.033441037, 0.01376669, -0.015... | 3 |
| 3 | [0.038649295, 0.01286428, -0.03261163, 0.01939... | 4 |
| 4 | [0.07633445, 0.03451182, -0.0037649637, 0.0203... | 5 |
And now we'll use the `LanceTable.merge` function to add the vector column to the LanceTable:
table.merge(embeddings, left_on="id")
table.head().to_pandas()
|    | id | quote | author | vector |
|---|---|---|---|---|
| 0 | 1 | Nobody exists on purpose. Nobody belongs anywh... | Morty | [0.044295236, -0.0831885, -0.03597761, -0.0396... |
| 1 | 2 | We're all going to die. Come watch TV. | Morty | [0.057405394, -0.09669633, 0.00515391, -0.0213... |
| 2 | 3 | Losers look stuff up while the rest of us are ... | Summer | [0.057896998, -0.033441037, 0.01376669, -0.015... |
| 3 | 4 | He's not a hot girl. He can't just bail on his... | Beth | [0.038649295, 0.01286428, -0.03261163, 0.01939... |
| 4 | 5 | When you are an a—hole, it doesn't matter how ... | Morty | [0.07633445, 0.03451182, -0.0037649637, 0.0203... |
If we look at the schema, we see that `all-MiniLM-L6-v2` produces 384-dimensional vectors:
table.schema
id: int64
quote: string
author: string
vector: fixed_size_list<item: float>[384]
  child 0, item: float
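Before moving on, it's worth a quick sanity check that the vectors are actually searchable. This step isn't part of the original walkthrough, and the query string is just an illustrative choice:

# Illustrative sanity check: embed a query with the same model and run a
# vector search against the new column.
query = model.encode("existence is pain",
                     convert_to_numpy=True,
                     normalize_embeddings=True)
table.search(query).limit(3).to_pandas()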
Rollback¶
Suppose we used the table and found that the `all-MiniLM-L6-v2` model doesn't produce ideal results, and we want to try a larger model instead. How do we use the new embeddings without losing the change history?
First, note that major operations are automatically versioned in LanceDB. Version 1 is the table creation: it contains no rows but records the schema and metadata. Version 2 is the initial insertion of data. Versions 3 and 4 represent the update (a deletion plus an append), and version 5 is the merge that added the vector column.
table.list_versions()
[{'version': 1, 'timestamp': datetime.datetime(2023, 10, 20, 14, 33, 39, 40549), 'metadata': {}},
 {'version': 2, 'timestamp': datetime.datetime(2023, 10, 20, 14, 33, 39, 63675), 'metadata': {}},
 {'version': 3, 'timestamp': datetime.datetime(2023, 10, 20, 14, 33, 53, 979216), 'metadata': {}},
 {'version': 4, 'timestamp': datetime.datetime(2023, 10, 20, 14, 33, 53, 988601), 'metadata': {}},
 {'version': 5, 'timestamp': datetime.datetime(2023, 10, 20, 14, 35, 44, 475220), 'metadata': {}}]
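As an aside, if you only want to inspect an old version without changing the table, the underlying Lance format supports read-only time travel. A minimal sketch, assuming the table lives at the default local path used by our `lancedb.connect` call above:

import os
import lance
# Open version 4 of the underlying Lance dataset read-only; the path assumes
# the default local storage layout (one .lance directory per table).
ds = lance.dataset(os.path.expanduser("~/.lancedb/rick_and_morty.lance"), version=4)
ds.to_table().to_pandas().head()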
We can restore version 4, from before we added the vector column:
table.restore(4)
table.head().to_pandas()
|    | id | quote | author |
|---|---|---|---|
| 0 | 1 | Nobody exists on purpose. Nobody belongs anywh... | Morty |
| 1 | 2 | We're all going to die. Come watch TV. | Morty |
| 2 | 3 | Losers look stuff up while the rest of us are ... | Summer |
| 3 | 4 | He's not a hot girl. He can't just bail on his... | Beth |
| 4 | 5 | When you are an a—hole, it doesn't matter how ... | Morty |
Notice that we now have one more version, not one fewer. Restoring an old version doesn't delete the version history; it creates a new version whose schema and data are equivalent to the restored one. This way we keep track of all of the changes and can always roll back to a previous state.
table.list_versions()
[{'version': 1, 'timestamp': datetime.datetime(2023, 10, 20, 14, 33, 39, 40549), 'metadata': {}},
 {'version': 2, 'timestamp': datetime.datetime(2023, 10, 20, 14, 33, 39, 63675), 'metadata': {}},
 {'version': 3, 'timestamp': datetime.datetime(2023, 10, 20, 14, 33, 53, 979216), 'metadata': {}},
 {'version': 4, 'timestamp': datetime.datetime(2023, 10, 20, 14, 33, 53, 988601), 'metadata': {}},
 {'version': 5, 'timestamp': datetime.datetime(2023, 10, 20, 14, 35, 44, 475220), 'metadata': {}},
 {'version': 6, 'timestamp': datetime.datetime(2023, 10, 20, 14, 36, 15, 658370), 'metadata': {}}]
Switching Models¶
Now we'll switch to the `all-mpnet-base-v2` model and add the vectors to the restored dataset again:
model = SentenceTransformer("all-mpnet-base-v2", device="cpu")
vectors = model.encode(df.quote.values.tolist(),
                       convert_to_numpy=True,
                       normalize_embeddings=True).tolist()
embeddings = vec_to_table(vectors)
embeddings = embeddings.append_column("id", pa.array(np.arange(len(table))+1))
table.merge(embeddings, left_on="id")
table.schema
id: int64
quote: string
author: string
vector: fixed_size_list<item: float>[768]
  child 0, item: float
Deletion¶
What if the whole show was just Rick-isms? Let's delete any quote not said by Rick:
table.delete("author != 'Richard Daniel Sanchez'")
We can see that the number of rows has been reduced to 30:
len(table)
30
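As a quick check (not in the original notebook), we can confirm that only Rick's quotes survived the delete:

# Confirm only Rick's quotes remain after the delete.
table.to_pandas()["author"].unique()  # expected: ['Richard Daniel Sanchez']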
OK, we've had our fun; let's get back to the full quote set by restoring version 7, the merge that added the `all-mpnet-base-v2` vectors:
table.restore(7)
len(table)
61
History¶
We now have 9 versions in the data. We can review the operations that correspond to each version below:
table.version
9
Versions:
- 1 - Create
- 2 - Append
- 3 - Update (deletion)
- 4 - Update (append)
- 5 - Merge (vector column)
- 6 - Restore (4)
- 7 - Merge (new vector column)
- 8 - Deletion
- 9 - Restore (7)
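If you want this history programmatically, you can walk `list_versions()` yourself. A minimal sketch, assuming the dict layout shown earlier; the `labels` mapping is our own annotation of this walkthrough, not something LanceDB stores:

# Print each version with a human-readable label. LanceDB itself only stores
# the version number, timestamp, and metadata; the labels are ours.
labels = {1: "create", 2: "append", 3: "update (deletion)", 4: "update (append)",
          5: "merge (vector column)", 6: "restore (4)",
          7: "merge (new vector column)", 8: "deletion", 9: "restore (7)"}
for v in table.list_versions():
    print(v["version"], v["timestamp"].isoformat(), labels.get(v["version"], ""))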
Summary¶
We never had to explicitly manage versioning, and we never had to create expensive, slow snapshots. LanceDB automatically tracks the full history of the operations we performed and supports fast rollbacks. In production, this is critical for debugging issues and minimizing downtime: you can roll back to a previously successful state in seconds.