Legacy

The legacy with_embeddings API is for Python only and is deprecated.

Hugging Face

The most popular open-source option is the sentence-transformers library, which you can install with pip.

pip install sentence-transformers

The example below shows how to use the paraphrase-albert-small-v2 model to generate embeddings for a given document.

from sentence_transformers import SentenceTransformer

name = "paraphrase-albert-small-v2"
model = SentenceTransformer(name)

# used for both training and querying
def embed_func(batch):
    return [model.encode(sentence) for sentence in batch]

OpenAI

Another popular alternative is to use an external API like OpenAI's embeddings API.

import openai
import os

# The OpenAI client reads the API key from the OPENAI_API_KEY environment variable
if "OPENAI_API_KEY" not in os.environ:
    # OR set the key here
    os.environ["OPENAI_API_KEY"] = "sk-..."

client = openai.OpenAI()

def embed_func(c):
    # c is a batch (list) of strings; the response holds one embedding per input
    rs = client.embeddings.create(input=c, model="text-embedding-ada-002")
    return [record.embedding for record in rs.data]

Applying an embedding function to data

You can apply an embedding function to raw data to generate an embedding for each record.

Say you have a pandas DataFrame with a text column that you want to embed. You can use the with_embeddings function to generate the embeddings, then add the result to a table.

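    import lancedb
    import pandas as pd
    from lancedb.embeddings import with_embeddings

    # Assumed local database path; point this at your own database
    db = lancedb.connect("data/sample-lancedb")

    df = pd.DataFrame(
        [
            {"text": "pepperoni"},
            {"text": "pineapple"}
        ]
    )
    # Embeds the "text" column and returns the data with the vectors attached
    data = with_embeddings(embed_func, df)

    # The output is used to create / append to a table
    tbl = db.create_table("my_table", data=data)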

If your data is in a column other than text, pass the column keyword argument to with_embeddings.

By default, LanceDB calls the function with batches of 1000 rows. This can be configured using the batch_size parameter to with_embeddings.
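To make the batching behavior concrete, here is a minimal sketch of the idea in plain Python. It is not LanceDB's implementation: embed_in_batches and toy_embed are hypothetical names used only for illustration, and the toy function just records how many rows it receives per call.

```python
def embed_in_batches(func, texts, batch_size=1000):
    """Call func on consecutive slices of texts and concatenate the results.

    Hypothetical helper that illustrates batching; not part of LanceDB.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(func(batch))
    return vectors


calls = []  # records the size of every batch the function receives


def toy_embed(batch):
    """Stand-in embedding function: one 1-dimensional vector per string."""
    calls.append(len(batch))
    return [[float(len(s))] for s in batch]


texts = [f"doc {i}" for i in range(2500)]
vectors = embed_in_batches(toy_embed, texts, batch_size=1000)
print(calls)  # [1000, 1000, 500]
```

A smaller batch size means more calls with less data per call, which can help when the embedding function wraps an API that limits request size.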

LanceDB automatically wraps the function with retry and rate-limit logic to make calls to the OpenAI API more reliable.
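For intuition about what retry logic around an embedding function does, the sketch below wraps a function so that failed calls are retried with exponential backoff. This is an illustration under stated assumptions, not LanceDB's actual wrapper: with_retry, its parameters, and flaky_embed are all hypothetical.

```python
import time


def with_retry(func, max_attempts=3, base_delay=0.01):
    """Return a wrapper that retries func with exponential backoff.

    Hypothetical helper for illustration; not LanceDB's implementation.
    """
    def wrapped(batch):
        for attempt in range(1, max_attempts + 1):
            try:
                return func(batch)
            except Exception:
                if attempt == max_attempts:
                    raise  # out of attempts, surface the last error
                # Wait longer after each failure: base_delay, 2x, 4x, ...
                time.sleep(base_delay * 2 ** (attempt - 1))
    return wrapped


attempts = {"count": 0}


def flaky_embed(batch):
    """Stand-in for an API call that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient error")
    return [[0.0] for _ in batch]


result = with_retry(flaky_embed)(["pepperoni", "pineapple"])
print(attempts["count"])  # 3
```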

Querying using an embedding function

Warning

At query time, you must use the same embedding function you used to vectorize your data. If you use a different embedding function, the embeddings will not reside in the same vector space and the results will be nonsensical.
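The toy sketch below illustrates why this matters. It uses two made-up embedding functions (embed_a and embed_b are hypothetical and far simpler than a real model): both produce 26-dimensional vectors, but they lay out the dimensions differently, so a vector from one cannot be compared meaningfully with a vector from the other.

```python
import math
import string


def embed_a(batch):
    """Toy embedding: counts of the letters a..z in each string."""
    return [[float(s.count(ch)) for ch in string.ascii_lowercase] for s in batch]


def embed_b(batch):
    """A different toy embedding: the same counts, but in z..a order."""
    return [vec[::-1] for vec in embed_a(batch)]


# The stored vector was produced with embed_a
stored = embed_a(["pepperoni"])[0]

# Querying with the same function: the identical text is at distance 0
same = math.dist(embed_a(["pepperoni"])[0], stored)

# Querying with a different function: the identical text no longer matches
different = math.dist(embed_b(["pepperoni"])[0], stored)

print(same, different > 0)
```

Even though both functions return vectors of the same length, only the query embedded with embed_a lands on the stored vector.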

Python

query = "What's the best pizza topping?"
query_vector = embed_func([query])[0]
results = (
    tbl.search(query_vector)
    .limit(10)
    .to_pandas()
)

The above snippet returns a pandas DataFrame with the 10 closest vectors to the query.
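Conceptually, "closest" means smallest distance between the query vector and each stored vector. The brute-force sketch below shows the idea in plain Python, assuming Euclidean (L2) distance. The top_k helper, the "distance" key, and the hand-made 2-dimensional vectors are illustrative assumptions, not LanceDB's search implementation or output format.

```python
import math


def top_k(query_vector, rows, k):
    """Return the k rows closest to query_vector by Euclidean (L2) distance.

    Hypothetical brute-force helper for illustration only.
    """
    scored = [
        {**row, "distance": math.dist(query_vector, row["vector"])}
        for row in rows
    ]
    return sorted(scored, key=lambda r: r["distance"])[:k]


# Hand-made toy vectors, not real embeddings
rows = [
    {"text": "pepperoni", "vector": [1.0, 0.0]},
    {"text": "pineapple", "vector": [0.0, 1.0]},
    {"text": "mushroom", "vector": [0.9, 0.1]},
]

nearest = top_k([1.0, 0.0], rows, k=2)
print([r["text"] for r in nearest])  # ['pepperoni', 'mushroom']
```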