Apache DataFusion

In Python, LanceDB tables can also be queried with Apache DataFusion, an extensible query engine written in Rust that uses Apache Arrow as its in-memory format. This means you can write complex SQL queries to analyze your data in LanceDB.

This integration is built on the DataFusion FFI (foreign function interface), which connects LanceDB and DataFusion natively. Through the FFI, DataFusion can push column selections and basic filters down to LanceDB, reducing the amount of data scanned when your query executes. The integration also streams data from LanceDB tables, which makes it possible to run aggregations over data that is larger than memory.
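The ideas of pushdown and streaming can be illustrated with a small, library-free sketch. This is a toy model only, not how LanceDB or DataFusion implement them: the `ROWS` data, the `scan` function, and the `streaming_avg` helper are all hypothetical names invented for this illustration. The scan receives the requested columns and a filter, so unneeded columns and non-matching rows never leave the source, and the aggregation consumes one batch at a time instead of loading everything at once.

```python
# Toy in-memory "table" standing in for data on disk (hypothetical data).
ROWS = [
    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
    {"vector": [1.0, 2.0], "item": "baz", "price": 30.0},
]


def scan(columns, predicate=None, batch_size=2):
    """Yield batches containing only the requested columns of matching rows."""
    batch = []
    for row in ROWS:
        # Filter pushdown: rows that fail the predicate are dropped at the source.
        if predicate is not None and not predicate(row):
            continue
        # Projection pushdown: only the requested columns are materialized.
        batch.append({name: row[name] for name in columns})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def streaming_avg(batches, column):
    """Average a column while holding only one batch in memory at a time."""
    total, count = 0.0, 0
    for batch in batches:
        for row in batch:
            total += row[column]
            count += 1
    return total / count if count else None


# Roughly: SELECT AVG(price) FROM t WHERE price > 15
avg_price = streaming_avg(scan(["price"], lambda r: r["price"] > 15.0), "price")
print(avg_price)  # 25.0
```

In the real integration, the SQL query planner decides which columns and filters to push down; the sketch only shows why doing so reduces the data that has to be read and held in memory.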

To try the integration, first install datafusion and lancedb:

pip install datafusion lancedb

We will re-use the dataset created previously:

import lancedb

from datafusion import SessionContext
from lance import FFILanceTableProvider

db = lancedb.connect("data/sample-lancedb")
data = [
    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0}
]
lance_table = db.create_table("lance_table", data)

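# Create a DataFusion session
ctx = SessionContext()

# Wrap the underlying Lance dataset in an FFI table provider,
# also exposing the _rowid and _rowaddr columns
ffi_lance_table = FFILanceTableProvider(
    lance_table.to_lance(), with_row_id=True, with_row_addr=True
)
# Register the provider under the name used in SQL queries
ctx.register_table_provider("ffi_lance_table", ffi_lance_table)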

The to_lance method converts the LanceDB table to a LanceDataset, which DataFusion can access through the FFI integration layer. To query the dataset in DataFusion, first register it with the session context, then reference it by its registered name in your SQL query.

ctx.table("ffi_lance_table")
ctx.sql("SELECT * FROM ffi_lance_table")
┌─────────────┬─────────┬────────┬─────────────────┬─────────────────┐
│   vector    │  item   │ price  │ _rowid          │ _rowaddr        │
│   float[]   │ varchar │ double │ bigint unsigned │ bigint unsigned │
├─────────────┼─────────┼────────┼─────────────────┼─────────────────┤
│ [3.1, 4.1]  │ foo     │   10.0 │               0 │               0 │
│ [5.9, 26.5] │ bar     │   20.0 │               1 │               1 │
└─────────────┴─────────┴────────┴─────────────────┴─────────────────┘