Apache DataFusion
In Python, LanceDB tables can also be queried with Apache DataFusion, an extensible query engine written in Rust that uses Apache Arrow as its in-memory format. This means you can write complex SQL queries to analyze your data in LanceDB.
This integration is done via the DataFusion FFI, which provides a native integration between LanceDB and DataFusion. The FFI layer lets DataFusion push column selections and basic filters down to LanceDB, reducing the amount of data scanned when executing your query. The integration also streams data from LanceDB tables, which makes it possible to run aggregations over data that is larger than memory.
We can demonstrate this by first installing datafusion and lancedb.
We will re-use the dataset created previously:
import lancedb
from datafusion import SessionContext
from lance import FFILanceTableProvider

# Create (or open) the local database and write a small sample table.
db = lancedb.connect("data/sample-lancedb")
data = [
    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
]
lance_table = db.create_table("lance_table", data)

# Wrap the underlying Lance dataset in an FFI table provider and
# register it with DataFusion under the name "ffi_lance_table".
ctx = SessionContext()
ffi_lance_table = FFILanceTableProvider(
    lance_table.to_lance(), with_row_id=True, with_row_addr=True
)
ctx.register_table_provider("ffi_lance_table", ffi_lance_table)
The to_lance method converts the LanceDB table to a LanceDataset, which is accessible to DataFusion through the DataFusion FFI integration layer.
To query the resulting Lance dataset in DataFusion, first register the dataset with DataFusion, then reference it by the registered name in your SQL query.
┌─────────────┬─────────┬────────┬─────────────────┬─────────────────┐
│ vector      │ item    │ price  │ _rowid          │ _rowaddr        │
│ float[]     │ varchar │ double │ bigint unsigned │ bigint unsigned │
├─────────────┼─────────┼────────┼─────────────────┼─────────────────┤
│ [3.1, 4.1]  │ foo     │ 10.0   │ 0               │ 0               │
│ [5.9, 26.5] │ bar     │ 20.0   │ 1               │ 1               │
└─────────────┴─────────┴────────┴─────────────────┴─────────────────┘