Reading Lance Datasets

Basic Reading

Python:

df = (spark.read
    .format("lance")
    .option("db", "/path/to/lance/database")
    .option("dataset", "my_dataset")
    .load())

Scala:

val df = spark.read.
    format("lance").
    option("db", "/path/to/lance/database").
    option("dataset", "my_dataset").
    load()

Java:

Dataset<Row> df = spark.read()
    .format("lance")
    .option("db", "/path/to/lance/database")
    .option("dataset", "my_dataset")
    .load();
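
The result is a standard Spark DataFrame, so the usual DataFrame API applies from here. A minimal sketch (Python, continuing from the df loaded above):

# Inspect the loaded DataFrame
df.printSchema()  # column names and types
df.show(5)        # preview the first five rows
df.count()        # total number of rows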

Column Selection

Lance is a columnar format, so you can select only the columns you need and avoid reading the rest:

Python:

df = (spark.read
    .format("lance")
    .option("db", "/path/to/lance/database")
    .option("dataset", "my_dataset")
    .load()
    .select("id", "name", "age"))

Scala:

val df = spark.read.
    format("lance").
    option("db", "/path/to/lance/database").
    option("dataset", "my_dataset").
    load().
    select("id", "name", "age")

Java:

Dataset<Row> df = spark.read()
    .format("lance")
    .option("db", "/path/to/lance/database")
    .option("dataset", "my_dataset")
    .load()
    .select("id", "name", "age");
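
To check that the projection reaches the scan rather than being applied after a full read, you can inspect the physical plan; the scan node's output should contain only the selected columns. A quick sketch (Python, using the df from the selection above):

# Print the physical plan; the scan should list only id, name, and age
df.explain()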

Filters

You can apply filters to a read. Filters are pushed down to reduce the amount of data read:

Python:

from pyspark.sql.functions import col

filtered = (spark.read
    .format("lance")
    .option("db", "/path/to/database")
    .option("dataset", "users")
    .load()
    .filter(
        # Parenthesize each comparison: in Python, & binds tighter than ==
        col("age").between(25, 65) &
        (col("department") == "Engineering") &
        (col("is_active") == True)
    ))
Scala:

import org.apache.spark.sql.functions.col

val filtered = spark.read.
    format("lance").
    option("db", "/path/to/database").
    option("dataset", "users").
    load().
    filter(
        col("age").between(25, 65) &&
        col("department") === "Engineering" &&
        col("is_active") === true
    )
Java:

Dataset<Row> filtered = spark.read()
    .format("lance")
    .option("db", "/path/to/database")
    .option("dataset", "users")
    .load()
    .filter("age BETWEEN 25 AND 65 AND department = 'Engineering' AND is_active = true");