Lance ❤️ Spark

Lance can be used as a third-party Spark data source (see https://spark.apache.org/docs/latest/sql-data-sources.html).

Warning

This feature is experimental and the APIs may change in the future.

Build from source code

git clone https://github.com/lancedb/lance.git
cd lance/java
mvn clean package -DskipTests -Drust.release.build=true

After the build finishes, the Spark-related jars are in lance/java/spark/target/jars/:

arrow-c-data-15.0.0.jar
arrow-dataset-15.0.0.jar
jar-jni-1.1.1.jar
lance-core-0.25.0-SNAPSHOT.jar
lance-spark-0.25.0-SNAPSHOT.jar

Download the pre-built jars

If you do not want to build the jars from source, you can download these five jars from the Maven repository.

wget https://repo1.maven.org/maven2/com/lancedb/lance-core/0.23.0/lance-core-0.23.0.jar
wget https://repo1.maven.org/maven2/com/lancedb/lance-spark/0.23.0/lance-spark-0.23.0.jar
wget https://repo1.maven.org/maven2/org/questdb/jar-jni/1.1.1/jar-jni-1.1.1.jar
wget https://repo1.maven.org/maven2/org/apache/arrow/arrow-c-data/12.0.1/arrow-c-data-12.0.1.jar
wget https://repo1.maven.org/maven2/org/apache/arrow/arrow-dataset/12.0.1/arrow-dataset-12.0.1.jar

Configurations for Lance Spark Connector

To enable the Lance data source, set the following configuration in spark-defaults.conf.

spark.sql.catalog.lance com.lancedb.lance.spark.LanceCatalog

This setting registers LanceCatalog under the catalog name lance, so that Spark treats Lance as a data source.

If the Lance datasets are stored in an object store, also set the following configurations:

spark.sql.catalog.lance.access_key_id {your object store ak}
spark.sql.catalog.lance.secret_access_key {your object store sk}
spark.sql.catalog.lance.aws_region {your object store region(optional)}
spark.sql.catalog.lance.aws_endpoint {your object store aws_endpoint which should be in virtual host style}
spark.sql.catalog.lance.virtual_hosted_style_request true
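
For a standalone application, rather than the Spark shell, the same keys can probably be set on the SparkSession builder instead of in spark-defaults.conf. The following is a minimal sketch under that assumption; the application name and the LANCE_* environment variable names are hypothetical and not part of Lance, and the Lance jars still have to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Object-store keys from the list above, each paired with a HYPOTHETICAL
// environment variable that holds its value.
val objectStoreKeys = Seq(
  "access_key_id"     -> "LANCE_ACCESS_KEY_ID",
  "secret_access_key" -> "LANCE_SECRET_ACCESS_KEY",
  "aws_region"        -> "LANCE_AWS_REGION",
  "aws_endpoint"      -> "LANCE_AWS_ENDPOINT"
)

val baseBuilder = SparkSession.builder()
  .appName("lance-example") // hypothetical application name
  .master("local[*]")
  // Same key and value as the spark-defaults.conf line above.
  .config("spark.sql.catalog.lance", "com.lancedb.lance.spark.LanceCatalog")

// Set an object-store key only when its environment variable is defined,
// so a purely local run needs no credentials.
val builder = objectStoreKeys.foldLeft(baseBuilder) { case (b, (key, envVar)) =>
  sys.env.get(envVar).fold(b)(value => b.config(s"spark.sql.catalog.lance.$key", value))
}

val spark = builder.getOrCreate()
```

Reading credentials from the environment keeps them out of source code; whether that suits a given deployment is a judgment call.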

Start the Spark Shell

bin/spark-shell --master "local[56]" --jars "/path_of_code/lance/java/spark/target/jars/*.jar"

Use --jars to include the jars that were built or downloaded above.

Note

The Spark shell console uses Scala, not Python.

Using the Spark Shell to manipulate Lance datasets

  • Write a new dataset named test.lance

val df = Seq(
  ("Alice", 1),
  ("Bob", 2)
).toDF("name", "id")
df.write.format("lance").option("path","./test.lance").save()

  • Overwrite the test.lance dataset

val df = Seq(
  ("Alice", 3),
  ("Bob", 4)
).toDF("name", "id")
df.write.format("lance").option("path","./test.lance").mode("overwrite").save()

  • Append data to the test.lance dataset

val df = Seq(
  ("Chris", 5),
  ("Derek", 6)
).toDF("name", "id")
df.write.format("lance").option("path","./test.lance").mode("append").save()

  • Use a Spark DataFrame to read the test.lance dataset

val data = spark.read.format("lance").option("path", "./test.lance").load()
data.show()

  • Register the DataFrame as a temporary view and use SQL to query the test.lance dataset

data.createOrReplaceTempView("lance_table")
spark.sql("select id, count(*) from lance_table group by id order by id").show()
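
The steps above can be combined into a short read, filter, and write-back round trip. This is a sketch only: it assumes the write, overwrite, and append steps were run in order against ./test.lance, it runs inside the Spark shell (where spark is already defined), and ./id_counts.lance is a hypothetical output path.

```scala
import org.apache.spark.sql.functions.col

// Load the dataset written by the earlier steps.
val data = spark.read.format("lance").option("path", "./test.lance").load()

// DataFrame-API equivalent of a SQL WHERE clause: keep rows with id >= 5.
val recent = data.filter(col("id") >= 5).select("name", "id")
recent.show()

// Count rows per id; rename the generated "count" column to "n".
val counts = data.groupBy("id").count().withColumnRenamed("count", "n")

// Write the aggregate to a new Lance dataset, using the same writer call
// as the first step. With no explicit mode this is expected to fail if
// the HYPOTHETICAL path ./id_counts.lance already exists.
counts.write.format("lance").option("path", "./id_counts.lance").save()
```

To rerun the sketch, delete ./id_counts.lance first or add .mode("overwrite") as in the overwrite step.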