Blob types

Geneva supports UDFs that take Lance Blobs (large binary objects) as input and can write out columns whose binary values are encoded as Lance Blobs. Lance Blobs are an optimization intended for large objects (single megabytes up to hundreds of megabytes); they provide a file-like object that lazily reads the underlying binary data.

Reading Blobs

Defining functions that read blob columns is straightforward.

For scalar UDFs, blob column values are passed in as BlobFile objects:

from geneva import udf
from lance.blob import BlobFile

@udf
def work_on_udf(blob: BlobFile) -> int:
    assert isinstance(blob, BlobFile)
    # BlobFile is file-like; read() pulls the binary contents on demand.
    data = blob.read()
    # do something interesting with the data
    return len(data)
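
Because BlobFile is file-like, a UDF can read just the bytes it needs instead of materializing the whole object. A minimal sketch, assuming a scalar UDF whose return type is inferred from the bool type hint, that checks for a PNG signature by reading only the first eight bytes:

from geneva import udf
from lance.blob import BlobFile

@udf
def is_png(blob: BlobFile) -> bool:
    # Read only the first 8 bytes to check the PNG signature; the rest of
    # the blob is never loaded into memory.
    header = blob.read(8)
    return header == b"\x89PNG\r\n\x1a\n"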

TODO: For batched pa.Array UDFs and for RecordBatch UDFs, blob columns are dereferenced and presented as bytes.
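
As an illustration, a RecordBatch UDF that receives dereferenced blob bytes might compute per-row sizes. This is a minimal sketch assuming the blob column arrives as a large_binary column named "blob_col" (the column name is hypothetical):

import pyarrow as pa
from geneva import udf

@udf(data_type=pa.int64())
def blob_sizes(batch: pa.RecordBatch) -> pa.Array:
    sizes = []
    for value in batch.column("blob_col"):
        data = value.as_py()  # the dereferenced blob as Python bytes
        sizes.append(len(data) if data is not None else 0)
    return pa.array(sizes, type=pa.int64())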


Writing Blobs

Defining UDFs that write Blobs out to a new column is also straightforward. Add the blob-encoding metadata annotation to the UDF so that Geneva knows to write the output column as Blobs.

For scalar UDFs, return bytes from the UDF, explicitly set the data_type to pa.large_binary(), and add the field_metadata that specifies blob encoding:

import pyarrow as pa
from geneva import udf

@udf(data_type=pa.large_binary(), field_metadata={"lance-encoding:blob": "true"})
def generate_blob(text: str, multiplier: int) -> bytes:
    """UDF that generates blob data by repeating text."""
    return (text * multiplier).encode("utf-8")
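
The field_metadata argument ends up as Arrow field metadata on the output column. For reference, the resulting field is equivalent to declaring it by hand with plain pyarrow (the column name here is hypothetical):

import pyarrow as pa

blob_field = pa.field(
    "generated_blob",  # hypothetical output column name
    pa.large_binary(),
    metadata={"lance-encoding:blob": "true"},
)
print(blob_field.metadata)  # {b'lance-encoding:blob': b'true'}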

For pa.RecordBatch batched UDFs, the effort is similar:

import json

import pyarrow as pa
from geneva import udf

@udf(data_type=pa.large_binary(), field_metadata={"lance-encoding:blob": "true"})
def batch_to_blob(batch: pa.RecordBatch) -> pa.Array:
    """UDF that converts RecordBatch rows to blob data."""
    blobs = []
    for i in range(batch.num_rows):
        # Do something that returns bytes; here each row is serialized to JSON.
        row = {name: batch.column(name)[i].as_py() for name in batch.schema.names}
        blob_data = json.dumps(row).encode("utf-8")
        blobs.append(blob_data)
    return pa.array(blobs, type=pa.large_binary())
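
To actually materialize a blob column, register the UDF against a table and backfill it. A minimal sketch, assuming Geneva's connect / open_table / add_columns / backfill workflow and hypothetical dataset, table, and column names; consult the Geneva API docs for the exact calls in your version:

import geneva

db = geneva.connect("db://my_dataset")        # hypothetical connection URI
tbl = db.open_table("my_table")               # hypothetical table name
tbl.add_columns({"row_blob": batch_to_blob})  # hypothetical column name
tbl.backfill("row_blob")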