Blob As Files

Unlike many other data formats, the Lance columnar format treats large multimodal data as a first-class citizen. Lance provides a high-level API to store and retrieve large binary objects (blobs) in Lance datasets.


Lance serves large binary data using lance.BlobFile, which is a file-like object that lazily reads large binary objects.

class lance.BlobFile(inner: LanceBlobFile)

Bases: RawIOBase

Represents a blob in a Lance dataset as a file-like object.

close() -> None

Flush and close the IO object.

This method has no effect if the file is already closed.

readable() -> bool

Return whether object was opened for reading.

If False, read() will raise OSError.

readall() -> bytes

Read until EOF, using multiple read() calls.

seek(offset: int, whence: int = 0) -> int

Change the stream position to the given byte offset.

offset

The stream position, relative to ‘whence’.

whence

The relative position to seek from.

The offset is interpreted relative to the position indicated by whence. Values for whence are:

  • os.SEEK_SET or 0 – start of stream (the default); offset should be zero or positive

  • os.SEEK_CUR or 1 – current stream position; offset may be negative

  • os.SEEK_END or 2 – end of stream; offset is usually negative

Return the new absolute position.
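
The whence values above follow the semantics of Python's standard io module. A minimal sketch of how they behave, using io.BytesIO as an in-memory stand-in for a blob (an assumption made so the example runs without a dataset; it is not a lance.BlobFile):

```python
import io
import os

# io.BytesIO is only an in-memory stand-in for a blob here.
f = io.BytesIO(b"0123456789")

pos_start = f.seek(2, os.SEEK_SET)   # 2 bytes from the start of the stream
first = f.read(3)                    # reads 3 bytes and advances the position
pos_cur = f.seek(-1, os.SEEK_CUR)    # step back 1 byte from the current position
pos_end = f.seek(-3, os.SEEK_END)    # 3 bytes before the end of the stream
tail = f.read()                      # read everything from there to EOF

print(pos_start, first, pos_cur, pos_end, tail)
```

Each seek() call returns the new absolute position, so the return value can be used directly instead of a follow-up tell().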

seekable() -> bool

Return whether object supports random access.

If False, seek(), tell() and truncate() will raise OSError. This method may need to do a test seek().

size() -> int

Returns the size of the blob in bytes.

tell() -> int

Return current stream position.
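
To make the idea of a lazily reading file-like object concrete, here is a minimal sketch of a RawIOBase subclass that fetches only the byte ranges a caller asks for. LazyRangeFile and fetch_range are hypothetical names used for illustration; this is not how lance.BlobFile is implemented, only one way an object with the same interface could behave.

```python
import io
import os


class LazyRangeFile(io.RawIOBase):
    """Hypothetical stand-in for a lazily read blob; not part of Lance."""

    def __init__(self, fetch_range, size):
        super().__init__()
        # fetch_range(start, end) -> bytes is a hypothetical callback that
        # returns bytes [start, end) from wherever the blob is stored.
        self._fetch_range = fetch_range
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def size(self):
        return self._size

    def tell(self):
        return self._pos

    def seek(self, offset, whence=os.SEEK_SET):
        if whence == os.SEEK_SET:
            new_pos = offset
        elif whence == os.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == os.SEEK_END:
            new_pos = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if new_pos < 0:
            raise ValueError("negative seek position")
        self._pos = new_pos
        return self._pos

    def readinto(self, buffer):
        # Fetch only the bytes the caller asked for, clamped to the blob size.
        end = min(self._pos + len(buffer), self._size)
        if end <= self._pos:
            return 0  # at or past EOF
        chunk = self._fetch_range(self._pos, end)
        buffer[: len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


payload = bytes(range(256)) * 4  # 1024 bytes standing in for a large blob
requests = []  # records which byte ranges were actually fetched


def fetch_range(start, end):
    requests.append((start, end))
    return payload[start:end]


blob = LazyRangeFile(fetch_range, len(payload))
blob.seek(1000)
tail = blob.read(8)  # only bytes [1000, 1008) are fetched
```

Because read() and readall() are inherited from RawIOBase and built on readinto(), a consumer such as a video decoder can seek and read small ranges without the whole blob ever being materialized in memory.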

To fetch blobs from a Lance dataset, use lance.dataset.LanceDataset.take_blobs(), which returns the requested blobs as BlobFile objects.

For example, it’s easy to use BlobFile to extract frames from a video file without loading the entire video into memory.

# pip install av pylance ipython

import av
import lance
from IPython.display import clear_output, display

ds = lance.dataset("./youtube.lance")
start_time, end_time = 500, 1000  # seconds
blobs = ds.take_blobs([5], "video")
with av.open(blobs[0]) as container:
    stream = container.streams.video[0]
    # Decode key frames only, which is much cheaper than decoding every frame.
    stream.codec_context.skip_frame = "NONKEY"

    # container.seek() expects an integer offset in units of stream.time_base.
    start_pts = int(start_time / stream.time_base)
    container.seek(start_pts, stream=stream)

    for frame in container.decode(stream):
        # frame.time is in seconds, so compare it with end_time directly.
        if frame.time > end_time:
            break
        display(frame.to_image())
        clear_output(wait=True)