Blob As Files
Unlike other data formats, the Lance columnar format treats large multimodal data as a first-class citizen. Lance provides a high-level API to store and retrieve large binary objects (blobs) in Lance datasets.
Lance serves large binary data using lance.BlobFile, a file-like object that reads large binary objects lazily.
- class lance.BlobFile(inner: LanceBlobFile)
Bases: RawIOBase
Represents a blob in a Lance dataset as a file-like object.
- close() → None
Flush and close the IO object.
This method has no effect if the file is already closed.
- readable() → bool
Return whether object was opened for reading.
If False, read() will raise OSError.
- readall() → bytes
Read until EOF, using multiple read() calls.
- seek(offset: int, whence: int = 0) → int
Change the stream position to the given byte offset.
Parameters:
offset: The stream position, relative to whence.
whence: The relative position to seek from.
The offset is interpreted relative to the position indicated by whence. Values for whence are:
os.SEEK_SET or 0 – start of stream (the default); offset should be zero or positive
os.SEEK_CUR or 1 – current stream position; offset may be negative
os.SEEK_END or 2 – end of stream; offset is usually negative
Return the new absolute position.
- seekable() → bool
Return whether object supports random access.
If False, seek(), tell() and truncate() will raise OSError. This method may need to do a test seek().
- size() → int
Returns the size of the blob in bytes.
- tell() → int
Return current stream position.
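Because BlobFile derives from RawIOBase, code written against Python's standard binary file interface should work with it unchanged. The following is a minimal sketch of reading a byte range and the tail of a blob using only seek(), tell() and read(). The helper names read_exact, read_range and read_tail are hypothetical (they are not part of Lance), and an in-memory io.BytesIO object stands in for a real BlobFile so that the snippet runs without a dataset.

```python
import io
import os


def read_exact(f, length):
    """Read up to `length` bytes, looping because a raw stream's
    read(n) is allowed to return fewer than n bytes per call."""
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = f.read(remaining)
        if not chunk:  # b"" (EOF) or None (no data available)
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_range(f, start, length):
    """Read `length` bytes beginning at absolute offset `start`."""
    f.seek(start, os.SEEK_SET)
    return read_exact(f, length)


def read_tail(f, length):
    """Read the last `length` bytes by seeking relative to the end."""
    f.seek(-length, os.SEEK_END)
    return read_exact(f, length)


# Stand-in for a BlobFile: any seekable binary file-like object works here.
blob = io.BytesIO(b"0123456789abcdef")

header = read_range(blob, 0, 4)   # first four bytes
position = blob.tell()            # stream position after the read
tail = read_tail(blob, 3)         # last three bytes
```

With a real dataset, a BlobFile could be passed in place of the io.BytesIO object. Since BlobFile reads lazily, a ranged read of this kind should avoid pulling the whole blob into memory.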
To fetch blobs from a Lance dataset, use lance.dataset.LanceDataset.take_blobs().
For example, BlobFile can be used to extract frames from a video file without loading the entire video into memory.
# pip install av pylance
import av
import lance
from IPython.display import clear_output, display  # assumes a Jupyter/IPython session

ds = lance.dataset("./youtube.lance")
start_time, end_time = 500, 1000  # seconds
blobs = ds.take_blobs([5], "video")
with av.open(blobs[0]) as container:
    stream = container.streams.video[0]
    # Decode key frames only; skipping the rest makes sparse sampling faster.
    stream.codec_context.skip_frame = "NONKEY"

    # container.seek() expects an offset in units of stream.time_base.
    start_pts = int(start_time / stream.time_base)
    container.seek(start_pts, stream=stream)

    for frame in container.decode(stream):
        # frame.time is in seconds, so compare it with end_time in seconds.
        if frame.time > end_time:
            break
        display(frame.to_image())
        clear_output(wait=True)
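The seek offset passed to container.seek() with a stream argument is measured in units of that stream's time_base, which PyAV exposes as a fractions.Fraction, while frame.time is reported in seconds. The sketch below shows the conversion in both directions using only the standard library. The function names seconds_to_pts and pts_to_seconds and the two time_base values are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def seconds_to_pts(seconds, time_base):
    """Convert seconds to an integer timestamp in time_base units."""
    return int(Fraction(seconds) / time_base)


def pts_to_seconds(pts, time_base):
    """Convert a timestamp in time_base units back to seconds."""
    return float(pts * time_base)


# Hypothetical time bases; real values come from stream.time_base.
simple_base = Fraction(1, 15360)
ntsc_base = Fraction(1001, 30000)

start_pts = seconds_to_pts(500, simple_base)
ntsc_pts = seconds_to_pts(500, ntsc_base)

# Taking only the numerator of the quotient matches the truncated value
# when the quotient is a whole number, but not otherwise.
numerator_simple = (Fraction(500) / simple_base).numerator
numerator_ntsc = (Fraction(500) / ntsc_base).numerator
```

Truncating the quotient with int() gives a valid timestamp for any time base, whereas the numerator alone is correct only when the time base divides the requested time evenly.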