Video format (`LeRobotLanceVideoDataset`)¶

Keeps the original mp4 bytes verbatim in a Lance column with the blob v2 encoding. At read time, Dataset.take_blobs returns the bytes as a file-like object; we hand them to torchcodec and ask for the specific frames we need.

This is the closest the Lance loader gets to the upstream parquet+mp4 path:

Same pixels (bit-exact).
Same disk size (within 0.5 % on every dataset we measured).
One self-contained Lance dataset — no separate mp4 file tree.

When to use¶

Your source dataset is dtype=video (most lerobot datasets).
You care about byte-exact training-accuracy parity with upstream.
You don't want to ship a separate mp4 directory tree alongside the Lance data.

If your source is dtype=image (lerobot/pusht_image, etc.), there's no mp4 to copy — use convert_to_lance instead.

Schema¶

<table>.lance/                   # one row per frame
  episode_index:     int32
  frame_index:       int32
  index:             int64
  timestamp:         float32
  task_index:        int32
  observation.state: list<float32>[D]
  action:            list<float32>[A]
  ...other tabular features
  (no image columns)

<table>_videos.lance/            # one row per source mp4 file
  video_key:    string           # e.g. "observation.images.cam_high"
  chunk_index:  int32
  file_index:   int32
  video_bytes:  large_binary     # metadata: lance-encoding:blob = true

Each row in the videos table is one unique mp4 file. LeRobot lays out source mp4s as videos/{video_key}/chunk-NNN/file-MMM.mp4 and stores (chunk_index, file_index, from_timestamp) per episode in meta/episodes/*.parquet; the reader joins on those.

Why (video_key, chunk, file) instead of just (chunk, file)?

On bimanual ALOHA datasets all cameras share the same mp4 file for a given episode. On single-arm setups like Koch the laptop and phone cameras can use different file indices for the same episode. The row identity has to include the camera key.

Reader¶

from lerobot_lancedb import LeRobotLanceVideoDataset

ds = LeRobotLanceVideoDataset(root="./aloha_cups_open_lance_video")

sample = ds[0]
sample["observation.images.cam_high"].shape   # (C, H, W) uint8 (or float)
sample["action"].shape                        # (A,) float32

Same constructor surface as the frames format — root=, repo_id=, uri= + meta_root=, plus delta_timestamps.

How a frame read works¶

For each __getitems__([i0, i1, ...]):

Look up tabular fields for the requested frames from the per-frame table.
For each unique (video_key, chunk, file) triple needed by this batch, fetch the blob via take_blobs(blob_column="video_bytes", indices=[...]).
Cache the resulting torchcodec.VideoDecoder per (chunk, file, video_key) in a per-worker LRU.
Compute the in-file frame index as round((timestamp + from_timestamp) * average_fps) and decode.

For shuffled training this means at most one decoder setup per file per worker, regardless of how many frames the batch pulls.

`delta_timestamps`¶

ds = LeRobotLanceVideoDataset(
    root="./aloha_cups_open_lance_video",
    delta_timestamps={
        "observation.images.cam_high": [-0.1, -0.05, 0.0],
        "action":                       [0.0, 0.05, 0.1, 0.15, 0.2],
    },
)

Multiple timestamps for the same camera collapse into a single decoder.get_frames_at(indices=[...]) call.

Decoder cache size¶

LeRobotLanceVideoDataset(root="...", decoder_cache_size=16)

decoder_cache_size bounds the per-worker decoder LRU (default 16). Bigger caches help when many episodes share a few files (typical for ALOHA-style consolidated mp4s).

GPU decode¶

There's no decode_device="cuda" shortcut yet. torchcodec's CUDA decode path needs:

A torchcodec build linked against the NVIDIA Video Codec SDK (the default pip wheel ships the ffmpeg variant, which raises Unsupported device: cuda).
A GPU that NVDECs the source codec (AV1 NVDEC requires RTX 40-series / Hopper or newer).

Once both are in place, wiring it through is a one-line change (we already pass device= to VideoDecoder).

Cloud auth¶

Same env vars as the frames format:

S3 — AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION
GCS — GOOGLE_APPLICATION_CREDENTIALS
HF Hub — HF_TOKEN (or huggingface-cli login)

ds = LeRobotLanceVideoDataset(
    uri="s3://bucket/path/aloha_cups_open.lance",
    meta_root="./aloha_cups_open_meta",
)

Because blob v2 columns aren't materialized into Arrow buffers, the reader only pays a byte-range fetch for the specific mp4 file each access touches — well-suited to remote object stores.

Trade-offs vs the frames format¶

	Frames format	Video format
Bit-exact pixels	only with `--jpeg-quality=100 --jpeg-subsampling=0` (near-bit-exact)	always
Disk size	5–25× larger than upstream	~same as upstream
Single-frame throughput	fastest (NVJPEG-eligible)	slower (torchcodec seek + decode)
`delta_timestamps` throughput	fast	~tied with JPEG-95, faster than upstream
GPU decode	NVJPEG (`decode_device="auto"`)	needs CUDA-built torchcodec
Cloud-friendly	yes, byte-range fetches	yes, byte-range fetches

See Benchmarks for measured numbers.

Video format (LeRobotLanceVideoDataset)¶