Benchmarks¶

Hardware: H100 80 GB box, CPU decode (no NVJPEG) for apples-to-apples across formats.

Read pattern: delta_timestamps with 8 frames per sample — the realistic training read shape. Single-frame throughput is much higher across the board, but it's not what training pipelines actually pay for.

Settings:

batch_size=32
num_workers=4
30 batches (5 warmup)
Pixel diffs sampled across 16 random frames per camera, averaged

Reproduce with examples/benchmark_formats.py.

Throughput and size¶

`lerobot/pusht` (synthetic 96×96, 1-cam)¶

format	size MB	delta_ts fps	speedup	bit-exact?
upstream parquet+mp4	7.3	750	1.00×	✓
`convert_to_lance` (JPEG-95)	60.0	3510	4.68×	✗
`convert_to_lance --jpeg-quality=100 --jpeg-subsampling=0`	105.6	2909	3.88×	✗
`convert_to_lance_video`	8.0	2853	3.80×	✓

`lerobot/aloha_static_cups_open` (real 480×640, 4-cam bimanual)¶

format	size MB	delta_ts fps	speedup	bit-exact?
upstream parquet+mp4	485.6	18.7	1.00×	✓
`convert_to_lance` (JPEG-95)	3 626	46.0	2.46×	✗
`convert_to_lance --jpeg-quality=100 --jpeg-subsampling=0`	8 735	32.5	1.74×	✗
`convert_to_lance_video`	487.4	45.6	2.44×	✓

`lerobot/koch_pick_place_5_lego` (real 480×640, 2-cam single-arm)¶

format	size MB	delta_ts fps	speedup	bit-exact?
upstream parquet+mp4	2 014	26.6	1.00×	✓
`convert_to_lance` (JPEG-95)	8 541	70.8	2.66×	✗
`convert_to_lance --jpeg-quality=100 --jpeg-subsampling=0`	17 335	49.0	1.84×	✗
`convert_to_lance_video`	2 016	53.8	2.02×	✓

Pixel fidelity¶

How much each lossy format actually changes pixels (across 16 random frames per camera, averaged):

dataset	jpeg-95 mean abs / visible %	jpeg-100 + 4:4:4 mean abs / visible %	video-blob
pusht	0.0020 / 6.2 %	0.0003 / 0.07 %	bit-exact
aloha cups_open	0.0021 / 1.4 %	0.0012 / 0.06 %	bit-exact
koch lego	0.0047 / 13.5 %	0.0016 / 0.14 %	bit-exact

"Visible %" is the fraction of pixels whose absolute diff exceeds 2/255 vs the upstream source — the threshold where you'd see the difference by eye.

JPEG-95 fidelity varies dramatically with content: ALOHA's natural backgrounds compress cleanly, Koch's high-contrast Lego scenes ring badly. Resolution alone doesn't predict the artifact level.

Training-accuracy parity¶

pusht — `DiffusionPolicy`, 200k steps, env eval¶

Recipe: lerobot/diffusion_pusht, seed=42, eval is 500 gym-pusht rollouts at seed=100000.

storage format	env success rate	avg max overlap
Lance JPEG-95	58.0 %	0.919
Lance video-blob	68.4 %	0.936
upstream parquet+mp4 (head-to-head)	68.0 %	0.9586
HF model card (seed=100000)	65.4 %	0.955

ALOHA cups_open — ACT, 30k steps, held-out action MSE¶

Recipe: ACT defaults + ImageNet image norm + grad-clip 10, seed=42, num_workers=4. Held-out MSE on the last 10 % of episodes.

storage format	train loss @ 30k	held-out RMSE
Lance JPEG-95 (default)	0.0962	0.0927
Lance JPEG-100 + 4:4:4	0.0961	0.0872
Lance video-blob	0.0972	0.0901

Reproducers: examples/train_and_eval_lance.py (pusht) and examples/aloha_loader_parity.py (ALOHA).

Benchmarks¶

Throughput and size¶

lerobot/pusht (synthetic 96×96, 1-cam)¶

lerobot/aloha_static_cups_open (real 480×640, 4-cam bimanual)¶

lerobot/koch_pick_place_5_lego (real 480×640, 2-cam single-arm)¶