Benchmarks
Hardware: H100 80 GB box, CPU decode (no NVJPEG) for apples-to-apples across formats.
Read pattern: delta_timestamps with 8 frames per sample — the realistic training read shape. Single-frame throughput is much higher across the board, but it's not what training pipelines actually pay for.
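To make the read shape concrete, here is a minimal sketch of how a list of time offsets could expand one sample into eight frame reads. The function name `delta_indices`, the 10 fps rate, the choice of "current frame plus seven preceding frames", and the clamping at episode boundaries are all illustrative assumptions, not the benchmark's actual configuration.

```python
def delta_indices(anchor_idx, deltas_s, fps, episode_start, episode_end):
    """Map time offsets (seconds) around an anchor frame to frame indices.

    Indices are clamped to [episode_start, episode_end) so a window never
    reads across an episode boundary (an assumption of this sketch).
    """
    indices = []
    for dt in deltas_s:
        idx = anchor_idx + round(dt * fps)
        idx = max(episode_start, min(idx, episode_end - 1))
        indices.append(idx)
    return indices


FPS = 10  # hypothetical camera rate
# Eight offsets: the seven preceding frames plus the current one.
DELTAS = [-(7 - i) / FPS for i in range(8)]

# Near the start of an episode the window is padded by repeating frame 0.
early_window = delta_indices(3, DELTAS, FPS, episode_start=0, episode_end=100)
# In the middle of an episode it is eight consecutive frames.
mid_window = delta_indices(50, DELTAS, FPS, episode_start=0, episode_end=100)
```

Each training sample therefore costs eight decodes per camera rather than one, which is why single-frame throughput overstates what a training loop sees.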
Settings:
- batch_size=32
- num_workers=4
- 30 batches (5 warmup)
- Pixel diffs sampled across 16 random frames per camera, averaged
Reproduce with examples/benchmark_formats.py.
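The settings above amount to a warmup-then-time loop. The following is a rough sketch of that measurement, not the contents of examples/benchmark_formats.py. It assumes the 30 batches include the 5 warmup batches and that throughput counts batch_size × 8 frames per batch; `measure_fps` and `synthetic_loader` are hypothetical names.

```python
import time


def measure_fps(batches, frames_per_batch, total_batches=30, warmup=5):
    """Return frames per second over the timed batches, skipping warmup."""
    it = iter(batches)
    # Warmup batches absorb worker startup and cold caches; they are not timed.
    for _ in range(warmup):
        next(it)
    timed = total_batches - warmup
    start = time.perf_counter()
    for _ in range(timed):
        next(it)
    elapsed = time.perf_counter() - start
    return timed * frames_per_batch / elapsed


def synthetic_loader(delay_s=0.001):
    """Stand-in for a real data loader: yields empty batches after a delay."""
    while True:
        time.sleep(delay_s)
        yield None


# batch_size=32 samples, 8 frames per sample (assumed accounting).
fps = measure_fps(synthetic_loader(), frames_per_batch=32 * 8)
```

Passing a real loader in place of `synthetic_loader()` would give a number comparable in spirit to the delta_ts fps column, subject to the assumptions above.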
Throughput and size
lerobot/pusht (synthetic 96×96, 1-cam)
| format | size MB | delta_ts fps | speedup | bit-exact? |
|---|---|---|---|---|
| upstream parquet+mp4 | 7.3 | 750 | 1.00× | ✓ |
| convert_to_lance (JPEG-95) | 60.0 | 3510 | 4.68× | ✗ |
| convert_to_lance --jpeg-quality=100 --jpeg-subsampling=0 | 105.6 | 2909 | 3.88× | ✗ |
| convert_to_lance_video | 8.0 | 2853 | 3.80× | ✓ |
lerobot/aloha_static_cups_open (real 480×640, 4-cam bimanual)
| format | size MB | delta_ts fps | speedup | bit-exact? |
|---|---|---|---|---|
| upstream parquet+mp4 | 485.6 | 18.7 | 1.00× | ✓ |
| convert_to_lance (JPEG-95) | 3 626 | 46.0 | 2.46× | ✗ |
| convert_to_lance --jpeg-quality=100 --jpeg-subsampling=0 | 8 735 | 32.5 | 1.74× | ✗ |
| convert_to_lance_video | 487.4 | 45.6 | 2.44× | ✓ |
lerobot/koch_pick_place_5_lego (real 480×640, 2-cam single-arm)
| format | size MB | delta_ts fps | speedup | bit-exact? |
|---|---|---|---|---|
| upstream parquet+mp4 | 2 014 | 26.6 | 1.00× | ✓ |
| convert_to_lance (JPEG-95) | 8 541 | 70.8 | 2.66× | ✗ |
| convert_to_lance --jpeg-quality=100 --jpeg-subsampling=0 | 17 335 | 49.0 | 1.84× | ✗ |
| convert_to_lance_video | 2 016 | 53.8 | 2.02× | ✓ |
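The speedup column is each format's delta_ts fps divided by the upstream fps. The same division applied to the size column gives the storage cost of that speedup. A small sketch using the koch_pick_place_5_lego numbers from the table above (the helper name `relative_to_baseline` is illustrative):

```python
# (size in MB, delta_ts fps) copied from the koch_pick_place_5_lego table.
KOCH = {
    "upstream parquet+mp4": (2014, 26.6),
    "convert_to_lance (JPEG-95)": (8541, 70.8),
    "convert_to_lance --jpeg-quality=100 --jpeg-subsampling=0": (17335, 49.0),
    "convert_to_lance_video": (2016, 53.8),
}


def relative_to_baseline(rows, baseline="upstream parquet+mp4"):
    """Return {format: (speedup, size_ratio)} relative to the baseline row."""
    base_size, base_fps = rows[baseline]
    return {
        name: (round(fps / base_fps, 2), round(size / base_size, 2))
        for name, (size, fps) in rows.items()
    }


ratios = relative_to_baseline(KOCH)
for name, (speedup, size_ratio) in ratios.items():
    print(f"{name}: {speedup}x faster, {size_ratio}x the size")
```

On this dataset the JPEG-95 conversion is about 2.66× faster at roughly 4.24× the size, while the video conversion is about 2.02× faster at essentially the same size.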
Pixel fidelity
How much each lossy format actually changes pixels (across 16 random frames per camera, averaged):
| dataset | jpeg-95 mean abs / visible % | jpeg-100 + 4:4:4 mean abs / visible % | video-blob |
|---|---|---|---|
| pusht | 0.0020 / 6.2 % | 0.0003 / 0.07 % | bit-exact |
| aloha cups_open | 0.0021 / 1.4 % | 0.0012 / 0.06 % | bit-exact |
| koch lego | 0.0047 / 13.5 % | 0.0016 / 0.14 % | bit-exact |
"Visible %" is the fraction of pixels whose absolute diff exceeds 2/255 vs the upstream source — the threshold where you'd see the difference by eye.
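As a minimal sketch of the two metrics, assuming 8-bit frames, a mean absolute difference reported on a [0, 1] scale, and every channel value counted as one "pixel" (none of which the page states explicitly), the per-frame computation could look like this. The function name `pixel_fidelity` is hypothetical.

```python
import numpy as np

VISIBLE_LEVELS = 2  # the 2/255 threshold, expressed in 8-bit levels


def pixel_fidelity(reference_u8, decoded_u8):
    """Return (mean_abs_diff on a [0, 1] scale, visible_percent) for one frame."""
    # Widen to int16 first so the subtraction cannot wrap around in uint8.
    diff = np.abs(reference_u8.astype(np.int16) - decoded_u8.astype(np.int16))
    mean_abs = float(diff.mean()) / 255.0
    # "Visible" means strictly more than 2 levels away from the source.
    visible_pct = float((diff > VISIBLE_LEVELS).mean()) * 100.0
    return mean_abs, visible_pct


# Tiny synthetic example: a 4×4 RGB frame with two altered pixels.
reference = np.zeros((4, 4, 3), dtype=np.uint8)
decoded = reference.copy()
decoded[0, 0, :] = 3  # above the threshold, counted as visible
decoded[1, 1, :] = 2  # exactly at the threshold, not counted
mean_abs, visible_pct = pixel_fidelity(reference, decoded)
```

Comparing in integer levels avoids floating-point rounding at exactly 2/255. Averaging the per-frame results over the 16 sampled frames per camera would then give one table cell.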
JPEG-95 fidelity varies dramatically with content: ALOHA's natural backgrounds compress cleanly, Koch's high-contrast Lego scenes ring badly. Resolution alone doesn't predict the artifact level.
Training-accuracy parity
pusht — DiffusionPolicy, 200k steps, env eval
Recipe: lerobot/diffusion_pusht, seed=42, eval is 500 gym-pusht rollouts at seed=100000.
| storage format | env success rate | avg max overlap |
|---|---|---|
| Lance JPEG-95 | 58.0 % | 0.919 |
| Lance video-blob | 68.4 % | 0.936 |
| upstream parquet+mp4 (head-to-head) | 68.0 % | 0.9586 |
| HF model card (seed=100000) | 65.4 % | 0.955 |
ALOHA cups_open — ACT, 30k steps, held-out action MSE
Recipe: ACT defaults + ImageNet image norm + grad-clip 10, seed=42, num_workers=4. Held-out MSE on the last 10 % of episodes.
| storage format | train loss @ 30k | held-out RMSE |
|---|---|---|
| Lance JPEG-95 (default) | 0.0962 | 0.0927 |
| Lance JPEG-100 + 4:4:4 | 0.0961 | 0.0872 |
| Lance video-blob | 0.0972 | 0.0901 |
Reproducers: examples/train_and_eval_lance.py (pusht) and examples/aloha_loader_parity.py (ALOHA).
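For orientation, the ALOHA held-out protocol (train on the first episodes, evaluate action error on the last 10 %) can be sketched as below. This is an illustration and not the code in examples/aloha_loader_parity.py. The function names, the rounding rule for the split, and the 50-episode count are assumptions.

```python
import math


def split_last_fraction(num_episodes, held_out_fraction=0.10):
    """Return (train_ids, held_out_ids), holding out the last episodes."""
    n_held = max(1, round(num_episodes * held_out_fraction))
    cut = num_episodes - n_held
    return list(range(cut)), list(range(cut, num_episodes))


def mse(predicted, target):
    """Mean squared error over two equal-length flat sequences of actions."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def rmse(predicted, target):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(predicted, target))


# Hypothetical dataset with 50 episodes: the last 5 are held out.
train_ids, held_out_ids = split_last_fraction(50)
```

Splitting by episode rather than by frame keeps frames from the same trajectory out of both sets at once.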