Enable video_reader backend #220

Merged: 26 commits merged into main from user/aliberts/2024_05_28_compile_torchvision on Jun 19, 2024

Conversation

@aliberts (Collaborator) commented on May 28, 2024

What this does

This enables torchvision's — still experimental — video_reader backend for faster video decoding.

  • In order to use it, torchvision has to be built from source using these instructions. This PR changes a few things in the dev docker image so that those instructions can be followed there.
  • Adds a video_backend option in the config and as a LeRobotDataset (and MultiLeRobotDataset) argument to select between pyav and video_reader (defaults to pyav as before); see the sketch after this list.
  • Refactors the video benchmark.
  • Adds changes from Add capture_camera_feed #267
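
For illustration, selecting the backend from Python might look like the sketch below. The import path and the repo_id are assumptions made for this example; only the video_backend argument itself is what this PR adds.

import torchvision

# Import path assumed from the repository layout at the time of this PR.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Defaults to "pyav" as before; "video_reader" requires torchvision built from source.
dataset = LeRobotDataset("lerobot/pusht", video_backend="video_reader")

# For reference, the backend name maps onto torchvision's own backend switch,
# which is also what the verification step further below uses:
torchvision.set_video_backend("video_reader")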

How it was tested

python lerobot/common/datasets/_video_benchmark/run_video_benchmark.py 

with

# Parameters (and their values) swept by the benchmark script.
BENCHMARKS = {
    "backend": ["pyav", "video_reader"],
}

Quality metrics (avg_per_pixel_l2_error, avg_psnr, avg_ssim, avg_mse) are identical between the two backends. Loading time is generally improved by a factor of ~1.5 with video_reader.

1_frame

| repo_id | image_size | backend | compression_factor | load_time_factor | avg_per_pixel_l2_error | avg_psnr | avg_ssim | avg_mse |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lerobot/pusht_image | 96 x 96 | pyav | 3.619 | 0.175 | 0.0000439 | 42.819 | 0.996 | 0.0000542 |
| lerobot/pusht_image | 96 x 96 | video_reader | 3.619 | 0.158 | 0.0000439 | 42.819 | 0.996 | 0.0000542 |
| aliberts/aloha_mobile_shrimp_image | 480 x 640 | pyav | 24.438 | 0.510 | 0.0000114 | 39.316 | 0.971 | 0.0001237 |
| aliberts/aloha_mobile_shrimp_image | 480 x 640 | video_reader | 24.438 | 0.277 | 0.0000114 | 39.316 | 0.971 | 0.0001237 |
| aliberts/paris_street | 720 x 1280 | pyav | 28.117 | 0.676 | 0.0000122 | 35.622 | 0.947 | 0.001 |
| aliberts/paris_street | 720 x 1280 | video_reader | 28.117 | 0.375 | 0.0000122 | 35.622 | 0.947 | 0.001 |
| aliberts/kitchen | 1080 x 1920 | pyav | 43.394 | 0.781 | 0.0000048 | 38.808 | 0.965 | 0.0001926 |
| aliberts/kitchen | 1080 x 1920 | video_reader | 43.394 | 0.459 | 0.0000048 | 38.808 | 0.965 | 0.0001926 |

2_frames

| repo_id | image_size | backend | compression_factor | load_time_factor | avg_per_pixel_l2_error | avg_psnr | avg_ssim | avg_mse |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lerobot/pusht_image | 96 x 96 | pyav | 3.619 | 0.476 | 0.0000492 | 42.236 | 0.996 | 0.0000821 |
| lerobot/pusht_image | 96 x 96 | video_reader | 3.619 | 0.294 | 0.0000492 | 42.236 | 0.996 | 0.0000821 |
| aliberts/aloha_mobile_shrimp_image | 480 x 640 | pyav | 24.438 | 0.820 | 0.0000121 | 39.058 | 0.970 | 0.0001577 |
| aliberts/aloha_mobile_shrimp_image | 480 x 640 | video_reader | 24.438 | 0.510 | 0.0000121 | 39.058 | 0.970 | 0.0001577 |
| aliberts/paris_street | 720 x 1280 | pyav | 28.117 | 0.955 | 0.0000124 | 35.386 | 0.946 | 0.001 |
| aliberts/paris_street | 720 x 1280 | video_reader | 28.117 | 0.619 | 0.0000124 | 35.386 | 0.946 | 0.001 |
| aliberts/kitchen | 1080 x 1920 | pyav | 43.394 | 1.103 | 0.0000053 | 38.502 | 0.964 | 0.0002850 |
| aliberts/kitchen | 1080 x 1920 | video_reader | 43.394 | 0.763 | 0.0000053 | 38.502 | 0.964 | 0.0002850 |

2_frames_4_space

| repo_id | image_size | backend | compression_factor | load_time_factor | avg_per_pixel_l2_error | avg_psnr | avg_ssim | avg_mse |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lerobot/pusht_image | 96 x 96 | pyav | 3.619 | 0.373 | 0.0000552 | 41.785 | 0.995 | 0.0001220 |
| lerobot/pusht_image | 96 x 96 | video_reader | 3.619 | 0.267 | 0.0000552 | 41.785 | 0.995 | 0.0001220 |
| aliberts/aloha_mobile_shrimp_image | 480 x 640 | pyav | 24.438 | 0.522 | 0.0000115 | 39.246 | 0.971 | 0.0001252 |
| aliberts/aloha_mobile_shrimp_image | 480 x 640 | video_reader | 24.438 | 0.394 | 0.0000115 | 39.246 | 0.971 | 0.0001252 |
| aliberts/paris_street | 720 x 1280 | pyav | 28.117 | 0.575 | 0.0000182 | 34.399 | 0.917 | 0.004 |
| aliberts/paris_street | 720 x 1280 | video_reader | 28.117 | 0.443 | 0.0000182 | 34.399 | 0.917 | 0.004 |
| aliberts/kitchen | 1080 x 1920 | pyav | 43.394 | 0.669 | 0.0000056 | 38.204 | 0.964 | 0.0003120 |
| aliberts/kitchen | 1080 x 1920 | video_reader | 43.394 | 0.538 | 0.0000056 | 38.204 | 0.964 | 0.0003120 |

6_frames

| repo_id | image_size | backend | compression_factor | load_time_factor | avg_per_pixel_l2_error | avg_psnr | avg_ssim | avg_mse |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lerobot/pusht_image | 96 x 96 | pyav | 3.619 | 0.895 | 0.0000538 | 41.820 | 0.995 | 0.0001097 |
| lerobot/pusht_image | 96 x 96 | video_reader | 3.619 | 0.700 | 0.0000538 | 41.820 | 0.995 | 0.0001097 |
| aliberts/aloha_mobile_shrimp_image | 480 x 640 | pyav | 24.438 | 1.115 | 0.0000124 | 38.940 | 0.969 | 0.0001784 |
| aliberts/aloha_mobile_shrimp_image | 480 x 640 | video_reader | 24.438 | 0.834 | 0.0000124 | 38.940 | 0.969 | 0.0001784 |
| aliberts/paris_street | 720 x 1280 | pyav | 28.117 | 1.344 | 0.0000164 | 34.585 | 0.927 | 0.003 |
| aliberts/paris_street | 720 x 1280 | video_reader | 28.117 | 1.050 | 0.0000164 | 34.585 | 0.927 | 0.003 |
| aliberts/kitchen | 1080 x 1920 | pyav | 43.394 | 1.543 | 0.0000061 | 37.760 | 0.963 | 0.0004247 |
| aliberts/kitchen | 1080 x 1920 | video_reader | 43.394 | 1.308 | 0.0000061 | 37.760 | 0.963 | 0.0004247 |

I also did a run to reproduce the pretrained ACT policy on the aloha transfer-cube task with:

python lerobot/scripts/train.py \
  hydra.job.name=act_aloha_sim_transfer_cube_human \
  hydra.run.dir=outputs/train/act_aloha_sim_transfer_cube_human \
  policy=act \
  policy.use_vae=true \
  env=aloha \
  env.task=AlohaTransferCube-v0 \
  dataset_repo_id=lerobot/aloha_sim_transfer_cube_human \
  training.eval_freq=10000 \
  training.log_freq=250 \
  training.offline_steps=100000 \
  training.save_checkpoint=true \
  training.save_freq=25000 \
  eval.n_episodes=50 \
  eval.batch_size=50 \
  wandb.enable=true \
  device=cuda \
  video_backend=video_reader

The full run is available on wandb here


How to checkout & try? (for the reviewer)

You first need to compile torchvision from source. The original instructions are available there, but I'll recap them here since some of them are not up to date:

  • Download the latest nvidia-video-codec-sdk and extract the zipped file.
  • Set these environment variables, needed for building torchvision against the SDK:
    • TORCHVISION_INCLUDE: the location of the video codec headers (nvcuvid.h and cuviddec.h), found under the Interface directory.
    • TORCHVISION_LIBRARY: the location of the video codec library (libnvcuvid.so), found under the Lib/linux/stubs/x86_64 directory.
    • CUDA_HOME: the CUDA root directory.

You can set all of these with the following command (assuming you unzipped the codec SDK in $HOME and your CUDA installation is at /usr/local/cuda, which is generally where it is):

export TORCHVISION_INCLUDE=$HOME/Video_Codec_SDK_12.2.72/Interface/ \
    && export TORCHVISION_LIBRARY=$HOME/Video_Codec_SDK_12.2.72/Lib/linux/stubs/x86_64/ \
    && export CUDA_HOME=/usr/local/cuda
  • Install ffmpeg<4.3 from the conda-forge channel:
conda install -c conda-forge "ffmpeg<4.3"
  • Git-clone torchvision and make the following change in your pyproject.toml so that it points to your local clone:
- torchvision = ">=0.17.1"
+ torchvision = { path = "../../path/to/vision" }
  • Run the following to update your dependencies and build torchvision
poetry lock --no-cache --no-update && poetry install --sync --all-extras

If you don't use poetry, you can simply do this instead (add the extras you need)

pip install .
  • Ensure that video_reader works with this (it shouldn't raise any error):
python -c "import torchvision; torchvision.set_video_backend('video_reader');"

Try any run with the option video_backend=video_reader (it will default to pyav if not specified), e.g.

python lerobot/scripts/train.py \
    policy=act \
    policy.dim_model=64 \
    env=aloha \
    wandb.enable=False \
    training.offline_steps=2 \
    training.online_steps=0 \
    eval.n_episodes=1 \
    eval.batch_size=1 \
    device=cpu \
    training.save_checkpoint=true \
    training.save_freq=2 \
    policy.n_action_steps=20 \
    policy.chunk_size=20 \
    training.batch_size=2 \
    hydra.run.dir=tests/outputs/act/ \
    video_backend=video_reader


@aliberts added the 🗃️ Dataset and ⚙️ Infra/CI labels on May 28, 2024
@aliberts self-assigned this on May 28, 2024
@aliberts marked this pull request as ready for review on May 30, 2024 17:42
@aliberts added the ⚡️ Performance label on May 30, 2024
@aliberts force-pushed the user/aliberts/2024_05_28_compile_torchvision branch from 794b413 to ae0d5c9 on June 14, 2024 08:49
These parameters and their values are specified in the BENCHMARKS dict.

All of these benchmarks are evaluated within different timestamps modes corresponding to different frame-loading scenarios:
- `1_frame`: 1 single frame is loaded.
Contributor:

Could you explain what you are trying to achieve with these variations? I'd want to know why I shouldn't add 2_frames_6_spaces. I'd want to know why 6_frames tests something fundamentally different to 2_frames and why you didn't also do 20_frames.

Collaborator Author:

These were already present in the script and were done by Rémi; I just added a bit of documentation.

I think the idea is to have different common scenarios that can reflect a typical workload during training (e.g. with delta_timestamps), but I can't answer as to why these values specifically.
@Cadene care to shed some light?
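
(For illustration only: a workload like the 2_frames_4_space mode could be requested with delta_timestamps roughly as in the sketch below; the import path, key name, repo_id, and fps are hypothetical.)

# Import path assumed, as in the earlier sketch.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Hypothetical example: at 10 fps, a spacing of 4 frames is 0.4 s, so asking for the
# current frame plus one frame 0.4 s earlier mirrors the 2_frames_4_space scenario.
delta_timestamps = {"observation.image": [-0.4, 0.0]}
dataset = LeRobotDataset("lerobot/pusht", delta_timestamps=delta_timestamps)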

Collaborator:

It's arbitrary based on possible future usage.

Contributor:

Okay then nit: if it were me, I'd add Remi's statement as a comment.

Collaborator Author:

Done 0f1986e

lerobot/scripts/push_dataset_to_hub.py (resolved)
pyproject.toml (outdated, resolved)
pyproject.toml (outdated, resolved)
aliberts and others added 2 commits June 18, 2024 20:25
Co-authored-by: Alexander Soare <alexander.soare159@gmail.com>
@Cadene (Collaborator) commented on Jun 18, 2024

@aliberts Thanks for this great PR that benchmarks pyav versus video_reader, plus other much-needed additions.

Could you double check that video_reader is faster than pyav?
For load_time_factor, I think higher is better. I also got confused. I am wondering if we could find a more explicit variable name than load_time_factor and compression_factor.

@Cadene (Collaborator) left a comment:

Please ping for a second review ;) Thanks!

docker/lerobot-gpu-dev/Dockerfile (resolved)

Comment on lines +75 to +79
The backend can be either "pyav" (default) or "video_reader".
"video_reader" requires installing torchvision from source, see:
https://github.com/pytorch/vision/blob/main/torchvision/csrc/io/decoder/gpu/README.rst
(note that you need to compile against ffmpeg<4.3)

Collaborator:

Worth mentioning that we expect video_reader to be faster (or slower?) than pyav, and pointing to the benchmark README.

Collaborator Author:

Fixed bf3dbbd (I'll do a proper benchmark and link to it in a future PR)

(note that you need to compile against ffmpeg<4.3)

+ While both use cpu, "video_reader" is faster than "pyav" but requires additional setup.
+ See our benchmark results for more info on performance:
+ https://github.com/huggingface/lerobot/pull/220

+ See torchvision doc for more info on these two backends:
+ https://pytorch.org/vision/0.18/index.html?highlight=backend#torchvision.set_video_backend

Note: Video benefits from inter-frame compression. Instead of storing every frame individually,

lerobot/scripts/push_dataset_to_hub.py (resolved)
@aliberts (Collaborator Author) commented on Jun 19, 2024

> Could you double check that video_reader is faster than pyav? For load_time_factor, I think higher is better. I also got confused.

It is, lower is better. (I also was confused)
Wow, okay so it actually is higher is better. Well, that changes quite a few things 😅

> I am wondering if we could find a more explicit variable name than load_time_factor and compression_factor.

Already on it in #282; I'll go with something like avg_load_time_ms.
Generally I think it's worth keeping this file clean and functional, as this video benchmark will be useful more than once in the near future.

@Cadene (Collaborator) left a comment:

LGTM

@Cadene merged commit 2abef3b into main on Jun 19, 2024
8 checks passed
@Cadene deleted the user/aliberts/2024_05_28_compile_torchvision branch on June 19, 2024 15:15
@aliberts mentioned this pull request on Jun 24, 2024
Labels: 🗃️ Dataset (Something dataset-related), ⚙️ Infra/CI (Infra / CI-related), ⚡️ Performance (Performance-related)
Projects: Status: Done
4 participants