Enable video_reader backend #220

Merged: 26 commits merged into main from user/aliberts/2024_05_28_compile_torchvision on Jun 19, 2024

Conversation

@aliberts (Collaborator) commented on May 28, 2024

What this does

This enables torchvision's — still experimental — video_reader backend for faster video decoding.

  • In order to use it, torchvision has to be built from source using these instructions. This PR changes a few things in the dev docker image so that those instructions can be followed there.
  • Adds a video_backend option in the config and as a LeRobotDataset (and MultiLeRobotDataset) argument to select between pyav and video_reader (defaults to pyav as before); see the sketch after this list.
  • Refactors the video benchmark.
  • Adds changes from Add capture_camera_feed #267
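
For illustration, selecting the backend from Python might look like the sketch below. The import path and the repo_id are assumptions made for this example; only the video_backend argument itself is what this PR adds.

import torchvision

# Import path assumed from the repository layout at the time of this PR.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Defaults to "pyav" as before; "video_reader" requires torchvision built from source.
dataset = LeRobotDataset("lerobot/pusht", video_backend="video_reader")

# For reference, the backend name maps onto torchvision's own backend switch,
# which is also what the verification step further below uses:
torchvision.set_video_backend("video_reader")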

How it was tested

python lerobot/common/datasets/_video_benchmark/run_video_benchmark.py 

with

# Parameters (and their values) swept by the benchmark script.
BENCHMARKS = {
    "backend": ["pyav", "video_reader"],
}

Quality metrics (avg_per_pixel_l2_error, avg_psnr, avg_ssim, avg_mse) are identical between the two backends. Loading time is generally improved by a factor of ~1.5 with video_reader.

1_frame

| repo_id | image_size | backend | compression_factor | load_time_factor | avg_per_pixel_l2_error | avg_psnr | avg_ssim | avg_mse |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lerobot/pusht_image | 96 x 96 | pyav | 3.619 | 0.175 | 0.0000439 | 42.819 | 0.996 | 0.0000542 |
| lerobot/pusht_image | 96 x 96 | video_reader | 3.619 | 0.158 | 0.0000439 | 42.819 | 0.996 | 0.0000542 |
| aliberts/aloha_mobile_shrimp_image | 480 x 640 | pyav | 24.438 | 0.510 | 0.0000114 | 39.316 | 0.971 | 0.0001237 |
| aliberts/aloha_mobile_shrimp_image | 480 x 640 | video_reader | 24.438 | 0.277 | 0.0000114 | 39.316 | 0.971 | 0.0001237 |
| aliberts/paris_street | 720 x 1280 | pyav | 28.117 | 0.676 | 0.0000122 | 35.622 | 0.947 | 0.001 |
| aliberts/paris_street | 720 x 1280 | video_reader | 28.117 | 0.375 | 0.0000122 | 35.622 | 0.947 | 0.001 |
| aliberts/kitchen | 1080 x 1920 | pyav | 43.394 | 0.781 | 0.0000048 | 38.808 | 0.965 | 0.0001926 |
| aliberts/kitchen | 1080 x 1920 | video_reader | 43.394 | 0.459 | 0.0000048 | 38.808 | 0.965 | 0.0001926 |

2_frames

| repo_id | image_size | backend | compression_factor | load_time_factor | avg_per_pixel_l2_error | avg_psnr | avg_ssim | avg_mse |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lerobot/pusht_image | 96 x 96 | pyav | 3.619 | 0.476 | 0.0000492 | 42.236 | 0.996 | 0.0000821 |
| lerobot/pusht_image | 96 x 96 | video_reader | 3.619 | 0.294 | 0.0000492 | 42.236 | 0.996 | 0.0000821 |
| aliberts/aloha_mobile_shrimp_image | 480 x 640 | pyav | 24.438 | 0.820 | 0.0000121 | 39.058 | 0.970 | 0.0001577 |
| aliberts/aloha_mobile_shrimp_image | 480 x 640 | video_reader | 24.438 | 0.510 | 0.0000121 | 39.058 | 0.970 | 0.0001577 |
| aliberts/paris_street | 720 x 1280 | pyav | 28.117 | 0.955 | 0.0000124 | 35.386 | 0.946 | 0.001 |
| aliberts/paris_street | 720 x 1280 | video_reader | 28.117 | 0.619 | 0.0000124 | 35.386 | 0.946 | 0.001 |
| aliberts/kitchen | 1080 x 1920 | pyav | 43.394 | 1.103 | 0.0000053 | 38.502 | 0.964 | 0.0002850 |
| aliberts/kitchen | 1080 x 1920 | video_reader | 43.394 | 0.763 | 0.0000053 | 38.502 | 0.964 | 0.0002850 |

2_frames_4_space

| repo_id | image_size | backend | compression_factor | load_time_factor | avg_per_pixel_l2_error | avg_psnr | avg_ssim | avg_mse |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lerobot/pusht_image | 96 x 96 | pyav | 3.619 | 0.373 | 0.0000552 | 41.785 | 0.995 | 0.0001220 |
| lerobot/pusht_image | 96 x 96 | video_reader | 3.619 | 0.267 | 0.0000552 | 41.785 | 0.995 | 0.0001220 |
| aliberts/aloha_mobile_shrimp_image | 480 x 640 | pyav | 24.438 | 0.522 | 0.0000115 | 39.246 | 0.971 | 0.0001252 |
| aliberts/aloha_mobile_shrimp_image | 480 x 640 | video_reader | 24.438 | 0.394 | 0.0000115 | 39.246 | 0.971 | 0.0001252 |
| aliberts/paris_street | 720 x 1280 | pyav | 28.117 | 0.575 | 0.0000182 | 34.399 | 0.917 | 0.004 |
| aliberts/paris_street | 720 x 1280 | video_reader | 28.117 | 0.443 | 0.0000182 | 34.399 | 0.917 | 0.004 |
| aliberts/kitchen | 1080 x 1920 | pyav | 43.394 | 0.669 | 0.0000056 | 38.204 | 0.964 | 0.0003120 |
| aliberts/kitchen | 1080 x 1920 | video_reader | 43.394 | 0.538 | 0.0000056 | 38.204 | 0.964 | 0.0003120 |

6_frames

| repo_id | image_size | backend | compression_factor | load_time_factor | avg_per_pixel_l2_error | avg_psnr | avg_ssim | avg_mse |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lerobot/pusht_image | 96 x 96 | pyav | 3.619 | 0.895 | 0.0000538 | 41.820 | 0.995 | 0.0001097 |
| lerobot/pusht_image | 96 x 96 | video_reader | 3.619 | 0.700 | 0.0000538 | 41.820 | 0.995 | 0.0001097 |
| aliberts/aloha_mobile_shrimp_image | 480 x 640 | pyav | 24.438 | 1.115 | 0.0000124 | 38.940 | 0.969 | 0.0001784 |
| aliberts/aloha_mobile_shrimp_image | 480 x 640 | video_reader | 24.438 | 0.834 | 0.0000124 | 38.940 | 0.969 | 0.0001784 |
| aliberts/paris_street | 720 x 1280 | pyav | 28.117 | 1.344 | 0.0000164 | 34.585 | 0.927 | 0.003 |
| aliberts/paris_street | 720 x 1280 | video_reader | 28.117 | 1.050 | 0.0000164 | 34.585 | 0.927 | 0.003 |
| aliberts/kitchen | 1080 x 1920 | pyav | 43.394 | 1.543 | 0.0000061 | 37.760 | 0.963 | 0.0004247 |
| aliberts/kitchen | 1080 x 1920 | video_reader | 43.394 | 1.308 | 0.0000061 | 37.760 | 0.963 | 0.0004247 |

I also did a run to reproduce the pretrained ACT policy on the aloha transfer-cube task with:

python lerobot/scripts/train.py \
  hydra.job.name=act_aloha_sim_transfer_cube_human \
  hydra.run.dir=outputs/train/act_aloha_sim_transfer_cube_human \
  policy=act \
  policy.use_vae=true \
  env=aloha \
  env.task=AlohaTransferCube-v0 \
  dataset_repo_id=lerobot/aloha_sim_transfer_cube_human \
  training.eval_freq=10000 \
  training.log_freq=250 \
  training.offline_steps=100000 \
  training.save_checkpoint=true \
  training.save_freq=25000 \
  eval.n_episodes=50 \
  eval.batch_size=50 \
  wandb.enable=true \
  device=cuda \
  video_backend=video_reader

The full run is available on wandb here


How to checkout & try? (for the reviewer)

You first need to compile torchvision from source. The original instructions are available there, but I'll recap them here since some of them are not up to date:

  • Download the latest nvidia-video-codec-sdk and extract the zipped file.
  • Set these environment variables, needed for building torchvision against the SDK:
    • TORCHVISION_INCLUDE: the location of the video codec headers (nvcuvid.h and cuviddec.h), found under the Interface directory.
    • TORCHVISION_LIBRARY: the location of the video codec library (libnvcuvid.so), found under the Lib/linux/stubs/x86_64 directory.
    • CUDA_HOME: the CUDA root directory.

You can set all of these with the following command (assuming you unzipped the codec SDK in $HOME and your CUDA installation is at /usr/local/cuda, which is generally where it is):

export TORCHVISION_INCLUDE=$HOME/Video_Codec_SDK_12.2.72/Interface/ \
    && export TORCHVISION_LIBRARY=$HOME/Video_Codec_SDK_12.2.72/Lib/linux/stubs/x86_64/ \
    && export CUDA_HOME=/usr/local/cuda
  • Install ffmpeg<4.3 from the conda-forge channel:
conda install -c conda-forge "ffmpeg<4.3"
  • Git-clone torchvision and make the following change in your pyproject.toml so that it points to your local clone:
- torchvision = ">=0.17.1"
+ torchvision = { path = "../../path/to/vision" }
  • Run the following to update your dependencies and build torchvision
poetry lock --no-cache --no-update && poetry install --sync --all-extras

If you don't use poetry, you can simply do this instead (add the extras you need)

pip install .
  • Ensure that video_reader works with this (it shouldn't raise any error):
python -c "import torchvision; torchvision.set_video_backend('video_reader');"

Try any run with the option video_backend=video_reader (it will default to pyav if not specified), e.g.

python lerobot/scripts/train.py \
    policy=act \
    policy.dim_model=64 \
    env=aloha \
    wandb.enable=False \
    training.offline_steps=2 \
    training.online_steps=0 \
    eval.n_episodes=1 \
    eval.batch_size=1 \
    device=cpu \
    training.save_checkpoint=true \
    training.save_freq=2 \
    policy.n_action_steps=20 \
    policy.chunk_size=20 \
    training.batch_size=2 \
    hydra.run.dir=tests/outputs/act/ \
    video_backend=video_reader


@aliberts added the 🗃️ Dataset and ⚙️ Infra/CI labels on May 28, 2024
@aliberts self-assigned this on May 28, 2024
@aliberts marked this pull request as ready for review on May 30, 2024 17:42
@aliberts added the ⚡️ Performance label on May 30, 2024
@aliberts force-pushed the user/aliberts/2024_05_28_compile_torchvision branch from 794b413 to ae0d5c9 on June 14, 2024 08:49
These parameters and their values are specified in the BENCHMARKS dict.

All of these benchmarks are evaluated within different timestamps modes corresponding to different frame-loading scenarios:
- `1_frame`: 1 single frame is loaded.
Contributor:

Could you explain what you are trying to achieve with these variations? I'd want to know why I shouldn't add 2_frames_6_spaces. I'd want to know why 6_frames tests something fundamentally different to 2_frames and why you didn't also do 20_frames.

Collaborator Author:

These were already present in the script and were done by Rémi; I just added a bit of documentation.

I think the idea is to have different common scenarios that can reflect a typical workload during training (e.g. with delta_timestamps), but I can't answer as to why these values specifically.
@Cadene care to shed some light?
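
(For illustration only: a workload like the 2_frames_4_space mode could be requested with delta_timestamps roughly as in the sketch below; the import path, key name, repo_id, and fps are hypothetical.)

# Import path assumed, as in the earlier sketch.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Hypothetical example: at 10 fps, a spacing of 4 frames is 0.4 s, so asking for the
# current frame plus one frame 0.4 s earlier mirrors the 2_frames_4_space scenario.
delta_timestamps = {"observation.image": [-0.4, 0.0]}
dataset = LeRobotDataset("lerobot/pusht", delta_timestamps=delta_timestamps)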

Collaborator:

It's arbitrary based on possible future usage.

Contributor:

Okay then nit: if it were me, I'd add Remi's statement as a comment.

Collaborator Author:

Done 0f1986e

lerobot/scripts/push_dataset_to_hub.py (resolved)
pyproject.toml (outdated, resolved)
pyproject.toml (outdated, resolved)
aliberts and others added 2 commits June 18, 2024 20:25
Co-authored-by: Alexander Soare <alexander.soare159@gmail.com>
@Cadene (Collaborator) commented on Jun 18, 2024

@aliberts Thanks for this great PR that benchmarks pyav versus video_reader, plus other much-needed additions.

Could you double check that video_reader is faster than pyav?
For load_time_factor, I think higher is better. I also got confused. I am wondering if we could find a more explicit variable name than load_time_factor and compression_factor.

@Cadene (Collaborator) left a comment:

Please ping for a second review ;) Thanks!

docker/lerobot-gpu-dev/Dockerfile (resolved)

Comment on lines +75 to +79
The backend can be either "pyav" (default) or "video_reader".
"video_reader" requires installing torchvision from source, see:
https://github.com/pytorch/vision/blob/main/torchvision/csrc/io/decoder/gpu/README.rst
(note that you need to compile against ffmpeg<4.3)

Collaborator:

Worth mentioning that we expect video_reader to be faster (or slower?) than pyav, and pointing to the benchmark README.

Collaborator Author:

Fixed bf3dbbd (I'll do a proper benchmark and link to it in a future PR)

(note that you need to compile against ffmpeg<4.3)

+ While both use cpu, "video_reader" is faster than "pyav" but requires additional setup.
+ See our benchmark results for more info on performance:
+ https://github.com/huggingface/lerobot/pull/220

+ See torchvision doc for more info on these two backends:
+ https://pytorch.org/vision/0.18/index.html?highlight=backend#torchvision.set_video_backend

Note: Video benefits from inter-frame compression. Instead of storing every frame individually,

lerobot/scripts/push_dataset_to_hub.py (resolved)
@aliberts (Collaborator Author) commented on Jun 19, 2024

> Could you double check that video_reader is faster than pyav? For load_time_factor, I think higher is better. I also got confused.

It is, lower is better. (I also was confused)
Wow, okay so it actually is higher is better. Well, that changes quite a few things 😅

> I am wondering if we could find a more explicit variable name than load_time_factor and compression_factor.

Already on it in #282; I'll go with something like avg_load_time_ms.
Generally I think it's worth keeping this file clean and functional, as this video benchmark will be useful more than once in the near future.

@Cadene (Collaborator) left a comment:

LGTM

@Cadene merged commit 2abef3b into main on Jun 19, 2024
8 checks passed
@Cadene deleted the user/aliberts/2024_05_28_compile_torchvision branch on June 19, 2024 15:15
@aliberts mentioned this pull request on Jun 24, 2024
Labels: 🗃️ Dataset (Something dataset-related), ⚙️ Infra/CI (Infra / CI-related), ⚡️ Performance (Performance-related)
Projects: Status: Done
4 participants