[Feature] Checkpointing #549

Merged
merged 78 commits into from
Oct 28, 2022
Changes from 64 commits
Commits
f981c46
init
vmoens Oct 4, 2022
db67060
Update HABITAT.md
vmoens Oct 4, 2022
db6dde3
init
vmoens Oct 11, 2022
0cce72b
[Feature] checkpointing RewardNormalization (#550)
vmoens Oct 11, 2022
8645e42
amend
vmoens Oct 11, 2022
09c033c
amend
vmoens Oct 11, 2022
a768503
Merge branch 'main' into checkpointing
vmoens Oct 11, 2022
2e117e0
[Feature] checkpointing ReplayBuffer (#553)
vmoens Oct 11, 2022
9af1759
[Feature] checkpointing SelectKeys (#551)
vmoens Oct 11, 2022
f130d02
[Feature] checkpointing SubSampler (#555)
vmoens Oct 11, 2022
8096f7d
Merge branch 'main' into checkpointing
vmoens Oct 12, 2022
0e0c563
Merge branch 'main' into checkpointing
vmoens Oct 19, 2022
05c3669
tests
vmoens Oct 21, 2022
4e25b39
Merge branch 'main' into habitat
vmoens Oct 21, 2022
6b25fdf
Merge branch 'habitat' of github.com:pytorch/rl into habitat
vmoens Oct 21, 2022
5664651
typo
vmoens Oct 21, 2022
64fa046
permission
vmoens Oct 21, 2022
f5d61cf
pip install
vmoens Oct 21, 2022
69c5522
egl
vmoens Oct 21, 2022
30d06b2
egl
vmoens Oct 21, 2022
378051d
habitat.utils.gym_definitions
vmoens Oct 21, 2022
1254345
time to impact: 5 mins
vmoens Oct 21, 2022
752195d
hope
vmoens Oct 21, 2022
33c4842
amend
vmoens Oct 21, 2022
f98ca00
amend
vmoens Oct 21, 2022
a58b3c4
LD_PRELOAD
vmoens Oct 21, 2022
72d3aeb
amend
vmoens Oct 23, 2022
3792dc3
amend
vmoens Oct 23, 2022
bcb2348
amend
vmoens Oct 23, 2022
b6e5f93
amend
vmoens Oct 23, 2022
98639e6
amend
vmoens Oct 23, 2022
3083356
amend
vmoens Oct 23, 2022
7829d6f
amend
vmoens Oct 23, 2022
a2ea7c1
revert yum
vmoens Oct 23, 2022
6d74518
amend
vmoens Oct 23, 2022
4486051
amend
vmoens Oct 23, 2022
ae28c3e
amend
vmoens Oct 23, 2022
68b3855
amend
vmoens Oct 24, 2022
0638fdf
amend
vmoens Oct 24, 2022
63ade6d
LD_LIBRARY_PATH
vmoens Oct 24, 2022
fefb8c7
amend
vmoens Oct 24, 2022
46c8359
amend
vmoens Oct 24, 2022
b327dc8
amend
vmoens Oct 24, 2022
292d554
amend
vmoens Oct 24, 2022
ad43e03
amend
vmoens Oct 24, 2022
6d17ff5
amend
vmoens Oct 24, 2022
45d99b3
amend
vmoens Oct 24, 2022
5fefa3f
amend
vmoens Oct 24, 2022
4e4ec61
amend
vmoens Oct 24, 2022
68ccdf1
amend
vmoens Oct 24, 2022
b2812ff
amend
vmoens Oct 24, 2022
0b0d6d7
amend
vmoens Oct 24, 2022
9acefdc
amend
vmoens Oct 24, 2022
8bee40a
amend
vmoens Oct 24, 2022
ffd8d41
amend
vmoens Oct 24, 2022
ee066b7
amend
vmoens Oct 24, 2022
e4b0e3e
amend
vmoens Oct 24, 2022
a65c78a
amend
vmoens Oct 24, 2022
205e035
amend
vmoens Oct 24, 2022
3fd44b3
amend
vmoens Oct 24, 2022
3c916be
amend
vmoens Oct 24, 2022
9127f1c
amend
vmoens Oct 24, 2022
bd23d2c
amend
vmoens Oct 25, 2022
1e82714
amend
vmoens Oct 25, 2022
49ec01b
--force-reinstall
vmoens Oct 25, 2022
2930abc
amend
vmoens Oct 25, 2022
ccd41a0
amend
vmoens Oct 25, 2022
3fa1440
removing habitat baselines
vmoens Oct 25, 2022
238403f
Merge branch 'main' into checkpointing
vmoens Oct 27, 2022
b5d295d
amend
vmoens Oct 28, 2022
fbdf18f
VP suggestions
vmoens Oct 28, 2022
7d7fa76
Merge remote-tracking branch 'origin/main' into checkpointing
vmoens Oct 28, 2022
9ba0873
Checkpointing -- TorchSnapshot (#577)
vmoens Oct 28, 2022
b906b89
Merge branch 'main' into habitat
vmoens Oct 28, 2022
704ad06
amend
vmoens Oct 28, 2022
92fed9a
amend
vmoens Oct 28, 2022
1a5d6c3
Merge branch 'habitat' into checkpointing
vmoens Oct 28, 2022
0ef22d0
Merge branch 'main' into checkpointing
vmoens Oct 28, 2022
135 changes: 68 additions & 67 deletions .circleci/config.yml
@@ -347,6 +347,61 @@ jobs:
- store_test_results:
path: test-results

unittest_linux_habitat_gpu:
<<: *binary_common
machine:
image: ubuntu-2004-cuda-11.4:202110-01
resource_class: gpu.nvidia.medium
environment:
image_name: "nvidia/cudagl:11.4.0-base"
TAR_OPTIONS: --no-same-owner
PYTHON_VERSION: << parameters.python_version >>
CU_VERSION: << parameters.cu_version >>

steps:
- checkout
- designate_upload_channel
- run:
name: Generate cache key
# This refreshes the cache on Sundays; the nightly build should then generate a new cache.
command: echo "$(date +"%Y-%U")" > .circleci-weekly
- restore_cache:
keys:
- env-v3-linux-{{ arch }}-py<< parameters.python_version >>-{{ checksum ".circleci/unittest/linux_libs/scripts_habitat/environment.yml" }}-{{ checksum ".circleci-weekly" }}
- run:
name: Setup
command: docker run -e PYTHON_VERSION -t --gpus all -v $PWD:$PWD -w $PWD "${image_name}" .circleci/unittest/linux_libs/scripts_habitat/setup_env.sh
- save_cache:
key: env-v3-linux-{{ arch }}-py<< parameters.python_version >>-{{ checksum ".circleci/unittest/linux_libs/scripts_habitat/environment.yml" }}-{{ checksum ".circleci-weekly" }}
paths:
- conda
- env
- run:
# Here we create an envlist file that contains some env variables that we want the docker container to be aware of.
# Normally, the CIRCLECI variable is set and available on all CI workflows: https://circleci.com/docs/2.0/env-vars/#built-in-environment-variables.
# They're available in all the other workflows (OSX and Windows).
# But here, we're running the unittest_linux_gpu workflows in a docker container, where those variables aren't accessible.
# So instead we dump the variables we need in env.list and we pass that file when invoking "docker run".
name: export CIRCLECI env var
command: echo "CIRCLECI=true" >> ./env.list
- run:
name: Install torchrl
command: docker run -e PYTHON_VERSION -t --gpus all -v $PWD:$PWD -w $PWD "${image_name}" .circleci/unittest/linux_libs/scripts_habitat/install.sh
- run:
name: Run tests
command: docker run --env-file ./env.list -t --gpus all -v $PWD:$PWD -w $PWD "${image_name}" .circleci/unittest/linux_libs/scripts_habitat/run_test.sh
- run:
name: Codecov upload
command: |
bash <(curl -s https://codecov.io/bash) -Z -F habitat-gpu
- run:
name: Post Process
command: docker run -t --gpus all -v $PWD:$PWD -w $PWD "${image_name}" .circleci/unittest/linux_libs/scripts_habitat/post_process.sh
- store_test_results:
path: test-results
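
The `Generate cache key` step in this job writes `date +"%Y-%U"` into `.circleci-weekly`, and the checksum of that file is part of the cache key. A minimal sketch of why this rolls the cache over once a week, assuming GNU `date` (the `-d` flag is a GNU extension, used here only to pin the dates). `weekly_key` is a hypothetical helper name and is not part of this PR:

```shell
#!/usr/bin/env bash
# Hypothetical helper: compute the same "%Y-%U" stamp the CI step writes,
# but for an arbitrary date so the behaviour can be inspected offline.
# Assumes GNU date (-d is not available in BSD/macOS date).
weekly_key() {
    date -d "$1" +"%Y-%U"
}

# %U counts weeks starting on Sunday, so a Saturday and the following Sunday
# get different stamps, while Sunday through Saturday share one stamp.
sat="$(weekly_key 2022-10-22)"
sun="$(weekly_key 2022-10-23)"
fri="$(weekly_key 2022-10-28)"

echo "Sat 2022-10-22 -> $sat"
echo "Sun 2022-10-23 -> $sun"
echo "Fri 2022-10-28 -> $fri"
```

Because the stamp only changes on Sundays, every build within the same week computes the same checksum and can restore the same cached `conda` and `env` directories.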

unittest_linux_optdeps_gpu:
<<: *binary_common
machine:
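
The comments in the habitat job describe dumping the needed variables into `env.list` and passing that file to `docker run --env-file`. A rough sketch of that pattern extended to several variables. The `write_env_list` helper and the temporary file are illustrative assumptions, and the `docker run` line is left as a comment because it needs a Docker host:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: append NAME=value lines for each named variable
# that is set in the current shell. Unset variables are skipped.
write_env_list() {
    local file="$1"; shift
    local name
    for name in "$@"; do
        if [[ -v $name ]]; then
            printf '%s=%s\n' "$name" "${!name}" >> "$file"
        fi
    done
}

env_file="$(mktemp)"
CIRCLECI=true
PYTHON_VERSION=3.8
write_env_list "$env_file" CIRCLECI PYTHON_VERSION NOT_SET_ANYWHERE
cat "$env_file"

# The file would then be handed to the container, as the job does:
# docker run --env-file "$env_file" -t --gpus all -v "$PWD:$PWD" -w "$PWD" "${image_name}" <script>
```

Values in an env file are taken literally, so they are written without quotes.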
@@ -546,20 +601,17 @@ jobs:
- conda
- env
- run:
name: Install torchrl, run tests, upload codecov
name: Install torchrl, run tests
command: |
docker run -t --env=CUDA_VISIBLE_DEVICES="" --gpus all -v $PWD:$PWD -w $PWD -e UPLOAD_CHANNEL -e CU_VERSION "${image_name}" .circleci/unittest/linux_olddeps/scripts_gym_0_13/batch_scripts.sh
# docker run -t --gpus all -v $PWD:$PWD -w $PWD -e UPLOAD_CHANNEL -e CU_VERSION "${image_name}" .circleci/unittest/linux_olddeps/scripts_gym_0_13/batch_scripts.sh
# - run:
# name: Run tests
# command: docker run -t --gpus all -v $PWD:$PWD -w $PWD -e UPLOAD_CHANNEL -e CU_VERSION "${image_name}" .circleci/unittest/linux_olddeps/scripts_gym_0_13/run_test.sh
# - run:
# name: Codecov upload
# command: |
# docker run -t --gpus all -v $PWD:$PWD -w $PWD -e UPLOAD_CHANNEL -e CU_VERSION "${image_name}" <(curl -s https://codecov.io/bash) -Z -F linux-stable-cpu
# - run:
# name: Post process
# command: docker run -t --gpus all -v $PWD:$PWD -w $PWD -e UPLOAD_CHANNEL -e CU_VERSION "${image_name}" .circleci/unittest/linux_olddeps/scripts_gym_0_13/post_process.sh
- run:
name: Codecov upload
command: |
bash <(curl -s https://codecov.io/bash) -Z -F olddeps-gpu
- run:
name: Post process
command: docker run -t --gpus all -v $PWD:$PWD -w $PWD -e UPLOAD_CHANNEL -e CU_VERSION "${image_name}" .circleci/unittest/linux_olddeps/scripts_gym_0_13/post_process.sh
- store_test_results:
path: test-results
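
The `Codecov upload` step in this hunk runs the uploader on the host with `bash <(curl -s https://codecov.io/bash) -Z -F olddeps-gpu` rather than inside the container. A small sketch of how the arguments after the process substitution reach the downloaded script. `fake_uploader` is a local stand-in invented for illustration, so nothing is fetched from the network:

```shell
#!/usr/bin/env bash
# Stand-in for `curl -s https://codecov.io/bash`: emits a tiny script on stdout.
# The real uploader is not fetched here; this only illustrates the mechanics.
fake_uploader() {
    printf '%s\n' 'echo "uploader called with: $*"'
}

# bash reads the script from the process-substitution file descriptor;
# everything after it is passed to the script as positional arguments.
output="$(bash <(fake_uploader) -Z -F olddeps-gpu)"
echo "$output"
```

Process substitution is a bash feature, which is why the step invokes `bash` explicitly instead of relying on `sh`.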

@@ -649,62 +701,6 @@ workflows:
python_version: '3.10'
wheel_docker_image: pytorch/manylinux-cuda102

# - binary_linux_wheel:
# conda_docker_image: pytorch/conda-builder:cuda102
# cu_version: cu102
# name: binary_linux_wheel_py3.7_cu102
# python_version: '3.7'
# wheel_docker_image: pytorch/manylinux-cuda102
#
# - binary_linux_wheel:
# conda_docker_image: pytorch/conda-builder:cuda102
# cu_version: cu102
# name: binary_linux_wheel_py3.8_cu102
# python_version: '3.8'
# wheel_docker_image: pytorch/manylinux-cuda102
#
# - binary_linux_wheel:
# conda_docker_image: pytorch/conda-builder:cuda102
# cu_version: cu102
# name: binary_linux_wheel_py3.9_cu102
# python_version: '3.9'
# wheel_docker_image: pytorch/manylinux-cuda102
#
# - binary_linux_wheel:
# conda_docker_image: pytorch/conda-builder:cuda102
# cu_version: cu102
# name: binary_linux_wheel_py3.10_cu102
# python_version: '3.10'
# wheel_docker_image: pytorch/manylinux-cuda102

# - binary_linux_wheel:
# conda_docker_image: pytorch/conda-builder:cuda113
# cu_version: cu113
# name: binary_linux_wheel_py3.7_cu113
# python_version: '3.7'
# wheel_docker_image: pytorch/manylinux-cuda113
#
# - binary_linux_wheel:
# conda_docker_image: pytorch/conda-builder:cuda113
# cu_version: cu113
# name: binary_linux_wheel_py3.8_cu113
# python_version: '3.8'
# wheel_docker_image: pytorch/manylinux-cuda113
#
# - binary_linux_wheel:
# conda_docker_image: pytorch/conda-builder:cuda113
# cu_version: cu113
# name: binary_linux_wheel_py3.9_cu113
# python_version: '3.9'
# wheel_docker_image: pytorch/manylinux-cuda113
#
# - binary_linux_wheel:
# conda_docker_image: pytorch/conda-builder:cuda113
# cu_version: cu113
# name: binary_linux_wheel_py3.10_cu113
# python_version: '3.10'
# wheel_docker_image: pytorch/manylinux-cuda113

- binary_macos_wheel:
conda_docker_image: pytorch/conda-builder:cpu
cu_version: cpu
@@ -784,6 +780,11 @@ workflows:
cu_version: cu113
name: unittest_linux_stable_gpu_py3.8
python_version: '3.8'
# we test supported libs for 3.8 only
- unittest_linux_habitat_gpu:
cu_version: cu113
name: unittest_linux_habitat_gpu_py3.8
python_version: '3.8'

- unittest_macos_cpu:
cu_version: cpu
17 changes: 17 additions & 0 deletions .circleci/unittest/linux_libs/scripts_habitat/environment.yml
@@ -0,0 +1,17 @@
channels:
- pytorch
- defaults
dependencies:
- pip
- pip:
- hypothesis
- future
- cloudpickle
- pytest
- pytest-cov
- pytest-mock
- pytest-instafail
- expecttest
- pyyaml
- scipy
- hydra-core
47 changes: 47 additions & 0 deletions .circleci/unittest/linux_libs/scripts_habitat/install.sh
@@ -0,0 +1,47 @@
#!/usr/bin/env bash

unset PYTORCH_VERSION
# For unit tests, nightly PyTorch is installed in the section below,
# so there is no need to set PYTORCH_VERSION.
# In fact, keeping PYTORCH_VERSION would force us to hardcode the PyTorch version in the config.
apt-get update && apt-get install -y git wget gcc g++

set -e

eval "$(./conda/bin/conda shell.bash hook)"
conda activate ./env

if [ "${CU_VERSION:-}" == cpu ] ; then
version="cpu"
else
if [[ ${#CU_VERSION} -eq 4 ]]; then
CUDA_VERSION="${CU_VERSION:2:1}.${CU_VERSION:3:1}"
elif [[ ${#CU_VERSION} -eq 5 ]]; then
CUDA_VERSION="${CU_VERSION:2:2}.${CU_VERSION:4:1}"
fi
echo "Using CUDA $CUDA_VERSION as determined by CU_VERSION ($CU_VERSION)"
version="$(python -c "print('.'.join(\"${CUDA_VERSION}\".split('.')[:2]))")"
fi


# submodules
git submodule sync && git submodule update --init --recursive

printf "Installing PyTorch with %s\n" "${CU_VERSION}"
if [ "${CU_VERSION:-}" == cpu ] ; then
# conda install -y pytorch torchvision cpuonly -c pytorch-nightly
# use pip to install pytorch, as conda frequently picks an older release
# conda install -y pytorch cpuonly -c pytorch-nightly
pip3 install --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cpu --force-reinstall
else
pip3 install --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cu116 --force-reinstall
fi

# smoke test
python -c "import functorch"

printf "* Installing torchrl\n"
pip3 install -e .

# smoke test
python -c "import torchrl"
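
The script above turns `CU_VERSION` into a dotted CUDA version by slicing on string length. In the script as shown, the result is only echoed, since the `pip3 install` line uses a fixed `cu116` index. A minimal sketch of that slicing as a reusable function. `cuda_version_from_tag` is a hypothetical name, and the explicit failure on unexpected input is an addition that the script does not have:

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the length-based slicing in install.sh:
#   cu92  (4 chars) -> 9.2
#   cu113 (5 chars) -> 11.3
# Unlike the script, it returns a non-zero status for unexpected input
# instead of leaving the version unset.
cuda_version_from_tag() {
    local tag="$1"
    case "${#tag}" in
        4) printf '%s.%s\n' "${tag:2:1}" "${tag:3:1}" ;;
        5) printf '%s.%s\n' "${tag:2:2}" "${tag:4:1}" ;;
        *) return 1 ;;
    esac
}

for tag in cu92 cu113 cu116; do
    echo "$tag -> $(cuda_version_from_tag "$tag")"
done
```

The `cpu` case is handled separately in the script, so the function treats it like any other unexpected tag and fails.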
6 changes: 6 additions & 0 deletions .circleci/unittest/linux_libs/scripts_habitat/post_process.sh
@@ -0,0 +1,6 @@
#!/usr/bin/env bash

set -e

eval "$(./conda/bin/conda shell.bash hook)"
conda activate ./env