[Feature] Distributed data collector (ray) #930
Conversation
# Broadcast agent weights
self.local_collector().update_policy_weights_()
state_dict = {"policy_state_dict": self._local_collector.policy.state_dict()}
state_dict = ray.put(state_dict)
Q: in regular data collectors we leave this to the user, or ask explicitly whether the update should be done at each iteration.
Do we want to do the same here, i.e., isolate the weight update in a dedicated method that may or may not be called at each iteration?
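One way the suggestion could look, as a minimal sketch (the class name, the `update_after_each_iter` flag, and the counter are illustrative assumptions, not the actual TorchRL implementation; `ray.put` is replaced by a counter so the sketch runs standalone):

```python
# Hypothetical sketch: isolate the weight broadcast in a dedicated method,
# with an opt-in flag to call it at every iteration. Not the real API.

class DistributedCollectorSketch:
    def __init__(self, update_after_each_iter=False):
        self.update_after_each_iter = update_after_each_iter
        self.broadcast_count = 0  # track how often weights were pushed

    def update_policy_weights_(self):
        # In the real collector this would ray.put() the policy
        # state_dict and ship it to the remote collectors.
        self.broadcast_count += 1

    def iterate(self, n_iters):
        for _ in range(n_iters):
            # ... collect a batch from the remote workers ...
            if self.update_after_each_iter:
                self.update_policy_weights_()

# Opt-in: weights broadcast automatically at every iteration.
auto = DistributedCollectorSketch(update_after_each_iter=True)
auto.iterate(3)
print(auto.broadcast_count)  # 3

# Manual control: the user calls update_policy_weights_() themselves.
manual = DistributedCollectorSketch()
manual.iterate(3)
print(manual.broadcast_count)  # 0
```

This mirrors how the regular collectors leave the choice to the user while still allowing a per-iteration update.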
samples_ready = []
while len(samples_ready) < self.num_collectors:
    samples_ready, samples_not_ready = ray.wait(
        pending_samples, num_returns=len(pending_samples), timeout=0.001)
What's the role of timeout here? Is the value standard? Should it be a hyperparameter or a global variable?
From the Ray docs: if timeout is set, ray.wait returns either when the requested number of IDs is ready or when the timeout is reached, whichever occurs first. If it is not set, the function simply waits until that number of objects is ready and returns exactly that many object refs. The default value is None.
For now we can just set timeout=None, but if we want to implement the preemption mechanism we will need to check regularly how many of the tasks have finished. We can either set it to a reasonable value or allow users to define it.
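The preemption-style polling pattern described here can be sketched with the standard library's `concurrent.futures.wait` as a stand-in for `ray.wait` (same `(done, not_done)` return shape and `timeout` semantics); the helper name and the short poll value are illustrative assumptions:

```python
# Sketch: poll with a short, user-configurable timeout and act as soon
# as enough tasks are done, instead of blocking until all complete.
import concurrent.futures
import time

def collect(i):
    time.sleep(0.01 * i)  # simulate collectors finishing at different times
    return i

def wait_for_quorum(futures, quorum, poll_timeout=0.001):
    """Return once `quorum` futures are done, polling every `poll_timeout` s."""
    done = set()
    while len(done) < quorum:
        done, not_done = concurrent.futures.wait(futures, timeout=poll_timeout)
        # With a preemption mechanism, this is where the slowest
        # `not_done` tasks could be cancelled once the quorum is met.
    return done, not_done

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(collect, i) for i in range(4)]
    done, not_done = wait_for_quorum(futures, quorum=4)
print(len(done))  # 4
```

Exposing `poll_timeout` as a constructor argument (with a sane default) would let users tune the trade-off between responsiveness and polling overhead.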
Co-authored-by: Vincent Moens <vincentmoens@gmail.com>
Sorry for the late reply.
I can't spot it: what is it that you changed? Also, why aren't the tests following the same pattern as the other distributed collectors? I think it's fine if there is a good reason.
LGTM, see my comment in the PR
Awesome!
Lovely! You went the extra length with this!
Description
This PR adds a new class, DistributedCollector, which enables distributed data collection with TorchRL collectors using Ray.
@vmoens This approach required adding a new method to the collectors: one that takes a single step of the iterator. I have done it for SyncDataCollector. Do you think that is fine?
The PR also includes an example of distributed data collection for the SyncDataCollector case.
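The "single step of the iterator" idea can be sketched as follows; the class and its `next()` method are a simplified stand-in for the actual SyncDataCollector change, with plain lists in place of TensorDicts:

```python
# Hypothetical sketch: a collector wrapping its own iterator once and
# exposing a method that returns one batch per call, so a remote Ray
# actor can be asked for exactly one collection step at a time.

class SyncCollectorSketch:
    def __init__(self, frames_per_batch, total_frames):
        self.frames_per_batch = frames_per_batch
        self.total_frames = total_frames
        self._iterator = iter(self._collect())

    def _collect(self):
        # Generator over batches; in TorchRL each batch would be a
        # TensorDict of environment frames rather than a list of ints.
        collected = 0
        while collected < self.total_frames:
            batch = list(range(collected, collected + self.frames_per_batch))
            collected += self.frames_per_batch
            yield batch

    def next(self):
        """Take a single step of the iterator and return one batch."""
        return next(self._iterator)

collector = SyncCollectorSketch(frames_per_batch=4, total_frames=12)
print(collector.next())  # [0, 1, 2, 3]
print(collector.next())  # [4, 5, 6, 7]
```

Keeping a single persistent iterator (rather than re-creating it per call) preserves the collector's internal state between remote calls, which is what makes stepwise remote collection possible.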
Types of changes
What types of changes does your code introduce? Remove all that do not apply:
Checklist
Go over all the following points, and put an x in all the boxes that apply. If you are unsure about any of these, don't hesitate to ask. We are here to help!