[Tutorial] PettingZoo Parallel competitive tutorial #2047

Merged
merged 31 commits into pytorch:main from pettingzoo_tuto on Apr 24, 2024

Conversation

matteobettini
Contributor

@matteobettini matteobettini commented Mar 29, 2024

PettingZoo tutorial

So the idea is that we need a tutorial to show people how to do competitive MARL in TorchRL (also showing how to use PettingZoo parallel environments).

I would like it to be complementary to the multi-agent PPO tutorial (https://pytorch.org/rl/tutorials/multiagent_ppo.html) in VMAS, so:

  • as the algorithm I would use MADDPG (to show an off-policy algo)
  • as the env I was thinking of the MPE environments, particularly "simple_tag", so we can show training of two competitive agent groups (see the env-setup sketch below)

This would however be a synchronous environment, so the tutorial would NOT SHOW:

  • turn-based environments
  • dead agents
  • action masking

It would SHOW:

  • parallel petting zoo environments
  • competitive MARL
  • training multiple MARL groups in PettingZoo and VMAS
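
For concreteness, a minimal sketch of the kind of env setup this would cover (assuming TorchRL's PettingZooEnv wrapper and the PettingZoo MPE task name "simple_tag_v3"; not the final tutorial code):

    from torchrl.envs.libs.pettingzoo import PettingZooEnv

    # Parallel (simultaneous-step) MPE env with two competing groups:
    # chasers ("adversary") and evaders ("agent"). Requires pettingzoo[mpe].
    env = PettingZooEnv(task="simple_tag_v3", parallel=True)  # task name assumed
    print(env.group_map)  # how agents are grouped into competitive teams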

pytorch-bot bot commented Mar 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/2047

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 16 Unrelated Failures

As of commit 7b5785d with merge base 09c934d:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Mar 29, 2024
@matteobettini matteobettini marked this pull request as draft March 29, 2024 08:18
@matteobettini matteobettini changed the title [Tutorial] PettingZoo Parallel competitive environment [Tutorial] PettingZoo Parallel competitive tutorial Mar 29, 2024
Contributor Author

@vmoens can you upload on S3 and give the link pls?

Contributor

unfortunately no, not atm :(

@vmoens
Contributor

vmoens commented Mar 29, 2024

I'd rather have them as a separate PR if possible

@matteobettini
Contributor Author

I'd rather have them as a separate PR if possible

Moved the changes to #2048

Contributor Author

also this needs S3

@matteobettini matteobettini marked this pull request as ready for review April 2, 2024 16:54
@matteobettini
Contributor Author

@vmoens this is ready for review

@matteobettini
Contributor Author

matteobettini commented Apr 2, 2024

@elliottower @coderSupreme789 here is the first draft of the tuto in case you are curious and want to have a look.

@vmoens vmoens added the documentation (improvements or additions to documentation) and tutorials labels Apr 3, 2024
# Conflicts:
#	torchrl/envs/libs/pettingzoo.py
@matteobettini
Contributor Author

matteobettini commented Apr 22, 2024

Thanks for this, very complete!

I'm wondering if it isn't a bit too long. The code is about 1k lines long, and there's a lot of theory and code. When someone goes through this tuto, what should they get out of it exactly (a theory of MARL, a way to code DDPG in MARL, how to use the env)? Can't this be split into more tutos? Or can we remove some extra bits that aren't 100% required to train the algo?

What the tuto is doing is showing the env and the grouping mechanism, and training it.

  • Regarding the code: all the Python code is necessary for training, except the section where I print the env specs and rollout, which I think is needed to show readers the agent groups
  • Regarding the text: this follows the PPO and multi-agent PPO format. I can remove some theory; it was there because it motivates some choices, like how to centralize the critic and so on, but if you think it is too verbose we can cut. I can also cut the DDPG and off-policy explanations

@vmoens
Contributor

vmoens commented Apr 22, 2024

I should have access to S3 soon, bear with me

EDIT: feel free to send me the files via PM and I'll store them on S3

@vmoens
Contributor

vmoens commented Apr 22, 2024

  • Regarding the text: this follows the PPO and multi-agent PPO format. I can remove some theory; it was there because it motivates some choices, like how to centralize the critic and so on, but if you think it is too verbose we can cut. I can also cut the DDPG and off-policy explanations

I guessed so, but most feedback I got, even from very close, top collaborators such as yourself, is that our tutorials are very long and hard to follow, if not worse than that.
If we can make sure that tutorials that are aimed at a large audience (which I suspect is the case here, since it starts with the very basics like off-policy vs on-policy) are also completed in 10 mins tops, I think they will be more impactful.

Concretely, I would either make this tutorial advanced and presuppose that users have extensive knowledge of MARL and RL in general, and just point them to refs in case context is needed, or split the tutorial into 2-5 sub-tutorials.

@matteobettini
Contributor Author

  • Regarding the text: this follows the PPO and multi-agent PPO format. I can remove some theory; it was there because it motivates some choices, like how to centralize the critic and so on, but if you think it is too verbose we can cut. I can also cut the DDPG and off-policy explanations

I guessed so, but most feedback I got, even from very close, top collaborators such as yourself, is that our tutorials are very long and hard to follow, if not worse than that. If we can make sure that tutorials that are aimed at a large audience (which I suspect is the case here, since it starts with the very basics like off-policy vs on-policy) are also completed in 10 mins tops, I think they will be more impactful.

Concretely, I would either make this tutorial advanced and presuppose that users have extensive knowledge of MARL and RL in general, and just point them to refs in case context is needed, or split the tutorial into 2-5 sub-tutorials.

I have removed basically all the RL background intro from the tutorial and am making it advanced, since it is kind of a more complex case than the PPO one.

Contributor

@vmoens vmoens left a comment


What's the status of the RB/Optimizer/Loss grouping? Can we unify that? If not, what's the blocker?

env = TransformedEnv(
    env,
    RewardSum(
        in_keys=env.reward_keys,
        reset_keys=["_reset"] * len(env.group_map.keys()),
    ),
)
Contributor

Got it!
Not super clean that we ask people to pass _reset, which is private.
We should ask for done instead.

tutorials/sphinx-tutorials/multiagent_competitive_ddpg.py (outdated review thread, resolved)
@matteobettini
Contributor Author

Can we unify that? If not, what's the blocker?

I answered that in the thread. We want to keep those separate to allow different choices across groups. Plus, losses cannot physically work with multiple groups.

@matteobettini
Contributor Author

Got it!
Not super clean that we ask people to pass _reset, which is private.
We should ask for done instead.

The problem is that the dones are still one per group plus the shared one; the _reset, on the other hand, is just one.

@vmoens
Contributor

vmoens commented Apr 22, 2024

Got it!
Not super clean that we ask people to pass _reset, which is private.
We should ask for done instead.

The problem is that the dones are still one per group plus the shared one; the _reset, on the other hand, is just one.

IIRC the global reset/done takes precedence over the local ones.

Am I right that what you're trying to do is override the local resets with the global one, then?
Couldn't we use "done" like you're using reset?

@vmoens
Contributor

vmoens commented Apr 22, 2024

Can we unify that? If not, what's the blocker?

I answered that in the thread. We want to keep those separate to allow different choices across groups. Plus, losses cannot physically work with multiple groups.

Sorry I missed this, the thread must have been closed.

I don't understand this statement though, can you elaborate?
In particular

losses cannot physically work with multiple groups

what does "physically" mean here?
What is blocking us from using a single loss across groups?

We want to keep those separate to allow different choices across groups

What "choices"? Hyperparameters?
What I see is that they are all the same, so unless we use different ones it isn't immediately clear.

@matteobettini
Contributor Author

Am I right that what you're trying to do is override the local resets with the global one, then? Couldn't we use "done" like you're using reset?

So in PettingZoo we have the global done and the local dones, but only the global _reset. The local resets just do not exist.

Since the transform needs a reset key for each reward key, we are telling all reward keys to track the global reset.
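
A hypothetical way to see that key layout (the exact key names and grouping below are assumptions based on this discussion, not verified output):

    from torchrl.envs.libs.pettingzoo import PettingZooEnv

    env = PettingZooEnv(task="simple_tag_v3", parallel=True)  # task name assumed
    print(env.reward_keys)  # one nested reward key per group, e.g. ("agent", "reward")
    print(env.done_keys)    # a global done/terminated plus a set per group
    print(env.reset_keys)   # a single global "_reset"

    # RewardSum wants one reset key per in_key, hence the global reset repeated per group:
    reset_keys = ["_reset"] * len(env.group_map.keys())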

@matteobettini
Contributor Author

Can we unify that? If not, what's the blocker?

I answered that in the thread. We want to keep those separate to allow different choices across groups. Plus, losses cannot physically work with multiple groups.

Sorry I missed this, the thread must have been closed.

I don't understand this statement though, can you elaborate? In particular

losses cannot physically work with multiple groups

what does "physically" mean here? What is blocking us from using a single loss across groups?

We want to keep those separate to allow different choices across groups

What "choices"? Hyperparameters? What I see is that they are all the same, so unless we use different ones it isn't immediately clear.

OK, let me elaborate more, because this is quite a fundamental concept for the whole MARL API and BenchMARL.

  • losses are stateful components that only take one set of keys (reward, state, done, and so on). Therefore they cannot operate across multiple groups
  • each group can have different global values, like a global critic output, that sit at the root. Therefore, if you kept multiple groups in the same loss, it would be unclear how to handle these
  • in the example of this tutorial, we stop training one group after x iterations. This would not be possible with a shared buffer/loss/optimizer

The core of the MARL API is that if you want a shared loss/buffer/optimizer you can keep agents in the same group. But when agents are in different groups, they can be decoupled into individual components (sketched after this list) so that you can

  • use different losses/buffers/optimizers for each group
  • train them asynchronously
  • not pollute the data of other groups when writing global keys for your group
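
To make this concrete, a rough sketch (not the tutorial's exact code) of the per-group construction being discussed, assuming hypothetical policies[group] / critics[group] modules built earlier and a DDPG-style loss:

    import torch
    from torchrl.data import LazyTensorStorage, ReplayBuffer
    from torchrl.objectives import DDPGLoss

    losses, buffers, optimizers = {}, {}, {}
    for group in env.group_map.keys():
        # One loss per group: each loss reads only this group's nested keys.
        loss = DDPGLoss(actor_network=policies[group], value_network=critics[group])
        loss.set_keys(
            reward=(group, "reward"),
            done=(group, "done"),
            terminated=(group, "terminated"),
        )
        losses[group] = loss
        # One buffer per group: groups can be sampled at different rates, or not at all.
        buffers[group] = ReplayBuffer(storage=LazyTensorStorage(100_000))
        # One optimizer per group: per-group hyperparameters and independent freezing.
        optimizers[group] = torch.optim.Adam(loss.parameters(), lr=3e-4)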

@vmoens
Contributor

vmoens commented Apr 23, 2024

Thanks for the clarification.
I can see why it isn't easy to have one loss for multiple groups atm, but don't we want that to be a feature, rather than asking users to repeat the tedious construction of N different buffers, N different optimizers and N different losses?
It feels like there's a lot of boilerplate that we could get rid of without too much trouble from within the library.

Since we can use param_groups in the optimizers, we could obtain the same behaviour as having many optimizers by just splitting the param groups carefully. Note that this would also come with significant speedups on CUDA, since you'd be able to update your params asynchronously.
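
A minimal sketch of that param_groups alternative (reusing the hypothetical per-group losses dict from the sketch above; a possible direction, not what the tutorial currently does):

    import torch

    # One optimizer whose param groups mirror the agent groups.
    optimizer = torch.optim.Adam(
        [
            {"params": losses["adversary"].parameters(), "lr": 3e-4},  # chasers
            {"params": losses["agent"].parameters(), "lr": 3e-4},      # evaders
        ]
    )

    # "Stopping" one group's training could then amount to zeroing its learning rate
    # instead of deleting a separate optimizer:
    optimizer.param_groups[1]["lr"] = 0.0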

Train them asynchronously

Isn't the training synchronous here? They're trained sequentially for a given number of steps, no?

in the example of this tutorial, we stop training one group after x iterations. This would not be possible with a shared buffer/loss/optimizer

Can you point me to the part of the tutorial that does that? What I see is that there's a constant number of steps for each component, and the learning rates as well as other hyperparameters are all shared. From a non-MARL-trained eye it seems like there's a bunch of stuff that could be simplified (for instance, why not have one replay buffer from which you sample and then get the relevant keys with indexing / select?)

@matteobettini
Contributor Author

Can you point me to the part of the tutorial that does that? What I see is that there's a constant number of steps for each component, and the learning rates as well as other hyperparameters are all shared. From a non-MARL-trained eye it seems like there's a bunch of stuff that could be simplified (for instance, why not have one replay buffer from which you sample and then get the relevant keys with indexing / select?)

  # Stop training a certain group when a condition is met (e.g., number of training iterations)
  if iteration == iteration_when_stop_training_evaders:
      del train_group_map["agent"]

Again, at some point we may sample more from the buffer of one group, or stop sampling from one group completely, like in this tuto.
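
As a tiny sketch of that idea (reusing the hypothetical per-group buffers dict and the train_group_map from above; the sample sizes are made up):

    samples_per_group = {"adversary": 256, "agent": 64}  # hypothetical: sample chasers more
    for group in list(train_group_map.keys()):           # a removed group is simply skipped
        batch = buffers[group].sample(samples_per_group[group])
        # ... feed `batch` to that group's loss and optimizer as usual ...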

Isn't the training synchronous here? They're trained sequentially for a given number of steps, no?

I mean train them at different times. In this tuto, for the second half of the iterations we are training just one.

I can see why it isn't easy to have one loss for multiple groups atm, but don't we want that to be a feature, rather than asking users to repeat the tedious construction of N different buffers, N different optimizers and N different losses?
It feels like there's a lot of boilerplate that we could get rid of without too much trouble from within the library.

I do not agree. I think this is not a feature we want, as it is good to have the clear separation and have losses be single-group.

Repeating the construction in a for loop is peanuts compared to the complexity that we would introduce by moving the for loop inside the losses (because at the end of the day the for loop will be somewhere, since it is all heterogeneous networks).

I do not think that the current way is boilerplate, as users can just experiment with setting different values for groups:

  • What if they want a different learning rate for the evaders?
  • What if they want MADDPG for chasers and MAPPO for evaders? (at this point one will have a buffer and the other will not)
  • What if they want to use a curriculum of evaders to train the chasers?

All these cases benefit from the flexibility of training groups independently, and I currently see no benefit in making groups share training components (because the for loops would just be nested inside the loss).

But since this is maybe the most important MARL choice of the lib, and kind of the core of my internship, let's maybe also discuss it more offline.

@vmoens vmoens merged commit 934100f into pytorch:main Apr 24, 2024
3 of 4 checks passed
@matteobettini matteobettini deleted the pettingzoo_tuto branch April 24, 2024 15:36