[RLlib] Add and enhance fault-tolerance tests for APPO. #40743
Conversation
@@ -222,6 +222,48 @@ py_test(
     args = ["--dir=tuned_examples/appo"]
 )

+# Tests against crashing or hanging environments.
Added these new fault-tolerance tests for APPO.
The old fault-tolerance tests were for PG, which should be moved to rllib_contrib AND which is synchronous, not asynchronous. We should probably use PPO in the near future to cover that case again.
@@ -11,45 +11,89 @@

 class CartPoleCrashing(CartPoleEnv):
-    """A CartPole env that crashes from time to time.
+    """A CartPole env that crashes (or stalls) from time to time.
Added an option to also just make this env stall (not crash) for a while.
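A minimal sketch of what such an env could look like (assumptions: gymnasium's `CartPoleEnv` as the base class, and illustrative parameter names like `p_crash`, `p_stall`, and `stall_time_sec` — this is not RLlib's actual implementation):

```python
import time

import numpy as np
from gymnasium.envs.classic_control import CartPoleEnv


class CrashingOrStallingCartPole(CartPoleEnv):
    """Sketch: a CartPole that sometimes raises an error or sleeps in `step()`."""

    def __init__(self, p_crash=0.001, p_stall=0.001, stall_time_sec=10.0):
        super().__init__()
        self.p_crash = p_crash
        self.p_stall = p_stall
        self.stall_time_sec = stall_time_sec

    def step(self, action):
        # Simulate a hanging (stalling) env by sleeping for a while.
        if np.random.random() < self.p_stall:
            time.sleep(self.stall_time_sec)
        # Simulate a hard env crash via an exception.
        if np.random.random() < self.p_crash:
            raise RuntimeError("Simulated env crash in `step()`!")
        return super().step(action)
```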
@@ -140,7 +140,12 @@ def __init__(

     @override(LearnerThread)
     def step(self) -> None:
-        assert self.loader_thread.is_alive()
+        if not self.loader_thread.is_alive():
Better error message.
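For illustration, the shape of this change (a toy stand-in, not the exact RLlib diff): replace the bare `assert` with an explicit check that raises a descriptive error pointing at the root cause.

```python
import threading


class LearnerThreadSketch(threading.Thread):
    """Toy stand-in for the learner thread (illustrative only)."""

    def __init__(self, loader_thread: threading.Thread):
        super().__init__()
        self.loader_thread = loader_thread

    def step(self) -> None:
        # Before: `assert self.loader_thread.is_alive()` -> an opaque
        # AssertionError with no hint at what went wrong.
        # After: an explicit check with a descriptive message.
        if not self.loader_thread.is_alive():
            raise RuntimeError(
                "The data-loader thread has died! Check the error output "
                "above for the actual root cause of this failure."
            )
```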
@@ -1,53 +0,0 @@
-multi-agent-cartpole-crashing-appo:
We will move away from supporting restarts of individual vector sub-envs, as we are moving to gym.vector.Env anyway. We could only support restarting individual sub-envs if we remained on RLlib's VectorEnv API.
@@ -376,6 +380,16 @@ def set_actor_state(self, actor_id: int, healthy: bool) -> None:
         """
         if actor_id not in self.__remote_actor_states:
             raise ValueError(f"Unknown actor id: {actor_id}")

+        was_healthy = self.__remote_actor_states[actor_id].is_healthy
This was actually a bug:
The `Algorithm.on_workers_recreated` callback would NOT fire in case an actor got restarted b/c of another remote request made to it (other than the `ping()` in this method here, which is only called once per training iteration).
So if an actor died e.g. during sampling, but then was restarted by Ray Core and re-discovered by the manager via an attempted call to `sample()`, this restart would NOT have been captured by the Algorithm's `on_workers_recreated` callbacks, b/c the succeeding once-per-iteration `ping()` would have already been performed on the (by then restarted and healthy) actor.
""" | ||
# Collect recently restored actors (from `self.__fetch_result` calls other than | ||
# the one triggered here via the `ping`). | ||
restored_actors = list(self.__restored_actors) |
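A minimal sketch of the pattern behind this fix (class and attribute names here are illustrative, not RLlib's actual API): every unhealthy-to-healthy transition is recorded in a set, no matter which remote call discovered it, and the once-per-iteration probe drains that set, so no restart can slip past the `on_workers_recreated` callback.

```python
class ActorManagerSketch:
    """Illustrative-only manager that never loses track of restarts."""

    def __init__(self):
        self._healthy = {}             # actor_id -> bool
        self._restored_actors = set()  # restarts seen since the last probe

    def set_actor_state(self, actor_id: int, healthy: bool) -> None:
        was_healthy = self._healthy.get(actor_id, True)
        # An unhealthy->healthy transition means the actor was restarted;
        # remember it regardless of WHICH remote call discovered it
        # (`sample()`, `ping()`, ...).
        if not was_healthy and healthy:
            self._restored_actors.add(actor_id)
        self._healthy[actor_id] = healthy

    def probe_unhealthy_actors(self) -> list:
        # Also return actors restored between probes (e.g. rediscovered via
        # a failed-then-retried `sample()` call), then reset the set.
        restored = list(self._restored_actors)
        self._restored_actors.clear()
        return restored
```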
More explanations: see the comment above.
@@ -1,45 +0,0 @@
-cartpole-crashing-pg:
These tests had already been removed in a previous PR.
@@ -102,39 +103,48 @@ def on_episode_created(

 class TestCallbacks(unittest.TestCase):
     @classmethod
     def setUpClass(cls):
-        ray.init()
+        ray.init(num_cpus=12)
Please don't add `num_cpus`; on workspaces you cannot run this code.
done
-        for _ in range(3):
-            algo.train()
+        for _ in range(5):
+            print(algo.train())
No need to print?
It's clearer to see the results when you are debugging/watching the test run, no? Leaving this in.
         raise EnvError(
             "Simulated env crash in `reset()`! Feel free to use any "
             "other exception type here instead."
             # f"Simulated env crash on worker={self.config.worker_index} "
Remove?
Fixed. This was a problem with the EnvContext being passed in as an empty dict and then, in the env code:
`self.config = config or {}`
Even if `config` is an EnvContext (with no dict settings), Python would still choose the empty dict here (an empty EnvContext is falsy), which then does NOT have a `worker_index` property (b/c it's a plain dict, not an EnvContext).
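A standalone illustration of this pitfall (the `EnvContext` below is a toy stand-in, not RLlib's class): an EnvContext with no dict entries is falsy, so `config or {}` silently swaps it for a plain dict without the `worker_index` attribute.

```python
class EnvContext(dict):
    """Toy stand-in: a dict subclass carrying extra attributes."""

    def __init__(self, data=None, worker_index=0):
        super().__init__(data or {})
        self.worker_index = worker_index


ctx = EnvContext(worker_index=3)  # no dict entries -> len(ctx) == 0 -> falsy

# Buggy: the empty EnvContext is falsy, so Python picks the plain `{}`,
# which has no `worker_index` attribute.
config = ctx or {}
print(hasattr(config, "worker_index"))  # False

# Fixed: only fall back to `{}` when `config` is actually None.
config = ctx if ctx is not None else {}
print(config.worker_index)  # 3
```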
""" | ||
# Collect recently restored actors (from `self.__fetch_result` calls other than |
Remove comment?
fixed
RLlib tests are failing. Please don't merge unless those are resolved.
Why are these changes needed?
Fixes the `on_workers_recreated` callback, which might NOT trigger in case an actor has been restored due to a remote call different from the once-per-iteration `ping()`.
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.