Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RLlib] Add and enhance fault-tolerance tests for APPO. #40743

Merged
merged 22 commits into from
Dec 8, 2023

Conversation

sven1977
Copy link
Contributor

@sven1977 sven1977 commented Oct 27, 2023

  • Enhance CartPoleCrashing environment to add option for stalling (not crashing) for n seconds from time to time.
  • Add and enhance fault-tolerance tests for APPO (including env stalling tests).
  • Fix a bug in ActorManager and the on_workers_recreated callback, which might NOT trigger in case an actor has been restored due to a remote call different from the once-per-iteration ping().

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
@@ -222,6 +222,48 @@ py_test(
args = ["--dir=tuned_examples/appo"]
)

# Tests against crashing or hanging environments.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added these new fault-tolerant tests for APPO.

Old fault-tolerance tests were for PG, which should be moved to rllib_contrib AND which is synchronous, not asynchronous. We should probably use PPO in the near future to cover that case again.

@@ -11,45 +11,89 @@


class CartPoleCrashing(CartPoleEnv):
"""A CartPole env that crashes from time to time.
"""A CartPole env that crashes (or stalls) from time to time.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added option to also just make this env stall (not crash) for a while.

@@ -140,7 +140,12 @@ def __init__(

@override(LearnerThread)
def step(self) -> None:
assert self.loader_thread.is_alive()
if not self.loader_thread.is_alive():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better error message.

@@ -1,53 +0,0 @@
multi-agent-cartpole-crashing-appo:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will move away from supporting restarting individual vector envs as we are moving to gym.vector.Env anyways. We could only support restarting individual envs if we remained in RLlib's VectorEnv API.

@@ -376,6 +380,16 @@ def set_actor_state(self, actor_id: int, healthy: bool) -> None:
"""
if actor_id not in self.__remote_actor_states:
raise ValueError(f"Unknown actor id: {actor_id}")

was_healthy = self.__remote_actor_states[actor_id].is_healthy
Copy link
Contributor Author

@sven1977 sven1977 Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually a bug:
The Algorithm.on_workers_recreated callback would NOT fire in case an actor got restarted b/c of another remote request made to it (other than the ping() in this method here, which is only called once per training iteration).

So if an actor died e.g. during sampling, but then was restarted by Ray core and was re-discovered by the manager via an attempted call to sample, then this restart would NOT have been captured by the Algorithm's on_workers_recreated callbacks b/c the succeeding once-per-iteration ping() would have already been performed on the restarted and healthy actor.

"""
# Collect recently restored actors (from `self.__fetch_result` calls other than
# the one triggered here via the `ping`).
restored_actors = list(self.__restored_actors)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually a bug:
The Algorithm.on_workers_recreated callback would NOT fire in case an actor got restarted b/c of another remote request made to it (other than the ping() in this method here, which called once per training iteration).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more explanations: see comment above

Signed-off-by: sven1977 <svenmika1977@gmail.com>
@@ -1,45 +0,0 @@
cartpole-crashing-pg:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests had been already removed in a previous PR.

Signed-off-by: sven1977 <svenmika1977@gmail.com>
@@ -102,39 +103,48 @@ def on_episode_created(
class TestCallbacks(unittest.TestCase):
@classmethod
def setUpClass(cls):
ray.init()
ray.init(num_cpus=12)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don’t add num cpus. On workspaces you cannot run this code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

for _ in range(3):
algo.train()
for _ in range(5):
print(algo.train())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to print?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's clearer to see the results when you are debugging/watching the test run, no? Leaving this in.

raise EnvError(
"Simulated env crash in `reset()`! Feel free to use any "
"other exception type here instead."
# f"Simulated env crash on worker={self.config.worker_index} "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. This was a problem with the EnvContext being passed in with an empty dict and then in the env code:

self.config = config or {}

Even if config is a EnvContext (with no dict settings), python would still chose the empty dict here, which then does NOT have a worker_indexproperty (b/c it's a dict, not an EnvContext).

"""
# Collect recently restored actors (from `self.__fetch_result` calls other than
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@kouroshHakha
Copy link
Contributor

kouroshHakha commented Dec 4, 2023

RLlib tests are failing. Please don’t merge unless those are resolved.

Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
@sven1977 sven1977 merged commit 7001982 into ray-project:master Dec 8, 2023
8 of 15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants