[RLlib] Add and enhance fault-tolerance tests for APPO. #40743

sven1977 · 2023-10-27T10:13:47Z

Enhance CartPoleCrashing environment to add option for stalling (not crashing) for n seconds from time to time.
Add and enhance fault-tolerance tests for APPO (including env stalling tests).
Fix a bug in ActorManager where the on_workers_recreated callback might NOT trigger if an actor was restored as a result of a remote call other than the once-per-iteration ping().

Why are these changes needed?

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: sven1977 <svenmika1977@gmail.com>

…t_tolerance_tests_for_appo

Signed-off-by: sven1977 <svenmika1977@gmail.com>

sven1977 · 2023-12-04T14:59:33Z

rllib/BUILD

@@ -222,6 +222,48 @@ py_test(
    args = ["--dir=tuned_examples/appo"]
 )

+# Tests against crashing or hanging environments.


Added these new fault-tolerance tests for APPO.

Old fault-tolerance tests were for PG, which should be moved to rllib_contrib AND which is synchronous, not asynchronous. We should probably use PPO in the near future to cover that case again.

sven1977 · 2023-12-04T15:00:18Z

rllib/examples/env/cartpole_crashing.py

@@ -11,45 +11,89 @@


 class CartPoleCrashing(CartPoleEnv):
-    """A CartPole env that crashes from time to time.
+    """A CartPole env that crashes (or stalls) from time to time.


Added option to also just make this env stall (not crash) for a while.
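
To illustrate the difference between stalling and crashing, here is a minimal sketch. It is not the actual CartPoleCrashing code: the class is a toy stand-in with no gym dependency, and the names `p_stall` and `stall_time_sec` are assumptions chosen for this example. The point is that a stalling env blocks in `step()` instead of raising, so the worker looks slow rather than dead.

```python
import random
import time


class StallingEnvSketch:
    """Toy stand-in for an env whose step() occasionally stalls.

    Hypothetical illustration only; not RLlib's CartPoleCrashing.
    """

    def __init__(self, p_stall=0.0, stall_time_sec=0.0, seed=None):
        self.p_stall = p_stall  # probability of stalling on each step (assumed name)
        self.stall_time_sec = stall_time_sec  # how long one stall lasts (assumed name)
        self._rng = random.Random(seed)
        self.num_stalls = 0
        self._t = 0

    def _maybe_stall(self):
        # Block the calling thread instead of raising an error.
        if self._rng.random() < self.p_stall:
            self.num_stalls += 1
            time.sleep(self.stall_time_sec)

    def step(self, action):
        self._maybe_stall()
        self._t += 1
        # Dummy (obs, reward, terminated, truncated, info) tuple.
        return self._t, 1.0, False, False, {}


# With p_stall=1.0 every step stalls; with p_stall=0.0 none do.
env = StallingEnvSketch(p_stall=1.0, stall_time_sec=0.01)
start = time.monotonic()
result = env.step(0)
elapsed = time.monotonic() - start

never_stalls = StallingEnvSketch(p_stall=0.0, stall_time_sec=10.0)
never_stalls.step(0)
```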

sven1977 · 2023-12-04T15:00:46Z

rllib/execution/multi_gpu_learner_thread.py

@@ -140,7 +140,12 @@ def __init__(

    @override(LearnerThread)
    def step(self) -> None:
-        assert self.loader_thread.is_alive()
+        if not self.loader_thread.is_alive():


Better error message.
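
As a rough sketch of why the explicit check helps: a bare `assert` fails with no message and is skipped entirely when Python runs with `-O`, whereas an explicit `raise` always fires and can say what went wrong. The class name and the error text below are hypothetical and not the wording used in this PR.

```python
import threading


class LearnerStepSketch:
    """Hypothetical stand-in for a learner that depends on a loader thread."""

    def __init__(self, loader_thread):
        self.loader_thread = loader_thread

    def step(self):
        # Explicit check instead of `assert self.loader_thread.is_alive()`.
        if not self.loader_thread.is_alive():
            raise RuntimeError(
                "The loader thread has died; the learner cannot continue stepping."
            )
        return "stepped"


# A thread that was never started is not alive, so step() raises.
dead_thread = threading.Thread(target=lambda: None)
try:
    LearnerStepSketch(dead_thread).step()
    error_message = None
except RuntimeError as e:
    error_message = str(e)

# A thread that is still running lets step() proceed.
stop = threading.Event()
live_thread = threading.Thread(target=stop.wait, daemon=True)
live_thread.start()
outcome = LearnerStepSketch(live_thread).step()
stop.set()
live_thread.join()
```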

sven1977 · 2023-12-04T15:01:41Z

rllib/tuned_examples/appo/multi-agent-cartpole-crashing-restart-env-appo.yaml

@@ -1,53 +0,0 @@
-multi-agent-cartpole-crashing-appo:


We will move away from supporting restarting individual vector envs, as we are moving to gym.vector.Env anyway. We could only support restarting individual envs if we remained in RLlib's VectorEnv API.

sven1977 · 2023-12-04T15:02:58Z

rllib/utils/actor_manager.py

@@ -376,6 +380,16 @@ def set_actor_state(self, actor_id: int, healthy: bool) -> None:
        """
        if actor_id not in self.__remote_actor_states:
            raise ValueError(f"Unknown actor id: {actor_id}")
+
+        was_healthy = self.__remote_actor_states[actor_id].is_healthy


This was actually a bug:
The Algorithm.on_workers_recreated callback would NOT fire in case an actor got restarted b/c of another remote request made to it (other than the ping() in this method here, which is only called once per training iteration).

So if an actor died e.g. during sampling, but then was restarted by Ray core and was re-discovered by the manager via an attempted call to sample, then this restart would NOT have been captured by the Algorithm's on_workers_recreated callbacks b/c the succeeding once-per-iteration ping() would have already been performed on the restarted and healthy actor.
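
The idea behind the fix can be sketched as follows. This is a simplified toy, not the real ActorManager: the class and the `drain_restored_actors` method are hypothetical names, and remote calls and pings are left out. Every unhealthy-to-healthy transition is recorded at the moment the state is set, regardless of which remote call noticed it, and the once-per-iteration step then only has to drain that record.

```python
class ActorManagerSketch:
    """Toy model of tracking restored actors; hypothetical, not RLlib's API."""

    def __init__(self, actor_ids):
        self._healthy = {actor_id: True for actor_id in actor_ids}
        self._restored = set()

    def set_actor_state(self, actor_id, healthy):
        if actor_id not in self._healthy:
            raise ValueError(f"Unknown actor id: {actor_id}")
        was_healthy = self._healthy[actor_id]
        if not was_healthy and healthy:
            # Restored: remember it, whichever call discovered the restart.
            self._restored.add(actor_id)
        elif was_healthy and not healthy:
            # Died (again) before anyone was told about a restart.
            self._restored.discard(actor_id)
        self._healthy[actor_id] = healthy

    def drain_restored_actors(self):
        # Called once per iteration; returns each restart exactly once.
        restored = sorted(self._restored)
        self._restored.clear()
        return restored


mgr = ActorManagerSketch([1, 2])
mgr.set_actor_state(1, False)  # actor 1 dies, e.g. during sampling
mgr.set_actor_state(1, True)   # a later sample call finds it restarted
first = mgr.drain_restored_actors()
second = mgr.drain_restored_actors()
```

In this sketch the restart of actor 1 is reported on the first drain even though no ping ever saw the actor in an unhealthy state, which is the case the old code missed.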

sven1977 · 2023-12-04T15:03:51Z

rllib/utils/actor_manager.py

        """
+        # Collect recently restored actors (from `self.__fetch_result` calls other than
+        # the one triggered here via the `ping`).
+        restored_actors = list(self.__restored_actors)


This was actually a bug:
The Algorithm.on_workers_recreated callback would NOT fire in case an actor got restarted b/c of another remote request made to it (other than the ping() in this method here, which is only called once per training iteration).

more explanations: see comment above

Signed-off-by: sven1977 <svenmika1977@gmail.com>

sven1977 · 2023-12-04T15:28:21Z

rllib/tuned_examples/pg/cartpole-crashing-pg.yaml

@@ -1,45 +0,0 @@
-cartpole-crashing-pg:


These tests had already been removed in a previous PR.

Signed-off-by: sven1977 <svenmika1977@gmail.com>

kouroshHakha · 2023-12-04T16:29:21Z

rllib/algorithms/tests/test_callbacks.py

@@ -102,39 +103,48 @@ def on_episode_created(
 class TestCallbacks(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
-        ray.init()
+        ray.init(num_cpus=12)


Please don’t add num cpus. On workspaces you cannot run this code.

kouroshHakha · 2023-12-04T16:30:34Z

rllib/algorithms/tests/test_callbacks.py

-            for _ in range(3):
-                algo.train()
+            for _ in range(5):
+                print(algo.train())


No need to print?

It's clearer to see the results when you are debugging/watching the test run, no? Leaving this in.

kouroshHakha · 2023-12-04T16:36:53Z

rllib/examples/env/cartpole_crashing.py

            raise EnvError(
-                "Simulated env crash in `reset()`! Feel free to use any "
-                "other exception type here instead."
+                # f"Simulated env crash on worker={self.config.worker_index} "


Fixed. This was a problem with the EnvContext being passed in with an empty dict and then in the env code:

self.config = config or {}

Even if config is an EnvContext (with no dict settings), Python would still choose the empty dict here, which then does NOT have a worker_index property (b/c it's a dict, not an EnvContext).
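
The pitfall can be reproduced in a few lines. The sketch below assumes the context object behaves like a dict subclass carrying a `worker_index` attribute; `EnvContextSketch` is a hypothetical stand-in, not RLlib's EnvContext. An empty dict subclass is falsy, so `config or {}` silently swaps it for a plain dict, while an explicit `is not None` check keeps the original object.

```python
class EnvContextSketch(dict):
    """Hypothetical stand-in: a dict subclass with extra attributes."""

    def __init__(self, env_config, worker_index=0):
        super().__init__(env_config)
        self.worker_index = worker_index


ctx = EnvContextSketch({}, worker_index=3)

# Buggy pattern: an empty mapping is falsy, so the plain {} is chosen.
buggy_config = ctx or {}

# Safer pattern: only fall back when no config was passed at all.
fixed_config = ctx if ctx is not None else {}

buggy_has_index = hasattr(buggy_config, "worker_index")
fixed_index = fixed_config.worker_index
```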

kouroshHakha · 2023-12-04T16:41:10Z

rllib/utils/actor_manager.py

        """
+        # Collect recently restored actors (from `self.__fetch_result` calls other than


Remove comment?

kouroshHakha · 2023-12-04T16:42:32Z

RLlib tests are failing. Please don’t merge unless those are resolved.

Signed-off-by: sven1977 <svenmika1977@gmail.com>

…t_tolerance_tests_for_appo

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

11fa93f

Signed-off-by: sven1977 <svenmika1977@gmail.com>

sven1977 requested review from avnishn, ArturNiederfahrenhorst, smorad, maxpumperla and kouroshHakha as code owners October 27, 2023 10:13

sven1977 added 10 commits October 27, 2023 13:36

wip

3515c90

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

a1c9228

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

d040ae0

Signed-off-by: sven1977 <svenmika1977@gmail.com>

Merge branch 'master' of https://github.com/ray-project/ray into faul…

5ccd02a

…t_tolerance_tests_for_appo

wip

a018866

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

af2346e

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

c46778d

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

510823c

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

b45d4c2

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

e7af56f

Signed-off-by: sven1977 <svenmika1977@gmail.com>

sven1977 commented Dec 4, 2023

wip

6523878

Signed-off-by: sven1977 <svenmika1977@gmail.com>

sven1977 commented Dec 4, 2023

wip

e5d03d4

Signed-off-by: sven1977 <svenmika1977@gmail.com>

sven1977 assigned kouroshHakha Dec 4, 2023

kouroshHakha reviewed Dec 4, 2023

kouroshHakha approved these changes Dec 4, 2023

wip

f25f4ac

Signed-off-by: sven1977 <svenmika1977@gmail.com>

sven1977 added 8 commits December 5, 2023 11:29

wip

7f9d024

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

ebfca45

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

a5cb79e

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

43fa7f7

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

d1a9e7a

Signed-off-by: sven1977 <svenmika1977@gmail.com>

Merge branch 'master' of https://github.com/ray-project/ray into faul…

77a7140

…t_tolerance_tests_for_appo

wip

a3824dd

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

b0bebed

Signed-off-by: sven1977 <svenmika1977@gmail.com>

sven1977 merged commit 7001982 into ray-project:master Dec 8, 2023
8 of 15 checks passed
