[data] Improve stall detection for StreamingOutputsBackpressurePolicy #41637
Conversation
Let's add unit tests for the backpressure policy itself (no Dataset execution)?
Can you also address the comment about adding a unit test for the policy?
reserved_resources: Amount of reserved resources for non-Ray-Data
    workloads. Ray Data will exclude these resources when scheduling tasks,
    unless resource_limits is manually set.
Suggested change:
-reserved_resources: Amount of reserved resources for non-Ray-Data
+exclude_resources: Amount of reserved resources for non-Ray-Data
 workloads. Ray Data will exclude these resources when scheduling tasks,
 unless resource_limits is manually set.
I just realized reserved_resources sounds a bit like resources reserved for Data, which is the opposite of what we want.
@@ -105,6 +108,10 @@ class ExecutionOptions:
     resource_limits: ExecutionResources = field(default_factory=ExecutionResources)
+    reserved_resources: ExecutionResources = field(
Can we add a check that this is not set if resource_limits is manually specified?
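A minimal sketch of the requested check, assuming ExecutionOptions remains a dataclass as in the hunk above; the simplified ExecutionResources stand-in, the renamed exclude_resources field, and the equality-based "is it set?" test are assumptions, not the real implementation:

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionResources:
    # Simplified stand-in for the real class.
    cpu: float = 0.0
    gpu: float = 0.0
    object_store_memory: float = 0.0


@dataclass
class ExecutionOptions:
    resource_limits: ExecutionResources = field(default_factory=ExecutionResources)
    exclude_resources: ExecutionResources = field(default_factory=ExecutionResources)

    def __post_init__(self):
        # exclude_resources has no effect once resource_limits is set, so
        # reject configurations that specify both. This only catches values
        # passed at construction time, not later attribute assignments.
        default = ExecutionResources()
        if self.resource_limits != default and self.exclude_resources != default:
            raise ValueError(
                "resource_limits and exclude_resources cannot both be set."
            )
```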
"Temporarily unblocking backpressure." | ||
f" Because some tasks of operator {op} have been submitted," | ||
f" but no outputs are generated for {no_output_time} seconds." | ||
" Ignore this warning if your UDF is expected to be slow." | ||
" This may also be because some resources are preempted by" | ||
" non-Ray-Data workloads." | ||
" If this is the case, set `ExecutionOptions.reserved_resources`." | ||
" This message will only be printed once." |
Can we make the error message expose fewer implementation details? I don't think users will necessarily know what "backpressure" means, for example.
Maybe something like:
"Operator {op} is running, but has not produced outputs for {no_output_time}s. Ignore this warning if your UDF is expected to be slow.
Otherwise, this can happen when there are fewer cluster resources available to Ray Data than expected. If you have non-Data tasks or actors running in the cluster, reserve resources for them with ray.data.ExecutionOptions.reserved_resources = {"num_cpus": <CPUs to exclude>}
."
@stephanie-wang thanks. Unit test is added and the other comments are addressed too.
assert res == {
    up_state: 0,
    down_state: self._max_blocks_in_op_output_queue,
}
Nice unit tests!
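For readers without the full diff, a self-contained sketch of what a policy-level test like this looks like; the simplified policy class and method names are illustrative stand-ins, not the real StreamingOutputBackpressurePolicy API:

```python
class SimplifiedOutputBackpressurePolicy:
    """Illustrative stand-in: caps each operator's queued output blocks."""

    def __init__(self, max_blocks_in_op_output_queue: int):
        self._max_blocks_in_op_output_queue = max_blocks_in_op_output_queue

    def calculate_max_blocks_to_read(self, queued_blocks_per_op: dict) -> dict:
        # Each operator may emit only as many blocks as still fit in its
        # output queue; a full queue yields a budget of 0 (backpressure).
        return {
            op: max(0, self._max_blocks_in_op_output_queue - queued)
            for op, queued in queued_blocks_per_op.items()
        }


def test_budgets_without_dataset_execution():
    policy = SimplifiedOutputBackpressurePolicy(max_blocks_in_op_output_queue=4)
    res = policy.calculate_max_blocks_to_read({"up_state": 4, "down_state": 0})
    # The upstream queue is full, so it is throttled to 0; the empty
    # downstream queue gets the full budget, mirroring the assert above.
    assert res == {"up_state": 0, "down_state": 4}
```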
exclude_resources: Amount of resources to exclude from Ray Data.
    Set this if you have other workloads running on the same cluster.
    For Ray Data + Ray Train, this should be automatically set.
    Note for each resource type, resource_limits and exclude_resources cannot both be set.
Just a comment to update the doc here: resources from DataConfig add to exclude_resources instead of overwriting it.
Might be good to add a test for this case also.
yeah, updated the unit test to reflect this change.
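A tiny sketch of the additive behavior being documented and tested here, reusing the simplified ExecutionResources stand-in from the earlier sketch; the combine helper is hypothetical, not the real merging code:

```python
def combine_exclude_resources(
    user_specified: ExecutionResources,
    from_data_config: ExecutionResources,
) -> ExecutionResources:
    # Resources reserved by Train's DataConfig are added to any
    # user-specified exclude_resources rather than overwriting them.
    return ExecutionResources(
        cpu=user_specified.cpu + from_data_config.cpu,
        gpu=user_specified.gpu + from_data_config.gpu,
        object_store_memory=user_specified.object_store_memory
        + from_data_config.object_store_memory,
    )
```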
Why are these changes needed?
When there is non-Data code running in the same cluster, the Data StreamExecutor considers all submitted tasks active, even though they may not actually have the resources to run.
#41603 is an attempt to fix the Data + Train workload by excluding training resources. This PR is a more general fix that covers other workloads as well, with two main changes:
1. Besides detecting active tasks, we also detect whether the downstream has made no progress for a specific interval (a sketch of this follows below).
2. Introduce a new `reserved_resources` option to allow specifying non-Data resources.
This PR alone can also fix #41496.
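A minimal sketch of change (1); the interval constant and bookkeeping below are illustrative, not the actual implementation:

```python
import time

# Assumed stall interval; not the real default.
MAX_OUTPUT_IDLE_TIME_S = 10.0


class StallDetector:
    """Tracks per-operator output progress in addition to active tasks."""

    def __init__(self):
        self._last_output_time = {}

    def on_output(self, op) -> None:
        self._last_output_time[op] = time.time()

    def should_unblock(self, op, has_active_tasks: bool) -> bool:
        # Start the clock the first time we check this operator.
        last = self._last_output_time.setdefault(op, time.time())
        idle_for = time.time() - last
        # Tasks were submitted but produced no outputs for the whole
        # interval: resources may be preempted by non-Data workloads, so
        # temporarily lift backpressure instead of stalling indefinitely.
        return has_active_tasks and idle_for > MAX_OUTPUT_IDLE_TIME_S
```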
Related issue number
Closes #41496
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.