
[core][compiled-graphs] Support wait-and-get to round-robin the acquisition of mutable objects allowing for fast failure #49444

Closed

Conversation


@kevin85421 (Member) commented on Dec 26, 2024

Why are these changes needed?

Issue statement

https://gist.github.com/kevin85421/a7f14ea38d64420b105fbd79fd31fb8a

Without this PR, both SynchronousReader._read_list and AwaitableBackgroundReader._run call each input channel's read function sequentially. If the first input channel is backed by a long-running task and the second one fails immediately, the reader still has to wait for the first read to return before it reaches the failing channel and raises the error.
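
As a rough illustration, and not the actual Ray code (the function below is a placeholder), a strictly sequential read loop looks like this, which is why a failure in a later channel only surfaces after every earlier read completes:

    # Hypothetical sketch of the pre-PR behavior: channels are read strictly in
    # order, so an error sitting in a later channel is only observed after every
    # earlier (possibly long-running) read has returned.
    def read_list_sequentially(input_channels, timeout=None):
        results = []
        for channel in input_channels:
            # If input_channels[0] is backed by a long-running task, this call
            # blocks until that task finishes, even if input_channels[1] already
            # holds an exception that could be raised immediately.
            results.append(channel.read(timeout))
        return results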

As described in #46337:

This is a problem for tensor-parallel inference, because all of the workers execute in lockstep, and if one actor throws an exception, the others may hang. Depending on the order of the actors, ray.get() may never return.

Implementation details

  • experimental_wait_and_get_mutable_objects / CoreWorker::WaitAndGetExperimentalMutableObjects

    • WaitAndGetExperimentalMutableObjects iterates through a list of mutable object references and retrieves the data when the objects are ready. The function returns when either num_objects mutable objects are acquired or the operation times out.
  • SynchronousReader._read_list / AwaitableBackgroundReader._run

    • _get_all_waitables_to_num_consumers: Iterate through self._input_channels and call get_ray_waitables to retrieve all of the underlying mutable object references.
    • worker.experimental_wait_and_get_mutable_objects: Attempt to retrieve one mutable object at a time from the list of mutable object references. If a returned value is a RayTaskError, return immediately and raise the exception; otherwise, write the returned value into ChannelContext. (A simplified sketch of this flow follows this list.)
    • After all mutable objects have been retrieved, iterate through self._input_channels and call channel.read().
  • Channel.read (shared_memory_channel.py)

    • Because the mutable objects have already been retrieved in _read_list, Channel.read fetches the data from ChannelContext instead of the object store.
  • get_ray_waitables

    • Retrieves the underlying mutable object references that the next read operation will access. Because of this, get_ray_waitables may return different results across read calls on the same channel. For example, BufferedSharedMemoryChannel's get_ray_waitables:

      def get_ray_waitables(self) -> List[Tuple[ObjectRef, bool]]:
          self.ensure_registered_as_reader()
          return self._buffers[self._next_read_index].get_ray_waitables()
  • Special case: TorchTensorNcclChannel: See the comments in the file for more details.

    • Call CPU write before NCCL write. Since the channel's read is only called after all required mutable objects have been retrieved, NCCL read will only be invoked after the reader has already retrieved the mutable object. Therefore, CPU write must occur before NCCL write to avoid a deadlock.
    • _read_list and _run will skip deserialization for the TorchTensorNcclChannel's mutable object because TorchTensorNcclChannel relies on a custom serializer, which replaces placeholders in the CPU data with tensors read from the NCCL channel during deserialization.
      • If we deserialize the mutable object in _read_list or _run before the reader has retrieved the GPU tensors via the NCCL channel and placed the out-of-band tensors into the serialization context, issues may arise.
      • Instead, the reader will deserialize the CPU data after the out-of-band tensors are ready in the channel's read operation.
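
The following is a simplified sketch (referenced from the _read_list item above) of the flow this list describes. It is an illustration only: the exact signature and return type of worker.experimental_wait_and_get_mutable_objects, the num_objects and timeout arguments, and the dictionary bookkeeping are assumptions made for readability, and the skip-deserialization handling for TorchTensorNcclChannel is omitted.

    from typing import Any, Dict, List

    from ray.exceptions import RayTaskError

    def read_list_with_wait_and_get(worker, input_channels, timeout=None) -> List[Any]:
        # 1. Collect the underlying mutable object refs that each channel's next
        #    read plans to access.
        waitables = []
        for channel in input_channels:
            waitables.extend(ref for ref, _ in channel.get_ray_waitables())

        # 2. Round-robin over the waitables, acquiring whichever becomes ready
        #    first, and fail fast if any acquired value is an error.
        acquired: Dict[Any, Any] = {}
        while len(acquired) < len(waitables):
            pending = [ref for ref in waitables if ref not in acquired]
            # Assumed to return a mapping from ref to value once num_objects of
            # the pending refs are ready, or to time out otherwise.
            ready = worker.experimental_wait_and_get_mutable_objects(
                pending, num_objects=1, timeout=timeout
            )
            for ref, value in ready.items():
                if isinstance(value, RayTaskError):
                    # Fast failure: surface the error without waiting on the
                    # remaining, possibly long-running, channels.
                    raise value
                acquired[ref] = value  # In the PR this is written into ChannelContext.

        # 3. Only after every mutable object is available, perform the channel
        #    reads, which now fetch already-acquired data instead of blocking.
        return [channel.read(timeout) for channel in input_channels]

The key property is that the reader never blocks on a single slow channel while another channel already holds an error.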

Related issue number

Closes #46337

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@kevin85421 changed the title from "WIP - 2" to "[core][experimental] ray.get of accelerated DAG result may not throw exception for MultiOutputNode" on Dec 30, 2024
@kevin85421 changed the title from "[core][experimental] ray.get of accelerated DAG result may not throw exception for MultiOutputNode" to "[core][compiled-graphs] ray.get of accelerated DAG result may not throw exception for MultiOutputNode" on Dec 31, 2024
@kevin85421 changed the title to "[core][compiled-graphs] Support wait-and-get to round-robin the acquisition of mutable objects, allowing for fast failure" on Dec 31, 2024
@kevin85421 marked this pull request as ready for review on December 31, 2024
@kevin85421 changed the title to "[core][compiled-graphs] Support wait-and-get to round-robin the acquisition of mutable objects allowing for fast failure" on Dec 31, 2024

@ruisearch42 (Contributor) left a comment:


Initial pass, partial review

Comment on lines +435 to +442
(
    waitables_to_num_consumers,
    skip_deserialization_waitables_to_num_consumers,
) = self._get_all_waitables_to_num_consumers()
normal_waitables = list(waitables_to_num_consumers.keys())
skip_deserialization_waitables = list(
    skip_deserialization_waitables_to_num_consumers.keys()
)

Contributor:
looks like these are static? should we do it at init time?

kevin85421 (Member, Author):
move to ReaderInterface constructor.

kevin85421 (Member, Author):
After giving it a second thought, I realized it is not static. For example, the get_ray_waitables method of BufferedSharedMemoryChannel should return the buffer that will be read in the current read operation. Therefore, the return value of get_ray_waitables is not always the same.

kevin85421 (Member, Author):

    def get_ray_waitables(self) -> List[Tuple[ObjectRef, bool]]:
        self.ensure_registered_as_reader()
        return self._buffers[self._next_read_index].get_ray_waitables()

@@ -193,12 +198,31 @@ def _send_cpu_and_gpu_data(self, value: Any, timeout: Optional[float]):
        # normally.
        self.serialization_ctx.set_use_external_transport(False)

        # First send the extracted tensors through a GPU-specific channel.

kevin85421 (Member, Author):
NCCL write -> NCCL read -> all mutable objects are ready -> _cpu_data_channel.write -> NCCL write

int64_t remaining_timeout = timeout_ms == -1 ? 1e9 : timeout_ms;
auto timeout_point = ToTimeoutPoint(remaining_timeout);
int64_t iteration_timeout =
    std::min(remaining_timeout, RayConfig::instance().get_timeout_milliseconds());

Contributor:
I think this env variable should be renamed; it's kind of misleading for both core and cgraph. Maybe in a separate PR, but it should be something like get_iteration_timeout_milliseconds.

kevin85421 (Member, Author):
Agree, the name is misleading.


// Try to acquire the object.
Status s = experimental_mutable_object_provider_->ReadAcquire(
    ids[i], results[i], iteration_timeout);

Contributor:
Also, now that a timeout is always guaranteed to be present, ReadAcquire shouldn't take an int that could be -1; just pass it a non-optional timeout_point from here.

kevin85421 (Member, Author):
Good point. I will address it in a separate PR to avoid making this PR bigger.

@stephanie-wang (Contributor):

Will this approach work? Modify SynchronousReader._read_list to try to read each channel up to a set timeout (say 100ms), then try the next one?

@kevin85421 (Member, Author):

> Will this approach work? Modify SynchronousReader._read_list to try to read each channel up to a set timeout (say 100ms), then try the next one?

NcclCommunicator does not seem to support send/recv with a timeout, and we need to ensure that other channels in the future also support timeouts.

If a read operation in a channel has multiple points where a timeout exception can be thrown, we need to distinguish them and handle them differently to recover from any side effects, especially if we still want to reuse the DAG.

@kevin85421 (Member, Author):

I synced with @stephanie-wang offline. I will try using the channel read with a timeout and ensure that the channels are idempotent.

> If a read operation in a channel has multiple points where a timeout exception can be thrown, we need to distinguish them and handle them differently to recover from any side effects, especially if we still want to reuse the DAG.

For this concern, we decided to simply rely on the end-to-end (e2e) timeout.
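
For reference, the per-channel-timeout round-robin idea discussed above could look roughly like the sketch below. This is an illustration only: the 0.1-second timeout, the TimeoutError type, and the assumption that channel.read accepts a timeout keyword are placeholders, and it presumes channel reads are idempotent.

    # Hypothetical sketch of the per-channel-timeout alternative: cycle through
    # the channels, giving each read a short timeout, until every channel has
    # produced a value or raised a non-timeout error.
    def read_list_round_robin(input_channels, per_read_timeout=0.1):
        results = {}
        while len(results) < len(input_channels):
            for i, channel in enumerate(input_channels):
                if i in results:
                    continue
                try:
                    results[i] = channel.read(timeout=per_read_timeout)
                except TimeoutError:
                    # Not ready yet; retry this channel on the next pass.
                    continue
        return [results[i] for i in range(len(input_channels))]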

@kevin85421 (Member, Author):

We decided to proceed with #49711 instead of this PR.

@kevin85421 closed this on Jan 11, 2025

Successfully merging this pull request may close these issues.

[core][experimental] ray.get of accelerated DAG result may not throw exception for MultiOutputNode