[core] Support mutable plasma objects #41515

stephanie-wang · 2023-11-30T01:16:03Z

Why are these changes needed?

This is a first pass on introducing an experimental "channel" concept that can be used for direct worker-worker communication, bypassing the usual Ray Core components like the driver and raylet.

Channels are implemented as mutable plasma objects. The object can be written multiple times by a client. The writer must specify the number of reads that can be made before the written object value is no longer valid. Reads block until the specified version or a later one is available. Writes block until all readers of the previous version have released the object. Synchronization between a single writer and multiple readers is performed through a new header for plasma objects that is stored in shared memory.

API:

- channel: Channel = ray.experimental.channel.Channel(buf_size): Client uses the normal ray.put path to create a mutable plasma object. Once created and sealed for the first time, the plasma store synchronously reads and releases the object. At this point, the object may be written by the original client and read by others.
- channel.write(val): Use the handle returned by the above to send a value through the channel. The caller waits until all readers of the previous version have released the object, then writes a new version.
- val = channel.begin_read(): Blocks until a value is available. Equivalent to ray.get. This is the beginning of the client's read.
- channel.end_read(): End the client's read, marking the channel as available to write again.
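The blocking behaviour described above can be modelled in a single process. The following is only a toy sketch and not the plasma implementation: `ToyChannel`, its `last_seen_version` argument, and `run_demo` are hypothetical names, and a `threading.Condition` stands in for the shared-memory header that the PR uses for synchronization.

```python
import threading


class ToyChannel:
    """In-process model of a single-writer, multi-reader versioned slot."""

    def __init__(self):
        self._cond = threading.Condition()
        self._value = None
        self._version = 0          # incremented on every write
        self._reads_remaining = 0  # reads still owed on the current version

    def write(self, value, num_readers):
        with self._cond:
            # Block until every reader of the previous version has released it.
            self._cond.wait_for(lambda: self._reads_remaining == 0)
            self._value = value
            self._version += 1
            self._reads_remaining = num_readers
            self._cond.notify_all()

    def begin_read(self, last_seen_version):
        with self._cond:
            # Block until a version newer than the one last seen is available.
            self._cond.wait_for(lambda: self._version > last_seen_version)
            return self._version, self._value

    def end_read(self):
        with self._cond:
            # Release this reader's hold so the writer may proceed.
            self._reads_remaining -= 1
            self._cond.notify_all()


def run_demo():
    """Send three values from the main thread to one reader thread."""
    chan = ToyChannel()
    received = []

    def reader():
        seen = 0
        for _ in range(3):
            seen, value = chan.begin_read(seen)
            received.append(value)
            chan.end_read()

    t = threading.Thread(target=reader)
    t.start()
    for i in range(3):
        chan.write(i, num_readers=1)
    t.join()
    return received
```

In this model the writer cannot overwrite a value until the reader has called `end_read`, so the reader observes every version in order.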

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

python/ray/_private/worker.py

ericl

One question is around readability: since these new code paths are pretty different from the rest of the plasma handling, perhaps we should by convention prefix them with MutableObj or some other naming convention? Experimental is another option.

ericl · 2023-12-01T00:19:01Z

python/ray/_raylet.pyx

@@ -3465,11 +3474,40 @@ cdef class CoreWorker:
                            generator_id=CObjectID.Nil(),
                            owner_address=c_owner_address))

+    def put_serialized_object_to_mutable_plasma_object(self, serialized_object,
+                                                       ObjectRef object_ref,
+                                                       num_readers,


Document this arg as experimental?

I think this one's okay because this method is only called by the new experimental path (renamed though).

python/ray/_raylet.pyx

src/ray/core_worker/store_provider/plasma_store_provider.cc

src/ray/core_worker/store_provider/plasma_store_provider.h

python/ray/_private/worker.py

src/ray/object_manager/plasma/client.cc

rkooo567 · 2023-12-01T02:17:39Z

btw do we not use eventfd anymore? Cannot find code.

rkooo567

Re: API.

Maybe it is better to hide the object ref from the channel, since the returned object ref may have different semantics from a regular object ref (it is mutable)?

channel = ray._create_channel(byte_size=1000)
channel._read()
channel._write()
channel._end()

With the current semantic, if you just ray.get()

python/ray/_private/worker.py

rkooo567 · 2023-12-01T02:24:12Z

python/ray/_private/worker.py

+@PublicAPI
+def _write_channel(value: Any, object_ref: ObjectRef, num_readers: int):
+    worker = global_worker
+    worker.check_connected()


Btw this call is pretty expensive. Maybe we should skip it.

On my MacBook it takes 10us.

I moved it to a constructor but that seems off, it's just a bool check...
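A per-call cost like the one quoted above can be estimated with `timeit`. This is only a measurement sketch: `FakeWorker` and `per_call_microseconds` are hypothetical names, and the stand-in `check_connected` is a plain flag check that does not reproduce whatever the real Ray call does, so the number it prints says nothing about Ray itself.

```python
import timeit


class FakeWorker:
    """Hypothetical stand-in for a worker object with a connection flag."""

    def __init__(self):
        self.connected = True

    def check_connected(self):
        # A plain boolean check, as assumed in the discussion above.
        if not self.connected:
            raise RuntimeError("worker is not connected")


def per_call_microseconds(fn, number=100_000):
    """Return the mean wall-clock cost of one call to fn, in microseconds."""
    total_seconds = timeit.timeit(fn, number=number)
    return total_seconds / number * 1e6


worker = FakeWorker()
cost_us = per_call_microseconds(worker.check_connected)
print(f"check_connected: {cost_us:.3f} us per call")
```

Running the same helper against the real call on the hot path would show whether the check is worth hoisting out of `write`.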

python/ray/_private/worker.py

rkooo567 · 2023-12-01T02:45:07Z

src/ray/object_manager/plasma/client.cc

+      // Increment the count of the number of instances of this object that this
+      // client is using. Cache the reference to the object.
+      IncrementObjectCount(received_object_ids[i], object, true);
+      auto &object_entry = objects_in_use_[received_object_ids[i]];


objects_in_use_[received_object_ids[i]]; will create a new empty obj if received_object_ids[i] doesn't exist. Should we RAY_CHECK if received_object_ids[i] exists here?

IncrementObjectCount will create the object first.

Added some code cleanup here to make it a bit nicer. Previously IncrementObjectCount both added and incremented the count, which is not very clear.
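The concern in this thread is that indexing a C++ map with `operator[]` inserts a default-constructed entry when the key is absent. A Python `defaultdict` behaves analogously, so the pitfall can be sketched as below. This is an analogy only: `make_entry` and the key names are hypothetical, and the real `objects_in_use_` is a C++ container.

```python
from collections import defaultdict


def make_entry():
    # Hypothetical stand-in for a default-constructed in-use entry.
    return {"count": 0}


objects_in_use = defaultdict(make_entry)

# Indexing a missing key silently inserts a default entry.
entry = objects_in_use["missing-id"]
inserted_silently = "missing-id" in objects_in_use

# A lookup that does not insert, followed by an explicit check,
# makes the "must already exist" assumption visible.
found = objects_in_use.get("other-id")
if found is None:
    print("other-id is not in use; nothing was inserted")
```

Separating "add the entry" from "increment the count", as the cleanup describes, makes it clearer which call is expected to create the entry.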

rkooo567 · 2023-12-01T02:46:17Z

src/ray/object_manager/plasma/client.cc

+  std::unique_lock<std::recursive_mutex> guard(client_mutex_);
+
+  auto object_entry = objects_in_use_.find(object_id);
+  if (object_entry == objects_in_use_.end()) {


can you add tests for these cases?

looks like client.cc doesn't have unit tests... we should probably add one in the future. Maybe we can do that in the python test?

I don't think there is a meaningful python test here.

src/ray/object_manager/plasma/client.cc

rkooo567 · 2023-12-01T02:49:59Z

src/ray/object_manager/plasma/client.cc

+
+    // The data and metadata size may have changed, so update here before we
+    // create the Get buffer to return.
+    object_entry->object.data_size = plasma_header->data_size;


we don't handle different data size in this PR yet right?

It's handled.

Hmm looking at our impl, I am not sure if we can handle write in the driver and read in the worker with different size.

Can you try adding a test like this? (it is what I wrote in the dag branch)

""" Verify put in 2 different processes work. """
ray.init(num_cpus=1)

print("Test Write input from driver -> Read & Write from worker -> Read output from driver")
expected_input = b"000000000000"
ref = ray.put(expected_input, max_readers=1)
print(ref)

@ray.remote
class A:
    def f(self, refs, expected_input, output_val):
        ref = refs[0]
        val = ray.get(refs[0])
        assert val == expected_input, val
        ray.release(ref)
        ray.worker.global_worker.put_object(output_val, object_ref=ref, max_readers=1)

a = A.remote()
time.sleep(1)
output_val = b"0"
b = a.f.remote([ref], expected_input, output_val)
ray.get(b)
val = ray.get(ref)
assert output_val == val
ray.release(ref)

print("Test Write input from driver twice -> Read & Write from worker -> Read output from driver")
# Test write twice.
ref = ray.put(b"000000000000", max_readers=1)
assert b"000000000000" == ray.get(ref)
ray.release(ref)
print(ref)
expected_input = b"1"
ray.worker.global_worker.put_object(b"1", object_ref=ref, max_readers=1)
a = A.remote()
time.sleep(1)
expected_output = b"23"
b = a.f.remote([ref], expected_input, expected_output)
ray.get(b)
val = ray.get(ref)
assert expected_output == val
ray.release(ref)

It does work, but the test is a good idea. Added.

src/ray/object_manager/plasma/client.cc

stephanie-wang · 2023-12-01T05:19:10Z

btw do we not use eventfd anymore? Cannot find code.

Yup I got rid of it!

python/ray/experimental/channel.py

src/ray/core_worker/core_worker.cc

ericl · 2023-12-01T19:14:54Z

python/ray/experimental/channel.py

+        )
+
+    def begin_read(self) -> Any:
+        """


Shall we add some sanity checks on the channel? For example, raising errors if trying to operate on the channel again before end_read is called after a begin_read.

Or other unsupported scenarios (such as deserializing the channel to a different node id). These might help development / documenting the current limitations.
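One way to picture the suggested sanity checks is a small state guard that rejects out-of-order calls. This is a sketch under assumptions, not the checks that were added to `channel.py`: `ReadGuard`, its method names, and `misuse_is_rejected` are hypothetical.

```python
class ReadGuard:
    """Hypothetical helper that tracks whether a read is in progress."""

    def __init__(self):
        self._reading = False

    def on_begin_read(self):
        if self._reading:
            raise RuntimeError("begin_read called again before end_read")
        self._reading = True

    def on_end_read(self):
        if not self._reading:
            raise RuntimeError("end_read called without a matching begin_read")
        self._reading = False

    def on_write(self):
        if self._reading:
            raise RuntimeError("write called while a read is still open")


def misuse_is_rejected():
    """Return True if a second begin_read without end_read raises."""
    guard = ReadGuard()
    guard.on_begin_read()
    try:
        guard.on_begin_read()
    except RuntimeError:
        return True
    return False
```

A channel wrapper could call these hooks at the top of `begin_read`, `end_read`, and `write`, so misuse fails with a clear error instead of blocking.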

python/ray/experimental/channel.py

rkooo567

One last question for different data size

ericl · 2023-12-04T19:27:13Z

LGTM

See #41515. This update compiles the new code only on Linux. OSX does not support shared memory semaphores, only named semaphores.

stephanie-wang added 3 commits November 28, 2023 22:09

initial commit

12b977d

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

Add special calls for create and put mutable objects

1c935b9

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

feature flag for shared mem seal, only acquire once per ray.get

c2dbf1f

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

stephanie-wang assigned ericl and rkooo567 Nov 30, 2023

stephanie-wang added 5 commits November 29, 2023 17:35

put-get

6d4aa94

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

rm shared mem seal

bc4f1e9

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

fix num_readers on first version, unit tests pass now

c4a2378

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

mutable object -> channel

e40d3c8

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

micro

b79b7d1

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

ericl reviewed Dec 1, 2023

python/ray/_private/worker.py (outdated, resolved)

ericl reviewed Dec 1, 2023

ericl added the @author-action-required label Dec 1, 2023

stephanie-wang added 2 commits November 30, 2023 17:07

support different metadata

5ea0fe3

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

better error message

cbe257f

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

rkooo567 reviewed Dec 1, 2023

cleanup

a68cefd

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

stephanie-wang removed the @author-action-required label Dec 1, 2023

ericl reviewed Dec 1, 2023

python/ray/experimental/channel.py (resolved)

ericl added the @author-action-required label Dec 1, 2023

Test for errors, better error handling when too many readers

ea57894

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

stephanie-wang removed the @author-action-required label Dec 2, 2023

stephanie-wang added 3 commits December 1, 2023 16:11

remove unneeded

5bbf379

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

java build

1e16e09

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

rename

580b3ad

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

rkooo567 reviewed Dec 2, 2023

test metadata change in remote reader

fe11cc3

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

rkooo567 approved these changes Dec 2, 2023

ericl approved these changes Dec 4, 2023

stephanie-wang added 18 commits December 4, 2023 13:55

build

e11b614

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

fix

99a38c2

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

fix

204bb9b

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

compile?

4703f34

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

build

420bd1c

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

x

4cabbc5

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

fix

b44ef8a

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

copyright

881d5ff

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

test

ef2cfb7

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

Merge remote-tracking branch 'upstream/master' into mutable-objects-2

ca22a63

Only allocate PlasmaObjectHeader if is_mutable=true

dbbb3d6

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

Only call Read/Write Acquire/Release if is_mutable=true

9078776

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

x

2e677c3

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

cpp test

f06b543

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

skip tests on windows

4dfa31e

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

Merge remote-tracking branch 'upstream/master' into mutable-objects-2

126296f

larger CI machine

03f4fbd

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

Merge branch 'master' into mutable-objects-2

3e7dfa2

stephanie-wang merged commit cb5bb4e into ray-project:master Dec 9, 2023
14 of 15 checks passed

stephanie-wang added a commit to stephanie-wang/ray that referenced this pull request Dec 10, 2023

Revert "[core] Support mutable plasma objects (ray-project#41515)"

13a8310

This reverts commit cb5bb4e.

jjyao added a commit to jjyao/ray that referenced this pull request Dec 10, 2023

Revert "[core] Support mutable plasma objects (ray-project#41515)"

79cc15b

This reverts commit cb5bb4e.

jjyao added a commit that referenced this pull request Dec 11, 2023

Revert "[core] Support mutable plasma objects (#41515)" (#41784)

d7926fa

This reverts commit cb5bb4e.

stephanie-wang added a commit to stephanie-wang/ray that referenced this pull request Dec 11, 2023

Revert "Revert "[core] Support mutable plasma objects (ray-project#41515)" (ray-project#41784)"

4413e33

This reverts commit d7926fa.

stephanie-wang mentioned this pull request Dec 11, 2023

Re-merge mutable objects (#41515) #41789

Merged
