[Core] Repeated calls to create an Actor with get_or_create=True and max_restarts != 0 can fail #48856

Closed
jfaust-fy opened this issue Nov 22, 2024 · 4 comments · Fixed by #49480
Labels: bug, core, P1

Comments

jfaust-fy commented Nov 22, 2024

What happened + What you expected to happen

In upgrading from Ray 2.24 to 2.39, I am running into:

ValueError: Failed to look up actor with name 'actor'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.

occasionally when calling Actor.options(get_if_exists=True, max_restarts=-1, ...).remote(). This did not happen in 2.24, and it only happens when max_restarts is non-zero.

I've managed to reproduce it, but the repro script does not exactly match our code, so I'm not 100% sure that fixing the repro script will also fix our code. I have not yet figured out a workaround for our actual code.

Versions / Dependencies

Python 3.11.9
Ray 2.37, 2.38 and 2.39 exhibit this behavior. 2.24 and 2.36 do not. This seems to have been introduced in Ray 2.37.

Reproduction script

import ray


@ray.remote
class Actor:
    pass


ray.init()

for _ in range(0, 1000):
    Actor.options(
        name="actor",
        get_if_exists=True,
        max_restarts=-1,
    ).remote()

Issue Severity

High: It blocks me from completing my task.

@jfaust-fy added the bug and triage labels Nov 22, 2024
@jfaust-fy
Author

If I change the repro script to:

import ray


@ray.remote
class Actor:
    pass


ray.init()

actors = []
for _ in range(0, 1000):
    actors.append(Actor.options(
        name="actor",
        get_if_exists=True,
        max_restarts=1,
    ).remote())

It succeeds.

@kevin85421 self-assigned this Nov 22, 2024
@jcotant1 added the core label Nov 22, 2024
@jfaust-fy changed the title from "[Core] Repeated calls to create an Actor with get_or_create=True and max_restarts != 1 can fail" to "[Core] Repeated calls to create an Actor with get_or_create=True and max_restarts != 0 can fail" Nov 22, 2024
@jfaust-fy
Author

I mis-titled this initially; the failure occurs when max_restarts != 0.

@kevin85421
Member

What is your expected behavior for the script below? Should only the first Actor.options call create the actor, while all the following 999 calls to Actor.options retrieve the existing actor?

The expected behavior is somewhat difficult to define because there is no reference to the actor. As a result, the reference count will be 0, leading to the actor being destroyed.

import ray

@ray.remote
class Actor:
    pass

ray.init()

for _ in range(0, 1000):
    Actor.options(
        name="actor",
        get_if_exists=True,
        max_restarts=-1,
    ).remote()

@jfaust-fy
Author

jfaust-fy commented Nov 22, 2024

@kevin85421 As I said, this is not exactly our use case. In our case we do hold a reference to the Actor, but the reference goes away and is later re-created. We're seeing this in tests: a test passes when run on its own but fails when run right after a previous test. The previous test creates the Actor and uses it, then all references disappear, and we get this exception when the subsequent test tries to create the Actor.

So I don't really have a good answer about what this test case should do. I guess I'd expect each iteration to either create or retrieve the Actor, depending on how asynchronous actor destruction is, and not to throw an exception.
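
One possible stopgap for the "create or retrieve, but don't throw" expectation is to retry the call when it raises the ValueError quoted at the top of the issue. The sketch below is an assumption rather than a workaround confirmed in this thread. `call_with_retry` and its parameters are hypothetical names, and it is kept Ray-agnostic by taking the creation call as a plain callable.

```python
import time


def call_with_retry(create, attempts=5, delay_s=0.1):
    """Call `create()` and retry on ValueError, re-raising the last failure.

    `create` is any zero-argument callable; `attempts` and `delay_s`
    are illustrative defaults, not values taken from Ray.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return create()
        except ValueError:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            # Linear backoff gives a dying actor time to finish tearing down.
            time.sleep(delay_s * attempt)
```

With the repro script this would presumably be used as `call_with_retry(lambda: Actor.options(name="actor", get_if_exists=True, max_restarts=-1).remote())`. It only masks the race and does not address whatever changed in Ray 2.37, so it is a stopgap at best.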

@jjyao added the P1 label and removed the triage label Nov 25, 2024