Core: Ray cluster nodes underutilization during autoscaling #47355
Comments
Thanks - this is a really good bug report @aljeshishe ... out of curiosity, are you able to work around this by spinning up a fresh Ray cluster (or, put another way, is there any reason not to)? @jjyao on the attempt to repro...
@aljeshishe it's probably best to set up a call to do a live debug when the cluster is in this bad state.
I can help coordinate some time here. @aljeshishe, can you Slack me and Jiajun on Ray Slack and we can go from there?
Usually I kill the Docker container with the head and then do ray up (to avoid recreating the head instance).
This is super interesting. AFAIK we should only download this once and reuse it, but based on what you are saying, it seems like this causes issues. I am curious: if you use "pip" instead of py_modules (runtime_env={"pip": ["pandas"]}), do you have the same issue? Or, as a workaround, you can probably just add pandas to your Docker images.
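A minimal sketch of the pip-based alternative suggested here, assuming a Ray Client connection (the address and the task are placeholders, not from the original report):

```python
import ray

# Sketch: declare pandas as a pip requirement instead of shipping it with py_modules.
ray.init(
    address="ray://<head-node>:10001",   # placeholder cluster address
    runtime_env={"pip": ["pandas"]},     # suggested alternative to py_modules=[pandas]
)

@ray.remote
def uses_pandas() -> str:
    import pandas as pd
    return pd.__version__

print(ray.get(uses_pandas.remote()))
ray.shutdown()
```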
@rkooo567 @jjyao I finally found out how to reproduce the issue. To degrade the cluster:
run this script 200 times to create 200 environments (a sketch of such a script is given after this comment):
Now let's check the degraded state:
As you can see in the script output, the large number of environments degrades the cluster, so scheduling almost stops for some time.
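The scripts referenced above did not survive into this thread; a rough sketch of what the environment-creating script could look like, assuming each run connects through Ray Client and ships pandas via py_modules (address and task body are placeholders):

```python
import ray
import pandas

# Run this script repeatedly (e.g. 200 times from a shell loop) against the same
# cluster; each run brings its own py_modules-based runtime environment.
ray.init(
    address="ray://<head-node>:10001",        # placeholder head-node address
    runtime_env={"py_modules": [pandas]},
)

@ray.remote(num_cpus=1)
def noop(i: int) -> int:
    return i

# A small batch of tasks per run, just enough to exercise scheduling.
print(sum(ray.get([noop.remote(i) for i in range(100)])))
ray.shutdown()
```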
@jjyao have you managed to reproduce it?
@kevin85421 did you get a chance to look into this deeper?
Looks vaguely similar to #45373
Hello, is this going to be fixed?
I noticed in the repro script that the job is submitted from outside the cluster using Ray Client (the service running at port 10001 on the head node). I've heard that this submission method is not well supported (though perhaps this has changed, I'm not sure). I'm curious: does the issue persist if you run the job directly on the head node, or if you use Ray Job Submission (the other way of submitting jobs, using the dashboard port, 8265 by default)?
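For comparison, a sketch of the Ray Job Submission path suggested in this comment, assuming the dashboard is reachable on its default port (address and entrypoint are placeholders):

```python
from ray.job_submission import JobSubmissionClient

# Submit through the dashboard (port 8265 by default) instead of Ray Client (port 10001).
client = JobSubmissionClient("http://<head-node>:8265")   # placeholder dashboard address
job_id = client.submit_job(
    entrypoint="python my_load_script.py",                # hypothetical entrypoint
    runtime_env={"pip": ["pandas"]},                      # dependencies for the job
)
print(client.get_job_status(job_id))
```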
What happened + What you expected to happen
We have a Ray cluster on EC2 with autoscaling (min: 2, max: 10 nodes).
Our load is ~3000 tasks.
After we start our load, we expect cluster utilization to rise until all tasks are processed, but that is not the case.
Running tasks:
Nodes:
It looks like there is a scheduling slowdown during autoscaling even on already running nodes.
Our observations:
Here is how utilization looks on a fresh cluster:
The slowdown develops gradually. We are trying to understand the exact reason, but all we can say is that it begins to manifest itself after ~1 day or ~500 jobs on the cluster.
It is probably related to the runtime environment: if I don't pass
runtime_env = dict(py_modules=[pandas])
to ray.init, there is no issue (a sketch is included under "Reproduction script" below). Nothing looks wrong in the logs except lots of messages during autoscaling.
We would really appreciate it if anyone could provide suggestions. I'm glad to provide further information if needed.
Versions / Dependencies
I use ray==2.9.2. It's less pronounced on ray==2.34, but the issue still exists.
Reproduction script
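The original script was not carried over here; a minimal sketch of a reproduction along the lines described above, assuming the job is submitted from outside the cluster via Ray Client with py_modules=[pandas] (address, task body, and counts are illustrative):

```python
import time

import ray
import pandas

# Connect from outside the cluster via Ray Client and ship pandas via py_modules,
# which is the configuration that shows the slowdown in this report.
ray.init(
    address="ray://<head-node>:10001",            # placeholder head-node address
    runtime_env=dict(py_modules=[pandas]),
)

@ray.remote(num_cpus=1)
def task(i: int) -> int:
    # Stand-in for the real workload; just burns a little time.
    time.sleep(1)
    return i

# Roughly the ~3000-task load described above; watch how many CPUs the
# autoscaled nodes actually use while these are being scheduled.
results = ray.get([task.remote(i) for i in range(3000)])
print(len(results))
ray.shutdown()
```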
Issue Severity
High: it blocks me from completing my task, because after several days the nodes become dramatically underutilized: only 10-50 CPUs out of a ~700-CPU cluster are used.