[Ray Autoscaler] Ray autoscaler does not scale up effectively and fast #45373
Open
Description
What happened + What you expected to happen
We run Ray jobs in production. Right after upgrading Ray version from 2.3.0 to 2.20.0, we saw a significant increase in job latency. Upon investigation, we found that autoscaler wasn't spinning up new nodes even when majority of the tasks were in queue waiting to be scheduled which resulted in latency increase overall. We only schedule by memory and these jobs weren't using full memory they requested for. However, we expect autoscaler to still spin up new nodes to serve the demand. This issues does not occur with SPREAD scheduling strategy however (not sure why!).
Versions / Dependencies
Ray: 2.20.0
OS: Ubuntu 20.04
Python: 3.10
Reproduction script
You might need to install deltacat by pip3 install deltacat
. Each worker has 31 CPU, 220GB memory and 10000 max_tasks.
from deltacat.utils.ray_utils.concurrency import (
invoke_parallel,
)
import ray
import time
ray.init(address='auto')
@ray.remote
def i_will_return_after_t_secs(t):
import numpy as np
all_arrs = []
for _ in range(10):
all_arrs.append(np.random.rand(100000000)) # 0.8GB
# 8 GB overall
import time
time.sleep(t)
def options_provider(*args, **kwargs):
return {"num_cpus": 0.01, "memory": 23387571131, "resources": {'max_tasks': 0.001} }
items_to_run = [120 for _ in range(14000)]
start = time.monotonic()
print(f"Now starting: {start}")
pending = invoke_parallel(items_to_run, i_will_return_after_t_secs, max_parallelism=4096, options_provider=options_provider)
end_invoke = time.monotonic()
print(f"Now ended invoke: {end_invoke - start}")
ray.get(pending)
end = time.monotonic()
print(f"Complete: {end - start}")
# 2.3 used 460 nodes while 2.20 used 364 nodes at max.
Issue Severity
None