[Ray Autoscaler] Ray autoscaler does not scale up effectively and fast  #45373

Open
@raghumdani

Description

What happened + What you expected to happen

We run Ray jobs in production. Right after upgrading Ray from 2.3.0 to 2.20.0, we saw a significant increase in job latency. Upon investigation, we found that the autoscaler wasn't spinning up new nodes even when the majority of tasks were queued waiting to be scheduled, which increased overall latency. We schedule only by memory, and these jobs weren't using all of the memory they requested. However, we still expect the autoscaler to spin up new nodes to serve the demand. This issue does not occur with the SPREAD scheduling strategy (not sure why!).
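For anyone who wants to compare against the SPREAD behavior mentioned above, here is a minimal sketch of an options provider that keeps the resource request from the reproduction script and only adds Ray's built-in "SPREAD" scheduling strategy. The name spread_options_provider is hypothetical, and the sketch assumes invoke_parallel forwards the returned dictionary to the task's .options(...) call, which is worth verifying against the installed deltacat version.

```python
def spread_options_provider(*args, **kwargs):
    """Hypothetical variant of the script's options_provider.

    Same resource request, plus scheduling_strategy="SPREAD".
    Assumes the dict is passed through to ray_task.options(**opts).
    """
    return {
        "num_cpus": 0.01,
        "memory": 23387571131,
        "resources": {"max_tasks": 0.001},
        # "SPREAD" asks Ray to spread tasks across available nodes.
        "scheduling_strategy": "SPREAD",
    }


# The provider ignores its arguments, so any index/item pair works here.
options = spread_options_provider(0, 120)
print(options)
```

Swapping this provider into the invoke_parallel call should be the only change needed to compare the two strategies on the same workload, if the assumption above holds.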

Versions / Dependencies

Ray: 2.20.0
OS: Ubuntu 20.04
Python: 3.10

Reproduction script

You might need to install deltacat by running pip3 install deltacat. Each worker has 31 CPUs, 220GB of memory, and 10000 units of the custom max_tasks resource.

from deltacat.utils.ray_utils.concurrency import (
    invoke_parallel,
)
import ray
import time

ray.init(address='auto')

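@ray.remote
def i_will_return_after_t_secs(t):
    import numpy as np

    # Hold references so the arrays stay allocated while the task sleeps.
    all_arrs = []

    for _ in range(10):
        all_arrs.append(np.random.rand(100000000))  # 0.8 GB each (1e8 float64 values)

    # About 8 GB overall, well below the ~23 GB each task requests.

    time.sleep(t)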

def options_provider(*args, **kwargs):
    return {"num_cpus": 0.01, "memory": 23387571131, "resources": {'max_tasks': 0.001} }

items_to_run = [120 for _ in range(14000)]

start = time.monotonic()
print(f"Now starting: {start}")

pending = invoke_parallel(items_to_run, i_will_return_after_t_secs, max_parallelism=4096, options_provider=options_provider)

end_invoke = time.monotonic()

print(f"Now ended invoke: {end_invoke - start}")

ray.get(pending)

end = time.monotonic()
print(f"Complete: {end - start}")

# Peak node count: 460 with Ray 2.3 vs. 364 with Ray 2.20.
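
As a sanity check on the node counts above, here is a back-of-the-envelope sketch of how many nodes the requested resources imply. It uses only the numbers from this report and assumes "220GB" means 220 * 10^9 bytes and that the full amount is schedulable; the memory Ray actually advertises per node may be lower, so treat the result as an estimate, not as what the autoscaler computes internally.

```python
import math
from fractions import Fraction

# Per-node capacity as described in the report (decimal GB is an assumption).
node_capacity = {
    "CPU": Fraction(31),
    "memory": Fraction(220 * 10**9),
    "max_tasks": Fraction(10_000),
}

# Per-task request returned by options_provider in the script.
task_request = {
    "CPU": Fraction("0.01"),
    "memory": Fraction(23_387_571_131),
    "max_tasks": Fraction("0.001"),
}

max_parallelism = 4096  # value passed to invoke_parallel

# How many tasks fit on one node if each resource were the only constraint.
tasks_per_node_by_resource = {
    name: math.floor(node_capacity[name] / task_request[name])
    for name in task_request
}

# The tightest resource decides how many tasks a node can actually hold.
binding_resource = min(tasks_per_node_by_resource, key=tasks_per_node_by_resource.get)
tasks_per_node = tasks_per_node_by_resource[binding_resource]

# Nodes needed to run max_parallelism tasks at once.
nodes_needed = math.ceil(Fraction(max_parallelism, tasks_per_node))

print(tasks_per_node_by_resource)
print(binding_resource, tasks_per_node, nodes_needed)
```

Under these assumptions memory is the binding resource at 9 tasks per node, so running 4096 tasks concurrently needs 456 nodes. That is close to the 460 nodes observed with Ray 2.3, and it suggests the 364 nodes seen with Ray 2.20 fall short of the requested demand.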

Issue Severity

None

Metadata

Labels

P2: Important issue, but not time-critical
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
core-autoscaler: autoscaler related issues
performance