[Data] Ray runtime_env works with ray core but doesn't work with ray.data when using working_dir #49356

Open
pradipneupane opened this issue Dec 19, 2024 · 4 comments
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues triage Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

pradipneupane commented Dec 19, 2024

What happened + What you expected to happen

With runtime_env set to my working directory in ray.init(), the same code works with Ray Core but fails with ray.data.

Versions / Dependencies

Ray 2.36.1

Reproduction script

I am using ray.data. I have many Ray object refs and create a Ray dataset from them:
dataset = ray.data.from_pandas_refs(ray_objs)

After that, I iterate over the dataset:

for bs in dataset.iter_batches(prefetch_batches=0, batch_size=1000, batch_format="pandas", drop_last=False):
    pass

During iteration, Ray Data tries to compute metrics for the dataset, and I keep getting this error:

ModuleNotFoundError: No module named 'MY_WORKING_DIR'
(Actor pid=xxx) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::_StatsActor.update_metrics()

At least one of the input arguments for this task could not be computed:
(Actor pid=xxx) ray.exceptions.RaySystemError: System error: No module named 'MY_WORKING_DIR'
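
The traceback above is consistent with an object being deserialized in a process that cannot import the project package. The following is a minimal, Ray-free sketch of that mechanism using only the standard-library pickle module. The module name `my_working_dir_pkg` and the `Record` class are hypothetical stand-ins, and whether this is the actual cause of the `_StatsActor.update_metrics()` failure is an assumption that this thread does not confirm.

```python
import pickle
import sys
import types

MODULE_NAME = "my_working_dir_pkg"  # hypothetical stand-in for the project package


def make_payload() -> bytes:
    """Pickle an instance of a class that lives in MODULE_NAME."""
    module = types.ModuleType(MODULE_NAME)
    # Define a class whose __module__ is MODULE_NAME.
    exec(
        "class Record:\n"
        "    def __init__(self, value):\n"
        "        self.value = value\n",
        module.__dict__,
    )
    sys.modules[MODULE_NAME] = module  # the "driver" side can import the module
    # pickle stores the class by reference (module name + class name), not by value.
    return pickle.dumps(module.Record(42))


def load_without_module(payload: bytes) -> str:
    """Unpickle in an environment where MODULE_NAME is not importable."""
    sys.modules.pop(MODULE_NAME, None)  # simulate a process that lacks the package
    try:
        pickle.loads(payload)
    except ModuleNotFoundError as exc:
        return str(exc)
    return "loaded"


error_message = load_without_module(make_payload())
print(error_message)
```

The receiving side fails at load time because the payload only names the module, so the module has to be importable wherever the object is deserialized.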

I set working_dir in the Ray runtime_env during ray.init() like this:

ray.init(address=MY_CLUSTER_ADDRESS, ignore_reinit_error=True,
         runtime_env={"working_dir": "."})
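
For reference, the keys of a runtime_env dict are strings. Below is a small sketch of assembling one before connecting. The helper names `build_runtime_env` and `connect` and the example exclude pattern are hypothetical; `working_dir` and `excludes` are the runtime_env fields that appear in this thread. Ray is imported lazily, so the dict helper runs without Ray installed.

```python
from typing import Any, Dict, List, Optional


def build_runtime_env(working_dir: str = ".",
                      excludes: Optional[List[str]] = None) -> Dict[str, Any]:
    """Assemble a runtime_env dict; note that the keys are string literals."""
    env: Dict[str, Any] = {"working_dir": working_dir}
    if excludes:
        env["excludes"] = list(excludes)
    return env


def connect(address: str) -> None:
    """Connect to an existing cluster. Requires Ray; address is a placeholder."""
    import ray  # lazy import so build_runtime_env works without Ray installed

    ray.init(address=address, ignore_reinit_error=True,
             runtime_env=build_runtime_env())


# Hypothetical exclude pattern, shown only to illustrate the dict shape.
runtime_env = build_runtime_env(excludes=["*.parquet"])
print(runtime_env)
```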

With Ray Core I get no "No module named" error for my project folder, but with ray.data I keep getting the exception shown above.

Is there a workaround to disable only the ray.data metrics?

Issue Severity

High: It blocks me from completing my task.

@pradipneupane pradipneupane added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Dec 19, 2024
@pradipneupane pradipneupane changed the title [<Ray component: Data>] [Data] Dec 19, 2024
@pradipneupane pradipneupane changed the title [Data] [Data] Ray runtime_env works with ray core but doesn't work with ray.data when using working_dir Dec 19, 2024
@jcotant1 jcotant1 added the data Ray Data-related issues label Dec 19, 2024
Contributor

Jay-ju commented Dec 23, 2024

I ran a sample and no problems occurred.

import ray
from typing import Callable


class Op12(Callable):
    # Callable class used as the map UDF; returns one large string per record.
    def __call__(self, record) -> dict:
        return {"value": "op1" * 1000000}


ray.init(ignore_reinit_error=True, runtime_env={
    "working_dir": "./", "excludes": ["CC*", "part*", "PMC*"]})
ray.data.from_items(range(1, 10000)).map(Op12, concurrency=10).take_all()

@pradipneupane
Author

@Jay-ju you are using a local Ray address; you need to test this against a remote Ray cluster address.

Contributor

Jay-ju commented Jan 2, 2025

> @Jay-ju you are using local ray address; you need to test this with ray cluster address

I am using Ray cluster mode, so I don't quite understand what you mean.

Author

pradipneupane commented Jan 2, 2025

@Jay-ju Sorry if I was unclear. In your code, address is None in ray.init(), so I assumed it was running a local cluster on the same machine as the code. In that case the working_dir runtime env is not needed, because the local cluster already has access to those folders. The issue occurred when the Ray cluster was set up on one machine and the code ran on another. I have a distributed cluster set up in Azure; passing the same working dir as the runtime env works for Ray Core, but it wasn't working for ray.data.

It is possible that ray.data tries to calculate metrics for a dataset built with from_pandas_refs but does not for the code in your sample. If it is not too much trouble, could you please try this code:

import pandas as pd
import ray

data = {'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e']}
df = pd.DataFrame(data)

# Put the DataFrame into the object store twice to get two object refs.
obj1 = ray.put(df)
obj2 = ray.put(df)
ray_objs = [obj1, obj2]

dataset = ray.data.from_pandas_refs(ray_objs)

for bs in dataset.iter_batches(prefetch_batches=0, batch_size=1000, batch_format="pandas", drop_last=False):
    pass
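
The local versus remote distinction described in this comment can be illustrated without Ray. A module imports only if its directory is on the import path of the process doing the import. A local cluster shares the driver's filesystem, while a remote worker sees the project files only if they were shipped to it. This is a rough sketch; the module name `my_project_helpers` and the helper `can_import` are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from typing import Optional


def can_import(module_name: str, search_dir: Optional[str]) -> bool:
    """Return True if module_name imports with search_dir (if any) on sys.path."""
    sys.modules.pop(module_name, None)
    importlib.invalidate_caches()
    if search_dir is not None:
        sys.path.insert(0, search_dir)
    try:
        importlib.import_module(module_name)
        return True
    except ModuleNotFoundError:
        return False
    finally:
        # Restore the import state so repeated calls are independent.
        if search_dir is not None:
            sys.path.remove(search_dir)
        sys.modules.pop(module_name, None)


with tempfile.TemporaryDirectory() as project_dir:
    Path(project_dir, "my_project_helpers.py").write_text("VALUE = 1\n")
    # "Local" process: the project directory is visible on the import path.
    local_result = can_import("my_project_helpers", project_dir)
    # "Remote" process: same module name, but the directory was never shipped.
    remote_result = can_import("my_project_helpers", None)

print(local_result, remote_result)
```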
