Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Dask Integration: vineyard as the data source for dask #409

Merged
merged 1 commit into from
Aug 5, 2021

Conversation

andydiwenzhu
Copy link
Collaborator

@andydiwenzhu andydiwenzhu commented Aug 4, 2021

Signed-off-by: Diwen Zhu diwen.zdw@alibaba-inc.com

What do these changes do?

Add two resolvers for dask: GlobalTensor --> dask.Array, GlobalDataFrame --> dask.dataframe

Related issue number

Part of #412.

@codecov
Copy link

codecov bot commented Aug 4, 2021

Codecov Report

Merging #409 (ad4859e) into main (22055c2) will increase coverage by 0.04%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #409      +/-   ##
==========================================
+ Coverage   69.64%   69.68%   +0.04%     
==========================================
  Files          63       63              
  Lines        5452     5453       +1     
==========================================
+ Hits         3797     3800       +3     
+ Misses       1655     1653       -2     
Impacted Files Coverage Δ
src/server/async/socket_server.cc 67.81% <0.00%> (+0.50%) ⬆️

Copy link
Member

@sighingnow sighingnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have a package called vineyard-dask? under vineyard/contrib/dask.

It is quite strange to put it under ml/distributed....

def get_partition(socket, obj_id):
client = vineyard.connect(socket)
np_value = client.get(obj_id)
return da.from_array(np_value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a copy? (I'm not sure.

@sighingnow
Copy link
Member

Seems the resolver assumes both the tensor and dataframe are chunked along the x ax, it not alway be true, as the tensor/dataframe may come from mars.

We haven't put enough efforts in other ML integrations, I know it is not trivial, but it does need to be fixed. As the dask already natively supports split chunks along both the x and y axis, and we have already record enough information in the metadata, could we do right thing here for dask? Otherwise it would be depreacated and replaced by a new implementation someday.

@andydiwenzhu andydiwenzhu reopened this Aug 5, 2021
@sighingnow
Copy link
Member

Andy @andydiwenzhu I didn't mean closing the issue. We could

  1. put those things into a vineyard-dask package
  2. leave a check to raise an NotImplementedError exception when the source is not chunked as what we want

Then we could get this pull request landed.

@andydiwenzhu
Copy link
Collaborator Author

Andy @andydiwenzhu I didn't mean closing the issue. We could

  1. put those things into a vineyard-dask package
  2. leave a check to raise an NotImplementedError exception when the source is not chunked as what we want

Then we could get this pull request landed.

OK. That was a bad click, don't worry :)

@sighingnow sighingnow mentioned this pull request Aug 5, 2021
6 tasks
@sighingnow sighingnow changed the title Add Dask Integration Add Dask Integration: vineyard as the data source for dask Aug 5, 2021
Signed-off-by: Diwen Zhu <diwen.zdw@alibaba-inc.com>
@sighingnow sighingnow merged commit 5b16173 into v6d-io:main Aug 5, 2021
sighingnow added a commit to sighingnow/v6d that referenced this pull request Aug 6, 2021
Fixes for v6d-io#409.

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
sighingnow added a commit that referenced this pull request Aug 8, 2021
…ge. (#417)

* Add missing __init__.py to dask module otherwise it is an empty package, fixes for #409.
* Revisit the CI script.
* Set LD_LIBRARY_PATH for running tests.

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Sijie-L pushed a commit to Sijie-L/v6d that referenced this pull request Aug 14, 2021
Add two resolvers for dask:

- GlobalTensor --> dask.Array
- GlobalDataFrame --> dask.dataframe

Signed-off-by: Diwen Zhu <diwen.zdw@alibaba-inc.com>
Signed-off-by: Sijie <lsjrosej@gmail.com>
Sijie-L pushed a commit to Sijie-L/v6d that referenced this pull request Aug 14, 2021
…ge. (v6d-io#417)

* Add missing __init__.py to dask module otherwise it is an empty package, fixes for v6d-io#409.
* Revisit the CI script.
* Set LD_LIBRARY_PATH for running tests.

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Sijie <lsjrosej@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants