-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-43684: [Python][Dataset] Python / Cython interface to C++ arrow::dataset::Partitioning::Format #43740
GH-43684: [Python][Dataset] Python / Cython interface to C++ arrow::dataset::Partitioning::Format #43740
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good. Returning a tuple for this makes sense to me. Can you look at my notes and also add tests?
python/pyarrow/_dataset.pyx
Outdated
result_value = GetResultValue(result) | ||
|
||
return frombytes(result_value.directory), frombytes(result_value.filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the remainder of this be simplified with CPartitionPathFormat.wrap
? I'm not very familiar with Cython so let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @amoeba , I am fairly new to arrow cython api as well, so I might be wrong. CPartitionPathFormat
is just cython's name (by arrow convention) for the PartitionPathFormat
C struct. The wrap
and unwrap
methods are methods of the _WeakRefable
class, which convert cpp shared pointers to cython wrappers or vice versa. Cpp shared pointers cannot be given to python, and cython classes cannot be given to C, but both can interact in cython code, hence the wrapping. C structs of builtin types, unlike shared pointers to objects, can be automatically converted by cython into python dicts. Here I have chosen to return a tuple instead, because it feels more pythonic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we aim for symmetry with the existing Parse
method, we could join the strings, probably less ambiguous
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add more docs and test over the weekend. |
Co-authored-by: Bryce Mecum <petridish@gmail.com>
dd18fde
to
79bc16c
Compare
a712916
to
79bc16c
Compare
Hi @Feiyang472, this is looking good. Thanks for continuing to work on it. I left two notes and can you fix the numpydoc issue in https://github.com/apache/arrow/actions/runs/10670333956/job/29702635856?pr=43740#step:6:9090? |
Co-authored-by: Bryce Mecum <petridish@gmail.com>
@jorisvandenbossche and/or @pitrou: The changes here look good to me, would either of you like to review? |
@amoeba Hi, seems like we got no further comments, is this MR ready to be approved? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @Feiyang472, sorry for dropping this.
I re-reviewed this since a second reviewer hasn't taken a look and came up with two minor changes. Once you make those and CI passes, I'll merge.
@@ -1239,6 +1241,71 @@ def test_dataset_partitioning(tempdir): | |||
assert result.column_names == ["a", "year", "month", "day"] | |||
|
|||
|
|||
@pytest.mark.parametrize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should put this in the parent test_dataset.py
(pyarrow/tests/test_dataset.py) since this isn't specific to ParquetDataset. Sorry I missed this earlier.
Co-authored-by: Bryce Mecum <petridish@gmail.com>
Thanks so much @Feiyang472. I've merged this and slotted it in for the v18 release. |
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 44fb439. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
@github-actions crossbow submit example-python-minimal-* |
This PR seems to have broken tests if we run with minimal pyarrow (i.e no dataset) as pytest
I'll try to fix for 18.0.0 |
Revision: 1c8364b Submitted crossbow builds: ursacomputing/crossbow @ actions-aee53b4c69
|
Thanks @raulcd. |
See
#43684