
GH-43684: [Python][Dataset] Python / Cython interface to C++ arrow::dataset::Partitioning::Format #43740

Merged 15 commits into apache:main on Oct 8, 2024

Conversation

@Feiyang472 (Contributor) commented Aug 17, 2024


⚠️ GitHub issue #43684 has been automatically assigned in GitHub to PR creator.

@amoeba (Member) left a comment:

I think this looks good. Returning a tuple for this makes sense to me. Can you look at my notes and also add tests?

Review thread on python/pyarrow/_dataset.pyx (outdated, resolved)
Review thread on python/pyarrow/_dataset.pyx (resolved)
Comment on lines 2526 to 2543
result_value = GetResultValue(result)  # unwrap the C++ Result, raising on error

return frombytes(result_value.directory), frombytes(result_value.filename)  # decode both C++ strings to Python str
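At the Python level, the snippet above amounts to decoding two C++ strings and pairing them. A minimal pure-Python sketch, where `frombytes` is a stand-in for pyarrow's internal helper and the path values are hypothetical:

```python
# Stand-in for pyarrow's internal frombytes helper (an assumption, not the
# real implementation): decode a C++ std::string's bytes as UTF-8.
def frombytes(b: bytes) -> str:
    return b.decode("utf-8")

# The Cython code returns the two path components as a tuple of Python strings;
# these example values are hypothetical.
directory, filename = frombytes(b"year=2024/month=10"), frombytes(b"")
```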
@amoeba (Member):

Could the remainder of this be simplified with CPartitionPathFormat.wrap? I'm not very familiar with Cython, so let me know.

@Feiyang472 (Contributor, Author):

Hi @amoeba, I am fairly new to the Arrow Cython API as well, so I might be wrong. CPartitionPathFormat is just Cython's name (by Arrow convention) for the PartitionPathFormat C++ struct. The wrap and unwrap methods belong to the _Weakrefable class; they convert C++ shared pointers into Cython wrappers and vice versa. C++ shared pointers cannot be handed to Python, and Cython classes cannot be handed to C++, but both can interact in Cython code, hence the wrapping. C structs of built-in types, unlike shared pointers to objects, can be converted automatically by Cython into Python dicts. Here I have chosen to return a tuple instead, because it feels more Pythonic.
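The two return shapes discussed above can be contrasted in plain Python. The values here are hypothetical, chosen only to illustrate the dict that Cython's automatic struct conversion would produce versus the tuple the PR returns:

```python
# Shape Cython would produce by auto-converting the C struct (hypothetical values):
as_dict = {"directory": "year=2024/month=10", "filename": "part-0.parquet"}

# Shape the PR chose instead: a tuple, which unpacks directly at the call site.
directory, filename = as_dict["directory"], as_dict["filename"]
as_tuple = (directory, filename)
```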

@Feiyang472 (Contributor, Author):

If we aim for symmetry with the existing Parse method, we could join the strings, which would probably be less ambiguous.
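The joined alternative mentioned above could look like the following sketch. The component values are hypothetical, and `posixpath.join` is used only to illustrate combining the two parts into a single path string:

```python
import posixpath

# Hypothetical path components; joining them yields a single string,
# the shape the existing Parse method accepts.
directory, filename = "year=2024/month=10", "part-0.parquet"
joined = posixpath.join(directory, filename)
```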


@amoeba amoeba changed the title GH-43684: [Integration, Parquet, Python] Python / Cython interface to C++ arrow::dataset::Partitioning::Format GH-43684: [Python][Dataset] Python / Cython interface to C++ arrow::dataset::Partitioning::Format Aug 20, 2024
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Aug 20, 2024
@Feiyang472 (Contributor, Author) commented:

I will add more docs and tests over the weekend.

@Feiyang472 Feiyang472 marked this pull request as ready for review August 24, 2024 16:46
@Feiyang472 Feiyang472 force-pushed the user/feiyang/expose_cython_api branch from dd18fde to 79bc16c on August 24, 2024 17:00
@Feiyang472 Feiyang472 requested a review from amoeba August 28, 2024 06:11
@Feiyang472 Feiyang472 force-pushed the user/feiyang/expose_cython_api branch 3 times, most recently from a712916 to 79bc16c on September 2, 2024 16:09

amoeba commented Sep 5, 2024

Hi @Feiyang472, this is looking good. Thanks for continuing to work on it. I left two notes; also, can you fix the numpydoc issue in https://github.com/apache/arrow/actions/runs/10670333956/job/29702635856?pr=43740#step:6:9090?

Review thread on python/pyarrow/_dataset.pyx (outdated, resolved)
Review thread on python/pyarrow/tests/parquet/test_dataset.py (outdated, resolved)
Feiyang472 and others added 3 commits September 5, 2024 20:08

amoeba commented Sep 6, 2024

@jorisvandenbossche and/or @pitrou: The changes here look good to me, would either of you like to review?

@Feiyang472 (Contributor, Author) commented:

@amoeba Hi, it seems we have no further comments. Is this PR ready to be approved?

@amoeba (Member) left a comment:

Hey @Feiyang472, sorry for dropping this.

I re-reviewed this since a second reviewer hasn't taken a look and came up with two minor changes. Once you make those and CI passes, I'll merge.

@@ -1239,6 +1241,71 @@ def test_dataset_partitioning(tempdir):
assert result.column_names == ["a", "year", "month", "day"]


@pytest.mark.parametrize(
@amoeba (Member):

I think we should put this in the parent test_dataset.py (pyarrow/tests/test_dataset.py) since this isn't specific to ParquetDataset. Sorry I missed this earlier.

Review thread on python/pyarrow/tests/parquet/test_dataset.py (outdated, resolved)
@Feiyang472 Feiyang472 requested a review from amoeba October 7, 2024 18:02
@amoeba amoeba merged commit 44fb439 into apache:main Oct 8, 2024
14 checks passed
@amoeba amoeba removed the awaiting committer review Awaiting committer review label Oct 8, 2024
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Oct 8, 2024

amoeba commented Oct 8, 2024

Thanks so much @Feiyang472. I've merged this and slotted it in for the v18 release.


After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 44fb439.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.


raulcd commented Oct 10, 2024

@github-actions crossbow submit example-python-minimal-*


raulcd commented Oct 10, 2024

This PR seems to have broken tests when we run with minimal pyarrow (i.e. no dataset support), because @pytest.mark.parametrize fails at collection time if the dataset module is not enabled:

 _ ERROR collecting miniconda-for-arrow/envs/pyarrow-3.10/lib/python3.10/site-packages/pyarrow/tests/test_dataset.py _
miniconda-for-arrow/envs/pyarrow-3.10/lib/python3.10/site-packages/pyarrow/tests/test_dataset.py:740: in <module>
    (ds.HivePartitioning, (r"foo=A/bar=ant%20bee", ""), ("", "")),
E   AttributeError: 'NoneType' object has no attribute 'HivePartitioning'

I'll try to fix this for 18.0.0.
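One way to avoid this kind of collection-time AttributeError is to parametrize over string keys and resolve the class inside the test body, after a skip check. A minimal sketch of that pattern (an assumption for illustration, not the actual fix merged for 18.0.0):

```python
# Stands in for pyarrow.dataset when the dataset extension is not built;
# in the real test module it would be the guarded `import pyarrow.dataset as ds`.
ds = None

# Thunks defer attribute access until a test body runs, so collection succeeds
# even when `ds` is None.
PARTITIONING_FLAVORS = {
    "hive": lambda: ds.HivePartitioning,
    "directory": lambda: ds.DirectoryPartitioning,
}

def resolve(flavor):
    # In a real test this branch would call pytest.skip("dataset not enabled").
    if ds is None:
        return None
    return PARTITIONING_FLAVORS[flavor]()
```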


Revision: 1c8364b

Submitted crossbow builds: ursacomputing/crossbow @ actions-aee53b4c69

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions


amoeba commented Oct 10, 2024

Thanks @raulcd.
