
GH-43684: [Python][Dataset] Python / Cython interface to C++ arrow::dataset::Partitioning::Format #43740

Merged 15 commits into apache:main on Oct 8, 2024

Conversation

@Feiyang472 (Contributor) commented Aug 17, 2024


⚠️ GitHub issue #43684 has been automatically assigned in GitHub to PR creator.

@amoeba (Member) left a comment:

I think this looks good. Returning a tuple for this makes sense to me. Can you look at my notes and also add tests?

Review thread on python/pyarrow/_dataset.pyx (outdated, resolved)
Review thread on python/pyarrow/_dataset.pyx (resolved)
Comment on lines 2526 to 2543
result_value = GetResultValue(result)  # unwrap the C++ Result, raising on error

return frombytes(result_value.directory), frombytes(result_value.filename)  # decode both C++ strings to Python str
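At the Python level, the snippet above amounts to decoding two C++ strings and pairing them. A minimal pure-Python sketch, where `frombytes` is a stand-in for pyarrow's internal helper and the path values are hypothetical:

```python
# Stand-in for pyarrow's internal frombytes helper (an assumption, not the
# real implementation): decode a C++ std::string's bytes as UTF-8.
def frombytes(b: bytes) -> str:
    return b.decode("utf-8")

# The Cython code returns the two path components as a tuple of Python strings;
# these example values are hypothetical.
directory, filename = frombytes(b"year=2024/month=10"), frombytes(b"")
```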
@amoeba (Member):

Could the remainder of this be simplified with CPartitionPathFormat.wrap? I'm not very familiar with Cython, so let me know.

@Feiyang472 (Contributor, Author):

Hi @amoeba, I am fairly new to the Arrow Cython API as well, so I might be wrong. CPartitionPathFormat is just Cython's name (by Arrow convention) for the PartitionPathFormat C++ struct. The wrap and unwrap methods belong to the _Weakrefable class; they convert C++ shared pointers into Cython wrappers and vice versa. C++ shared pointers cannot be handed to Python, and Cython classes cannot be handed to C++, but both can interact in Cython code, hence the wrapping. C structs of built-in types, unlike shared pointers to objects, can be converted automatically by Cython into Python dicts. Here I have chosen to return a tuple instead, because it feels more Pythonic.
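The two return shapes discussed above can be contrasted in plain Python. The values here are hypothetical, chosen only to illustrate the dict that Cython's automatic struct conversion would produce versus the tuple the PR returns:

```python
# Shape Cython would produce by auto-converting the C struct (hypothetical values):
as_dict = {"directory": "year=2024/month=10", "filename": "part-0.parquet"}

# Shape the PR chose instead: a tuple, which unpacks directly at the call site.
directory, filename = as_dict["directory"], as_dict["filename"]
as_tuple = (directory, filename)
```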

@Feiyang472 (Contributor, Author):

If we aim for symmetry with the existing Parse method, we could join the strings, which would probably be less ambiguous.
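The joined alternative mentioned above could look like the following sketch. The component values are hypothetical, and `posixpath.join` is used only to illustrate combining the two parts into a single path string:

```python
import posixpath

# Hypothetical path components; joining them yields a single string,
# the shape the existing Parse method accepts.
directory, filename = "year=2024/month=10", "part-0.parquet"
joined = posixpath.join(directory, filename)
```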


@amoeba amoeba changed the title GH-43684: [Integration, Parquet, Python] Python / Cython interface to C++ arrow::dataset::Partitioning::Format GH-43684: [Python][Dataset] Python / Cython interface to C++ arrow::dataset::Partitioning::Format Aug 20, 2024
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Aug 20, 2024
@Feiyang472 (Contributor, Author) commented:

I will add more docs and tests over the weekend.

@Feiyang472 Feiyang472 marked this pull request as ready for review August 24, 2024 16:46
@Feiyang472 Feiyang472 force-pushed the user/feiyang/expose_cython_api branch from dd18fde to 79bc16c on August 24, 2024 17:00
@Feiyang472 Feiyang472 requested a review from amoeba August 28, 2024 06:11
@Feiyang472 Feiyang472 force-pushed the user/feiyang/expose_cython_api branch 3 times, most recently from a712916 to 79bc16c on September 2, 2024 16:09

amoeba commented Sep 5, 2024

Hi @Feiyang472, this is looking good. Thanks for continuing to work on it. I left two notes; also, can you fix the numpydoc issue in https://github.com/apache/arrow/actions/runs/10670333956/job/29702635856?pr=43740#step:6:9090?

Review thread on python/pyarrow/_dataset.pyx (outdated, resolved)
Review thread on python/pyarrow/tests/parquet/test_dataset.py (outdated, resolved)
Feiyang472 and others added 3 commits September 5, 2024 20:08

amoeba commented Sep 6, 2024

@jorisvandenbossche and/or @pitrou: The changes here look good to me, would either of you like to review?

@Feiyang472 (Contributor, Author) commented:

@amoeba Hi, it seems we have no further comments. Is this PR ready to be approved?

@amoeba (Member) left a comment:

Hey @Feiyang472, sorry for dropping this.

I re-reviewed this since a second reviewer hasn't taken a look and came up with two minor changes. Once you make those and CI passes, I'll merge.

@@ -1239,6 +1241,71 @@ def test_dataset_partitioning(tempdir):
assert result.column_names == ["a", "year", "month", "day"]


@pytest.mark.parametrize(
@amoeba (Member):

I think we should put this in the parent test_dataset.py (pyarrow/tests/test_dataset.py) since this isn't specific to ParquetDataset. Sorry I missed this earlier.

Review thread on python/pyarrow/tests/parquet/test_dataset.py (outdated, resolved)
@Feiyang472 Feiyang472 requested a review from amoeba October 7, 2024 18:02
@amoeba amoeba merged commit 44fb439 into apache:main Oct 8, 2024
14 checks passed
@amoeba amoeba removed the awaiting committer review Awaiting committer review label Oct 8, 2024
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Oct 8, 2024

amoeba commented Oct 8, 2024

Thanks so much @Feiyang472. I've merged this and slotted it in for the v18 release.


After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 44fb439.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.


raulcd commented Oct 10, 2024

@github-actions crossbow submit example-python-minimal-*


raulcd commented Oct 10, 2024

This PR seems to have broken tests when we run with minimal pyarrow (i.e. no dataset support), because @pytest.mark.parametrize fails at collection time if the dataset module is not enabled:

 _ ERROR collecting miniconda-for-arrow/envs/pyarrow-3.10/lib/python3.10/site-packages/pyarrow/tests/test_dataset.py _
miniconda-for-arrow/envs/pyarrow-3.10/lib/python3.10/site-packages/pyarrow/tests/test_dataset.py:740: in <module>
    (ds.HivePartitioning, (r"foo=A/bar=ant%20bee", ""), ("", "")),
E   AttributeError: 'NoneType' object has no attribute 'HivePartitioning'

I'll try to fix this for 18.0.0.
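One way to avoid this kind of collection-time AttributeError is to parametrize over string keys and resolve the class inside the test body, after a skip check. A minimal sketch of that pattern (an assumption for illustration, not the actual fix merged for 18.0.0):

```python
# Stands in for pyarrow.dataset when the dataset extension is not built;
# in the real test module it would be the guarded `import pyarrow.dataset as ds`.
ds = None

# Thunks defer attribute access until a test body runs, so collection succeeds
# even when `ds` is None.
PARTITIONING_FLAVORS = {
    "hive": lambda: ds.HivePartitioning,
    "directory": lambda: ds.DirectoryPartitioning,
}

def resolve(flavor):
    # In a real test this branch would call pytest.skip("dataset not enabled").
    if ds is None:
        return None
    return PARTITIONING_FLAVORS[flavor]()
```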


Revision: 1c8364b

Submitted crossbow builds: ursacomputing/crossbow @ actions-aee53b4c69

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions


amoeba commented Oct 10, 2024

Thanks @raulcd.
