Docs/scaling - Bring Pandera to Spark and Dask #588

Merged (6 commits) on Sep 1, 2021

Conversation

@kvnkho kvnkho commented Aug 18, 2021

Hi @cosmicBboy,

Here is a first pass of the scaling.rst we talked about. It shows how to scale pandera code to Spark and Dask using Fugue. Thanks for agreeing to collaborate; feedback is appreciated.

I don't know if you want to take it from here and find a place for this, but @goodwanghan and I can also make changes based on your feedback. Just let us know.

@kvnkho kvnkho changed the title from "Docs/scaling" to "Docs/scaling - Bring Pandera to Spark and Dask" Aug 18, 2021
@kvnkho kvnkho changed the base branch from master to dev August 18, 2021 05:16

kvnkho commented Aug 18, 2021

Sorry, this PR became a mess because I changed the branch to dev. Will re-open.

@kvnkho kvnkho closed this Aug 18, 2021
@kvnkho kvnkho reopened this Aug 21, 2021
@kvnkho kvnkho changed the base branch from dev to master August 21, 2021 19:23
@cosmicBboy cosmicBboy changed the base branch from master to dev August 26, 2021 13:06

codecov bot commented Aug 26, 2021

Codecov Report

Merging #588 (5d33ef5) into dev (2779015) will increase coverage by 0.18%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##              dev     #588      +/-   ##
==========================================
+ Coverage   98.55%   98.73%   +0.18%     
==========================================
  Files          26       29       +3     
  Lines        3257     3327      +70     
==========================================
+ Hits         3210     3285      +75     
+ Misses         47       42       -5     
| Impacted Files | Coverage Δ | |
|---|---|---|
| pandera/checks.py | 98.50% <0.00%> (-0.06%) | ⬇️ |
| pandera/io.py | 100.00% <0.00%> (ø) | |
| pandera/errors.py | 100.00% <0.00%> (ø) | |
| pandera/error_formatters.py | 95.45% <0.00%> (ø) | |
| pandera/engines/numpy_engine.py | 100.00% <0.00%> (ø) | |
| pandera/engines/type_aliases.py | 100.00% <0.00%> (ø) | |
| pandera/check_utils.py | 100.00% <0.00%> (ø) | |
| pandera/engines/utils.py | 100.00% <0.00%> (ø) | |
| pandera/engines/pandas_engine.py | 99.29% <0.00%> (+<0.01%) | ⬆️ |
| pandera/engines/engine.py | 98.82% <0.00%> (+5.41%) | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2779015...5d33ef5.

docs/source/scaling.rst (review thread: outdated, resolved)
@cosmicBboy cosmicBboy changed the base branch from dev to master September 1, 2021 18:05
@cosmicBboy cosmicBboy merged commit 84ea3c2 into unionai-oss:master Sep 1, 2021
cosmicBboy added a commit that referenced this pull request Sep 5, 2021
* add support for Any annotation in schema model (#594)

* add support for Any annotation in schema model

the motivation behind this feature is to support column annotations
that can have any type, to support use cases like the one described
in #592, where
custom checks can be applied to any column except for ones that
are explicitly defined in the schema model class attributes

* update pylint, fix lint

* Docs/scaling - Bring Pandera to Spark and Dask (#588)

* scaling.rst

* edited conf

* finished first pass

* removing FugueWorkflow

* Update index.rst

* Update docs/source/scaling.rst

Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

* add support for timezone-aware datetime strategies

* fix le/ge strategies with datetime

* fix mypy errors

Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>
Co-authored-by: Kevin Kho <kdykho@gmail.com>
cosmicBboy added a commit that referenced this pull request Sep 9, 2021
* Unique keyword arg (#580)

* add copy button to docs (#448)

* Add missing inplace arg to SchemaModel's validate (#450)

* link documentation to github (#449)

Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

* intermediate commit for review by @cosmicBboy

* WIP

* fix test errors, re-factor allow_duplicates handling

* fix io tests

* fix docs, remove _allow_duplicates private var

* update unique type signature in strategies

* completing tests for setters and lazy evaluation of unique kw

* small fix for the linting errors

* support dataframe-level uniqueness in strategies

* add docs, fix error formatting, add multiindex support

Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com>
Co-authored-by: tfwillems <tfwillems@users.noreply.github.com>
Co-authored-by: fkroll8 <13244820+fkrull8@users.noreply.github.com>
Co-authored-by: fkroll8 <kent.troutman@tuta.io>

* Add support for timezone-aware datetime strategies (#595)

* add support for Any annotation in schema model (#594)

* add support for Any annotation in schema model

the motivation behind this feature is to support column annotations
that can have any type, to support use cases like the one described
in #592, where
custom checks can be applied to any column except for ones that
are explicitly defined in the schema model class attributes

* update pylint, fix lint

* Docs/scaling - Bring Pandera to Spark and Dask (#588)

* scaling.rst

* edited conf

* finished first pass

* removing FugueWorkflow

* Update index.rst

* Update docs/source/scaling.rst

Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

* add support for timezone-aware datetime strategies

* fix le/ge strategies with datetime

* fix mypy errors

Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>
Co-authored-by: Kevin Kho <kdykho@gmail.com>

* support frictionless primary keys with multiple fields

Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com>
Co-authored-by: tfwillems <tfwillems@users.noreply.github.com>
Co-authored-by: fkroll8 <13244820+fkrull8@users.noreply.github.com>
Co-authored-by: fkroll8 <kent.troutman@tuta.io>
Co-authored-by: Kevin Kho <kdykho@gmail.com>
cosmicBboy added a commit that referenced this pull request Sep 10, 2021
* Unique keyword arg (#580)

* add copy button to docs (#448)

* Add missing inplace arg to SchemaModel's validate (#450)

* link documentation to github (#449)

Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

* intermediate commit for review by @cosmicBboy

* WIP

* fix test errors, re-factor allow_duplicates handling

* fix io tests

* fix docs, remove _allow_duplicates private var

* update unique type signature in strategies

* completing tests for setters and lazy evaluation of unique kw

* small fix for the linting errors

* support dataframe-level uniqueness in strategies

* add docs, fix error formatting, add multiindex support

Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com>
Co-authored-by: tfwillems <tfwillems@users.noreply.github.com>
Co-authored-by: fkroll8 <13244820+fkrull8@users.noreply.github.com>
Co-authored-by: fkroll8 <kent.troutman@tuta.io>

* Add support for timezone-aware datetime strategies (#595)

* add support for Any annotation in schema model (#594)

* add support for Any annotation in schema model

the motivation behind this feature is to support column annotations
that can have any type, to support use cases like the one described
in #592, where
custom checks can be applied to any column except for ones that
are explicitly defined in the schema model class attributes

* update pylint, fix lint

* Docs/scaling - Bring Pandera to Spark and Dask (#588)

* scaling.rst

* edited conf

* finished first pass

* removing FugueWorkflow

* Update index.rst

* Update docs/source/scaling.rst

Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

* add support for timezone-aware datetime strategies

* fix le/ge strategies with datetime

* fix mypy errors

Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>
Co-authored-by: Kevin Kho <kdykho@gmail.com>

* schemas with multi-index columns correctly report errors (#600)

fixes #589

* strategies module supports undefined checks in regex columns (#599)

* Add support for empty data type annotation in SchemaModel (#602)

* remove artifacts of py3.6 support

* add support for empty data type annotation in SchemaModel

* fix frictionless version in dev dependencies

* fix setuptools version instead of frictionless

* fix setuptools pinning

* remove frictionless from core pandera deps (#609)

* support frictionless primary keys with multiple fields (#608)

* fix validation of check raising error without message (#613)

* docs/requirements.txt pin setuptools (#611)

* bump version 0.7.1

Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com>
Co-authored-by: tfwillems <tfwillems@users.noreply.github.com>
Co-authored-by: fkroll8 <13244820+fkrull8@users.noreply.github.com>
Co-authored-by: fkroll8 <kent.troutman@tuta.io>
Co-authored-by: Kevin Kho <kdykho@gmail.com>