Frictionless primary key (#597)
* Unique keyword arg (#580)

* add copy button to docs (#448)

* Add missing inplace arg to SchemaModel's validate (#450)

* link documentation to github (#449)

Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

* intermediate commit for review by @cosmicBboy

* link documentation to github (#449)

Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

* intermediate commit for review by @cosmicBboy

* WIP

* fix test errors, re-factor allow_duplicates handling

* fix io tests

* fix docs, remove _allow_duplicates private var

* update unique type signature in strategies

* completing tests for setters and lazy evaluation of unique kw

* small fix for the linting errors

* support dataframe-level uniqueness in strategies

* add docs, fix error formatting, add multiindex support

Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com>
Co-authored-by: tfwillems <tfwillems@users.noreply.github.com>
Co-authored-by: fkroll8 <13244820+fkrull8@users.noreply.github.com>
Co-authored-by: fkroll8 <kent.troutman@tuta.io>

* Add support for timezone-aware datetime strategies (#595)

* add support for Any annotation in schema model (#594)

* add support for Any annotation in schema model

the motivation behind this feature is to support column annotations
that can have any type, to support use cases like the one described
in #592, where
custom checks can be applied to any column except for ones that
are explicitly defined in the schema model class attributes

* update pylint, fix lint

* Docs/scaling - Bring Pandera to Spark and Dask (#588)

* scaling.rst

* edited conf

* finished first pass

* removing FugueWorkflow

* Update index.rst

* Update docs/source/scaling.rst

Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>

* add support for timezone-aware datetime strategies

* fix le/ge strategies with datetime

* fix mypy errors

Co-authored-by: Niels Bantilan <niels.bantilan@gmail.com>
Co-authored-by: Kevin Kho <kdykho@gmail.com>

* support frictionless primary keys with multiple fields

Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com>
Co-authored-by: tfwillems <tfwillems@users.noreply.github.com>
Co-authored-by: fkroll8 <13244820+fkrull8@users.noreply.github.com>
Co-authored-by: fkroll8 <kent.troutman@tuta.io>
Co-authored-by: Kevin Kho <kdykho@gmail.com>
6 people authored Sep 9, 2021
1 parent 84ea3c2 commit 86a0e19
Showing 16 changed files with 552 additions and 133 deletions.
32 changes: 32 additions & 0 deletions docs/source/dataframe_schemas.rst
@@ -467,6 +467,38 @@ To validate the order of the Dataframe columns, specify ``ordered=True``:

.. _index:

Validating the joint uniqueness of columns
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In some cases you might want to ensure that the combination of values across a group of columns is unique:

.. testcode:: joint_column_uniqueness

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
columns={col: pa.Column(int) for col in ["a", "b", "c"]},
unique=["a", "c"],
)
df = pd.DataFrame.from_records([
{"a": 1, "b": 2, "c": 3},
{"a": 1, "b": 2, "c": 3},
])
schema.validate(df)

.. testoutput:: joint_column_uniqueness

Traceback (most recent call last):
...
SchemaError: columns '('a', 'c')' not unique:
column index failure_case
0 a 0 1
1 a 1 1
2 c 0 3
3 c 1 3
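
To make "joint uniqueness" concrete, here is a minimal plain-Python sketch of the idea. It is not pandera's implementation, and the helper name `joint_duplicates` is hypothetical. A row fails only when its whole tuple of values over the listed columns repeats, so a repeated value in a single column is not enough.

```python
from collections import Counter


def joint_duplicates(records, subset):
    """Return indices of records whose values over ``subset`` repeat.

    Hypothetical helper for illustration only; pandera performs the
    equivalent check on a pandas DataFrame.
    """
    # Build one tuple per record from the columns that must be jointly unique.
    keys = [tuple(rec[col] for col in subset) for rec in records]
    counts = Counter(keys)
    # A record fails when its key tuple occurs more than once.
    return [i for i, key in enumerate(keys) if counts[key] > 1]


records = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 5, "c": 4},  # "a" repeats, but the pair (a, c) does not
]
failing = joint_duplicates(records, ["a", "c"])
print(failing)  # [0, 1]
```

The third record shares its ``a`` value with the others yet passes, because only the pair ``(a, c)`` is required to be unique.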


Index Validation
----------------

13 changes: 7 additions & 6 deletions docs/source/schema_inference.rst
@@ -107,7 +107,7 @@ You can also write your schema to a python script with :func:`~pandera.io.to_scr
Check.less_than_or_equal_to(max_value=20.0),
],
nullable=False,
allow_duplicates=True,
unique=False,
coerce=False,
required=True,
regex=False,
@@ -116,7 +116,7 @@ You can also write your schema to a python script with :func:`~pandera.io.to_scr
dtype=pandera.engines.numpy_engine.Object,
checks=None,
nullable=False,
allow_duplicates=True,
unique=False,
coerce=False,
required=True,
regex=False,
@@ -132,7 +132,7 @@ You can also write your schema to a python script with :func:`~pandera.io.to_scr
),
],
nullable=False,
allow_duplicates=True,
unique=False,
coerce=False,
required=True,
regex=False,
@@ -185,15 +185,15 @@ is a convenience method for this functionality.
checks:
greater_than_or_equal_to: 5.0
less_than_or_equal_to: 20.0
allow_duplicates: true
unique: false
coerce: false
required: true
regex: false
column2:
dtype: object
nullable: false
checks: null
allow_duplicates: true
unique: false
coerce: false
required: true
regex: false
@@ -203,7 +203,7 @@ is a convenience method for this functionality.
checks:
greater_than_or_equal_to: '2010-01-01 00:00:00'
less_than_or_equal_to: '2012-01-01 00:00:00'
allow_duplicates: true
unique: false
coerce: false
required: true
regex: false
@@ -218,6 +218,7 @@ is a convenience method for this functionality.
coerce: false
coerce: true
strict: false
unique: null

You can edit this yaml file by specifying column names under the ``column``
key. The respective values map onto keyword arguments in the
8 changes: 7 additions & 1 deletion pandera/engines/pandas_engine.py
@@ -157,7 +157,13 @@ def numpy_dtype(cls, pandera_dtype: dtypes.DataType) -> np.dtype:
alias = "bool"
elif alias.startswith("string"):
alias = "str"
return np.dtype(alias)

try:
return np.dtype(alias)
except TypeError as err:
raise TypeError(
f"Data type '{pandera_dtype}' cannot be cast to a numpy dtype."
) from err


###############################################################################
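
The hunk above wraps the dtype lookup so that a failure surfaces with a clearer message while the original error stays attached through `raise ... from err`. The sketch below shows that chaining pattern in isolation. The function `to_builtin_type` and its lookup table are hypothetical stand-ins for `np.dtype(alias)`, chosen so the example runs without numpy.

```python
def to_builtin_type(alias):
    """Resolve a type alias to a builtin type.

    Hypothetical stand-in for ``np.dtype(alias)``; only the error-handling
    pattern mirrors the diff.
    """
    known = {"int": int, "float": float, "bool": bool, "str": str}
    try:
        return known[alias]
    except KeyError as err:
        # Re-raise with a descriptive message and keep the original
        # exception reachable as ``__cause__``.
        raise TypeError(
            f"Data type '{alias}' cannot be cast to a builtin type."
        ) from err


try:
    to_builtin_type("category")
except TypeError as exc:
    message = str(exc)
    cause = exc.__cause__

print(message)  # Data type 'category' cannot be cast to a builtin type.
print(type(cause).__name__)  # KeyError
```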
47 changes: 31 additions & 16 deletions pandera/io.py
@@ -108,7 +108,7 @@ def _serialize_component_stats(component_stats):
key: component_stats.get(key)
for key in [
"name",
"allow_duplicates",
"unique",
"coerce",
"required",
"regex",
@@ -148,6 +148,7 @@ def _serialize_schema(dataframe_schema):
"index": index,
"coerce": dataframe_schema.coerce,
"strict": dataframe_schema.strict,
"unique": dataframe_schema.unique,
}


@@ -195,6 +196,9 @@ def _deserialize_component_stats(serialized_component_stats):
for key in [
"name",
"nullable",
"unique",
# deserialize allow_duplicates property for backwards
# compatibility. Remove this for 0.8.0 release
"allow_duplicates",
"coerce",
"required",
@@ -255,6 +259,7 @@ def _deserialize_schema(serialized_schema):
index=index,
coerce=serialized_schema.get("coerce", False),
strict=serialized_schema.get("strict", False),
unique=serialized_schema.get("unique", None),
)
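
Reading the new key with `.get("unique", None)` means a schema file serialized before this change, which has no dataframe-level `unique` entry, still loads. A small sketch of that behaviour with plain dictionaries follows. The helper `read_schema_options` is hypothetical and covers only the three options visible in this hunk.

```python
def read_schema_options(serialized_schema):
    """Pull dataframe-level options out of a serialized schema dict.

    Hypothetical helper that mirrors the ``.get`` defaults in the hunk.
    """
    return {
        "coerce": serialized_schema.get("coerce", False),
        "strict": serialized_schema.get("strict", False),
        # Older files have no "unique" key, so fall back to None.
        "unique": serialized_schema.get("unique", None),
    }


old_file = {"coerce": True, "strict": False}
new_file = {"coerce": True, "strict": False, "unique": ["a", "c"]}

print(read_schema_options(old_file)["unique"])  # None
print(read_schema_options(new_file)["unique"])  # ['a', 'c']
```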


@@ -310,7 +315,7 @@ def _write_yaml(obj, stream):
dtype={dtype},
checks={checks},
nullable={nullable},
allow_duplicates={allow_duplicates},
unique={unique},
coerce={coerce},
required={required},
regex={regex},
@@ -397,7 +402,7 @@ def to_script(dataframe_schema, path_or_buf=None):
),
checks=_format_checks(properties["checks"]),
nullable=properties["nullable"],
allow_duplicates=properties["allow_duplicates"],
unique=properties["unique"],
coerce=properties["coerce"],
required=properties["required"],
regex=properties["regex"],
@@ -418,6 +423,7 @@ def to_script(dataframe_schema, path_or_buf=None):
coerce=dataframe_schema.coerce,
strict=dataframe_schema.strict,
name=dataframe_schema.name.__repr__(),
unique=dataframe_schema.unique,
).strip()

# add pandas imports to handle datetime and timedelta.
@@ -445,15 +451,15 @@ class FrictionlessFieldParser:
formats, titles, descriptions).
:param field: a field object from a frictionless schema.
:param primary_keys: the primary keys from a frictionless schema. These are used
to ensure primary key fields are treated properly - no duplicates,
no missing values etc.
:param primary_keys: the primary keys from a frictionless schema. These
are used to ensure primary key fields are treated properly - no
duplicates, no missing values etc.
"""

def __init__(self, field, primary_keys) -> None:
self.constraints = field.constraints or {}
self.primary_keys = primary_keys
self.name = field.name
self.is_a_primary_key = self.name in primary_keys
self.type = field.get("type", "string")

@property
@@ -544,18 +550,22 @@ def nullable(self) -> bool:
"""Determine whether this field can contain missing values.
If a field is a primary key, this will return ``False``."""
if self.is_a_primary_key:
if self.name in self.primary_keys:
return False
return not self.constraints.get("required", False)

@property
def allow_duplicates(self) -> bool:
def unique(self) -> bool:
"""Determine whether this field can contain duplicate values.
If a field is a primary key, this will return ``False``."""
if self.is_a_primary_key:
return False
return not self.constraints.get("unique", False)
If the field is the schema's only primary key, this will return ``True``.
"""

# only set the column-level uniqueness property if `primary_keys`
# contains exactly one field name; multi-field primary keys are
# handled by dataframe-level uniqueness.
if len(self.primary_keys) == 1 and self.name in self.primary_keys:
return True
return self.constraints.get("unique", False)

@property
def coerce(self) -> bool:
@@ -587,10 +597,10 @@ def regex(self) -> bool:
def to_pandera_column(self) -> Dict:
"""Export this field to a column spec dictionary."""
return {
"allow_duplicates": self.allow_duplicates,
"checks": self.checks,
"coerce": self.coerce,
"nullable": self.nullable,
"unique": self.unique,
"dtype": self.dtype,
"required": self.required,
"name": self.name,
@@ -645,8 +655,8 @@ def from_frictionless_schema(
[<Check in_range: in_range(10, 99)>]
>>> schema.columns["column_1"].required
True
>>> schema.columns["column_1"].allow_duplicates
False
>>> schema.columns["column_1"].unique
True
>>> schema.columns["column_2"].checks
[<Check str_length: str_length(None, 10)>, <Check str_matches: str_matches(re.compile('^\\\\S+$'))>]
"""
@@ -664,5 +674,10 @@ def from_frictionless_schema(
"checks": None,
"coerce": True,
"strict": True,
# only set dataframe-level uniqueness if the frictionless primary
# key property specifies more than one field
"unique": (
None if len(schema.primary_key) == 1 else list(schema.primary_key)
),
}
return _deserialize_schema(assembled_schema)
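
Taken together, the `unique` property and the `"unique"` entry of the assembled schema split a frictionless primary key two ways. A single-field key becomes a column-level constraint, and a multi-field key becomes a dataframe-level one. The sketch below restates the two conditions from the diff as standalone functions. The names `column_unique` and `dataframe_unique` are hypothetical, and plain lists and dicts stand in for the frictionless objects.

```python
def column_unique(name, primary_keys, constraints):
    """Column-level flag: True when ``name`` is the sole primary key."""
    if len(primary_keys) == 1 and name in primary_keys:
        return True
    # Otherwise defer to the field's own "unique" constraint, if any.
    return constraints.get("unique", False)


def dataframe_unique(primary_keys):
    """Dataframe-level setting: the key's fields unless the key has one field."""
    return None if len(primary_keys) == 1 else list(primary_keys)


# Single-field key: enforced on the column, nothing at the dataframe level.
print(column_unique("id", ["id"], {}))  # True
print(dataframe_unique(["id"]))  # None

# Composite key: columns are not individually unique, the pair is.
print(column_unique("id", ["id", "date"], {}))  # False
print(dataframe_unique(["id", "date"]))  # ['id', 'date']
```

Note that, following the condition in the diff as written, an empty primary key yields an empty list rather than ``None`` at the dataframe level.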
4 changes: 4 additions & 0 deletions pandera/model.py
@@ -82,6 +82,9 @@ class BaseConfig: # pylint:disable=R0903
name: Optional[str] = None #: name of schema
coerce: bool = False #: coerce types of all schema components

#: make sure certain column combinations are unique
unique: Optional[Union[str, List[str]]] = None

#: make sure all specified columns are in the validated dataframe -
#: if ``"filter"``, removes columns not specified in the schema
strict: Union[bool, str] = False
@@ -218,6 +221,7 @@ def to_schema(cls) -> DataFrameSchema:
strict=cls.__config__.strict,
name=cls.__config__.name,
ordered=cls.__config__.ordered,
unique=cls.__config__.unique,
)
if cls not in MODEL_CACHE:
MODEL_CACHE[cls] = cls.__schema__ # type: ignore
15 changes: 12 additions & 3 deletions pandera/model_components.py
@@ -48,6 +48,7 @@ class FieldInfo:
__slots__ = (
"checks",
"nullable",
"unique",
"allow_duplicates",
"coerce",
"regex",
@@ -61,7 +62,8 @@ def __init__(
self,
checks: Optional[_CheckList] = None,
nullable: bool = False,
allow_duplicates: bool = True,
unique: bool = False,
allow_duplicates: Optional[bool] = None,
coerce: bool = False,
regex: bool = False,
alias: Any = None,
@@ -70,6 +72,7 @@ def __init__(
) -> None:
self.checks = _to_checklist(checks)
self.nullable = nullable
self.unique = unique
self.allow_duplicates = allow_duplicates
self.coerce = coerce
self.regex = regex
@@ -118,6 +121,7 @@ def to_column(
pandas_dtype,
Column,
nullable=self.nullable,
unique=self.unique,
allow_duplicates=self.allow_duplicates,
coerce=self.coerce,
regex=self.regex,
@@ -137,6 +141,7 @@ def to_index(
pandas_dtype,
Index,
nullable=self.nullable,
unique=self.unique,
allow_duplicates=self.allow_duplicates,
coerce=self.coerce,
name=name,
@@ -161,7 +166,8 @@ def Field(
str_matches: Optional[str] = None,
str_startswith: Optional[str] = None,
nullable: bool = False,
allow_duplicates: bool = True,
unique: bool = False,
allow_duplicates: Optional[bool] = None,
coerce: bool = False,
regex: bool = False,
ignore_na: bool = True,
@@ -183,6 +189,7 @@ def Field(
to the built-in `~pandera.checks.Check` methods.
:param nullable: whether or not the column/index is nullable.
:param unique: whether column values should be unique
:param allow_duplicates: whether or not to accept duplicate values.
:param coerce: coerces the data type if ``True``.
:param regex: whether or not the field name or alias is a regex pattern.
@@ -194,7 +201,8 @@ def Field(
:param check_name: Whether to check the name of the column/index during
validation. `None` is the default behavior, which translates to `True`
for columns and multi-index, and to `False` for a single index.
:param dtype_kwargs: The parameters to be forwarded to the type of the field.
:param dtype_kwargs: The parameters to be forwarded to the type of the
field.
:param kwargs: Specify custom checks that have been registered with the
:class:`~pandera.extensions.register_check_method` decorator.
"""
@@ -229,6 +237,7 @@ def Field(
return FieldInfo(
checks=checks or None,
nullable=nullable,
unique=unique,
allow_duplicates=allow_duplicates,
coerce=coerce,
regex=regex,
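
`Field` now accepts both `unique` (default `False`) and the older `allow_duplicates` (default changed from `True` to `None`), and the two flags are logical inverses. The code that reconciles them is not visible in the portion of the diff shown here, so the following is only a sketch of one plausible way to do it. The function `resolve_unique`, the warning text, and the rule that an explicit `allow_duplicates` takes precedence are all assumptions.

```python
import warnings


def resolve_unique(unique=False, allow_duplicates=None):
    """Reconcile the new ``unique`` flag with the legacy ``allow_duplicates``.

    Hypothetical helper; the precedence rule and warning text are assumptions.
    """
    if allow_duplicates is not None:
        warnings.warn(
            "`allow_duplicates` is deprecated; use `unique` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # The legacy flag is the logical inverse of the new one.
        return not allow_duplicates
    return unique


print(resolve_unique())  # False
print(resolve_unique(unique=True))  # True
print(resolve_unique(allow_duplicates=False))  # True
```

Using `None` as the default for the legacy argument makes it possible to tell "not passed" apart from an explicit `True` or `False`, which is consistent with the signature change in the hunks above.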