ValueError raised in pandas when lazy validating DataFrame with MultiIndexed Columns #589
Description
Describe the bug
A ValueError is raised in pandas when a pandas.DataFrame object with MultiIndexed columns is lazily validated (using the parameter lazy=True) by a pandera.DataFrameSchema object and at least one check fails for the columns.
Running the code below, the following exception is raised:
Traceback (most recent call last):
line 18, in <module>
print(schema.validate(df, lazy=True))
File "Y:\Python39\lib\site-packages\pandera\schemas.py", line 613, in validate
raise errors.SchemaErrors(
File "Y:\Python39\lib\site-packages\pandera\errors.py", line 87, in __init__
error_counts, failure_cases = self._parse_schema_errors(schema_errors)
File "Y:\Python39\lib\site-packages\pandera\errors.py", line 172, in _parse_schema_errors
failure_cases = err.failure_cases.assign(
File "Y:\Python39\lib\site-packages\pandas\core\frame.py", line 3699, in assign
data[k] = com.apply_if_callable(v, data)
File "Y:\Python39\lib\site-packages\pandas\core\frame.py", line 3044, in __setitem__
self._set_item(key, value)
File "Y:\Python39\lib\site-packages\pandas\core\frame.py", line 3120, in _set_item
value = self._sanitize_column(key, value)
File "Y:\Python39\lib\site-packages\pandas\core\frame.py", line 3768, in _sanitize_column
value = sanitize_index(value, self.index)
File "Y:\Python39\lib\site-packages\pandas\core\internals\construction.py", line 747, in sanitize_index
raise ValueError(
ValueError: Length of values (2) does not match length of index (1)
Checking line 172 of errors.py in pandera, i.e.
failure_cases = err.failure_cases.assign(
schema_context=err.schema.__class__.__name__,
check=check_identifier,
check_number=err.check_index,
column=column,
)
It can be seen that the MultiIndexed column named ("foo", "baz"), which is a tuple, is not interpreted as a single value by pandas. Instead of being broadcast across the rows of err.failure_cases, it is treated as a sequence of two values, which causes the ValueError from pandas during the assign method call.
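The mechanism can be reproduced with pandas alone. The snippet below is a minimal sketch, not pandera's code: the one-row failure_cases frame is a hypothetical stand-in for err.failure_cases, and only DataFrame.assign is exercised.

```python
import pandas as pd

# Hypothetical stand-in for err.failure_cases with a single failing row.
failure_cases = pd.DataFrame({"failure_case": ["a"]})
column = ("foo", "baz")

# A plain string is a scalar, so pandas broadcasts it to every row.
ok = failure_cases.assign(column="baz")

# A tuple is list-like, so pandas tries to line up its 2 elements with 1 row.
try:
    failure_cases.assign(column=column)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

With a string column name the assignment broadcasts as intended, which is presumably why the problem only shows up with MultiIndexed columns.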
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandera.
- (optional) I have confirmed this bug exists on the master branch of pandera.
Code Sample, a copy-pastable example
import pandas as pd
from pandera import Column, DataFrameSchema
schema = DataFrameSchema({
("foo", "bar"): Column(int),
("foo", "baz"): Column(int)
})
df = pd.DataFrame({
("foo", "bar"): [1, 2, 3],
("foo", "baz"): ["a", "b", "c"],
})
print(schema.validate(df, lazy=True))
Expected behavior
A pandera.errors.SchemaErrors exception should be raised, with the type mismatch on the column ("foo", "baz") logged and the value ("foo", "baz") recorded in the column named column of the failure cases.
Desktop:
- OS: Windows 10
- Version: Python 3.9.0, with pandera 0.7.0, pandas 1.1.4 installed
Additional context
If we change the code above to
import pandas as pd
from pandera import Column, DataFrameSchema, Check
schema = DataFrameSchema({
("foo", "bar"): Column(int, checks=Check(lambda s: s == 1)),
("foo", "baz"): Column(str, name="b")
})
df = pd.DataFrame({
("foo", "bar"): [1, 2, 3],
("foo", "baz"): ["a", "b", "c"],
})
try:
    schema.validate(df, lazy=True)
except Exception as e:
    print(e.failure_cases)
The output would be
  schema_context column     check  check_number  failure_case  index
0         Column    foo  <lambda>             0             2      1
1         Column    bar  <lambda>             0             3      2
which shows that the column name ("foo", "bar") is incorrectly interpreted as a list-like sequence of values, so its elements are spread across the rows of the column named column, when the assign method of err.failure_cases is called.
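This silent variant can also be shown with pandas alone. As a rough sketch, assuming a hypothetical two-row frame in place of err.failure_cases, a 2-tuple happens to match the number of rows, so no error is raised and each row receives one level of the column name:

```python
import pandas as pd

# Hypothetical stand-in for err.failure_cases with two failing rows.
failure_cases = pd.DataFrame({"failure_case": [2, 3], "index": [1, 2]})

# The tuple has as many elements as there are rows, so pandas accepts it
# and assigns one element per row instead of the whole tuple to each row.
spread = failure_cases.assign(column=("foo", "bar"))

print(spread["column"].tolist())
```

So whether the bug surfaces as an exception or as wrong output depends only on whether the number of failure cases equals the number of levels in the column name.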
Potential Fix
A band-aid fix would be to manually broadcast the value for the column named column before assigning it to err.failure_cases, i.e.
failure_cases = err.failure_cases.assign(
schema_context=err.schema.__class__.__name__,
check=check_identifier,
check_number=err.check_index,
column=[column] * len(err.failure_cases),
)
which seems to have fixed the problem.
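The effect of that change can be checked in isolation. The following is a sketch under assumptions: broadcast_label is a hypothetical helper introduced only for illustration (it is not part of pandera), and the small frames stand in for err.failure_cases.

```python
import pandas as pd


def broadcast_label(label, n_rows):
    """Repeat one column label (a str or a tuple) once per row.

    Hypothetical helper for illustration; not a pandera function.
    """
    return [label] * n_rows


column = ("foo", "baz")

# One failing row: the bare tuple would raise ValueError here.
one_row = pd.DataFrame({"failure_case": ["a"]})
fixed_one = one_row.assign(column=broadcast_label(column, len(one_row)))

# Two failing rows: the bare tuple would be spread across the rows here.
two_rows = pd.DataFrame({"failure_case": ["a", "b"]})
fixed_two = two_rows.assign(column=broadcast_label(column, len(two_rows)))

print(fixed_one["column"].tolist())
print(fixed_two["column"].tolist())
```

Because the list has exactly one entry per row, pandas stores the whole tuple in each cell, and the same helper leaves string column names unchanged in meaning.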