ValueError raised in pandas when lazy validating DataFrame with MultiIndexed Columns #589

Closed
@peter1456

Description

Describe the bug

A ValueError is raised in pandas when a pandas.DataFrame object with MultiIndexed Columns is lazily validated (using the parameter lazy=True) by a pandera.DataFrameSchema object, and there is at least one failed check for the columns.

Running the code below, the following exception is raised:

Traceback (most recent call last):
  line 18, in <module>
    print(schema.validate(df, lazy=True))
  File "Y:\Python39\lib\site-packages\pandera\schemas.py", line 613, in validate
    raise errors.SchemaErrors(
  File "Y:\Python39\lib\site-packages\pandera\errors.py", line 87, in __init__
    error_counts, failure_cases = self._parse_schema_errors(schema_errors)
  File "Y:\Python39\lib\site-packages\pandera\errors.py", line 172, in _parse_schema_errors
    failure_cases = err.failure_cases.assign(
  File "Y:\Python39\lib\site-packages\pandas\core\frame.py", line 3699, in assign
    data[k] = com.apply_if_callable(v, data)
  File "Y:\Python39\lib\site-packages\pandas\core\frame.py", line 3044, in __setitem__
    self._set_item(key, value)
  File "Y:\Python39\lib\site-packages\pandas\core\frame.py", line 3120, in _set_item
    value = self._sanitize_column(key, value)
  File "Y:\Python39\lib\site-packages\pandas\core\frame.py", line 3768, in _sanitize_column
    value = sanitize_index(value, self.index)
  File "Y:\Python39\lib\site-packages\pandas\core\internals\construction.py", line 747, in sanitize_index
    raise ValueError(
ValueError: Length of values (2) does not match length of index (1)

Line 172 of errors.py in pandera is the start of this call:

failure_cases = err.failure_cases.assign(
                    schema_context=err.schema.__class__.__name__,
                    check=check_identifier,
                    check_number=err.check_index,
                    column=column,
                )

The MultiIndex column key ("foo", "baz") is a tuple, so pandas treats it as a list-like of values rather than as a single scalar. It therefore cannot be broadcast across the rows of err.failure_cases, and the assign call raises the ValueError shown above.
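The behaviour can be reproduced with pandas alone. The sketch below is only an illustration: the one-row frame is a hypothetical stand-in for err.failure_cases, not the frame pandera actually builds.

```python
import pandas as pd

# Hypothetical stand-in for err.failure_cases: one failing row.
failure_cases = pd.DataFrame({"failure_case": ["object"]})

# A MultiIndex column key is a tuple, which pandas considers list-like.
column = ("foo", "baz")

try:
    # pandas tries to use the 2-tuple as two row values for a 1-row frame.
    failure_cases.assign(column=column)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

A plain string such as "foo" in place of the tuple is a scalar to pandas and is broadcast to every row without error.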

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample, a copy-pastable example

import pandas as pd
from pandera import Column, DataFrameSchema

schema = DataFrameSchema({
    ("foo", "bar"): Column(int),
    ("foo", "baz"): Column(int)
})

df = pd.DataFrame({
    ("foo", "bar"): [1, 2, 3],
    ("foo", "baz"): ["a", "b", "c"],
})

print(schema.validate(df, lazy=True))

Expected behavior

A pandera.errors.SchemaErrors exception should be raised that reports the type mismatch on the column ("foo", "baz"), with ("foo", "baz") as the value in the column named column of its failure cases.

Desktop:

  • OS: Windows 10
  • Version: Python 3.9.0, with pandera 0.7.0, pandas 1.1.4 installed

Additional context

If we change the code above to

import pandas as pd
from pandera import Column, DataFrameSchema, Check

schema = DataFrameSchema({
    ("foo", "bar"): Column(int, checks=Check(lambda s: s == 1)),
    ("foo", "baz"): Column(str, name="b")
})

df = pd.DataFrame({
    ("foo", "bar"): [1, 2, 3],
    ("foo", "baz"): ["a", "b", "c"],
})

try:
    schema.validate(df, lazy=True)
except Exception as e:
    print(e.failure_cases)

The output would be

  schema_context column     check  check_number  failure_case  index
0         Column    foo  <lambda>             0             2      1
1         Column    bar  <lambda>             0             3      2

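This shows that the column name ("foo", "bar") is incorrectly interpreted as a pandas.Series-like object when err.failure_cases.assign is called, and is unpacked element by element. Because there happen to be exactly two failure cases, "foo" lands in the first row and "bar" in the second, instead of the whole tuple appearing in each row, and no error is raised.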
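The silent unpacking can also be shown with pandas alone. In this minimal sketch the two-row frame is a hypothetical stand-in for err.failure_cases, with values copied from the output above.

```python
import pandas as pd

# Hypothetical stand-in for err.failure_cases: two failing rows.
failure_cases = pd.DataFrame({"failure_case": [2, 3], "index": [1, 2]})

# The tuple length matches the row count, so pandas accepts it
# and spreads its elements over the rows, one element per row.
result = failure_cases.assign(column=("foo", "bar"))

print(result["column"].tolist())
```

The length check passes only by coincidence here, which is why this variant produces a wrong table rather than a ValueError.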

Potential Fix

A band-aid fix would be to manually broadcast the value for the column named column before assigning it to err.failure_cases, i.e.

failure_cases = err.failure_cases.assign(
                    schema_context=err.schema.__class__.__name__,
                    check=check_identifier,
                    check_number=err.check_index,
                    column=[column] * len(err.failure_cases),
                )

which seems to have fixed the problem.
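The effect of the manual broadcast can be checked in isolation. The helper name add_column_label below is hypothetical and is not part of pandera; the frames are stand-ins for err.failure_cases.

```python
import pandas as pd


def add_column_label(failure_cases, column):
    """Attach the column key to every row, keeping tuple keys intact.

    Hypothetical helper for illustration only; not pandera's API.
    """
    # Repeating the key in a list gives pandas one value per row,
    # so a tuple key is stored whole instead of being unpacked.
    return failure_cases.assign(column=[column] * len(failure_cases))


one_row = pd.DataFrame({"failure_case": ["object"]})
two_rows = pd.DataFrame({"failure_case": [2, 3]})

fixed_one = add_column_label(one_row, ("foo", "baz"))
fixed_two = add_column_label(two_rows, ("foo", "bar"))

print(fixed_one["column"].tolist())
print(fixed_two["column"].tolist())
```

The same approach also works for plain string column names, since a list of repeated strings has one entry per row as well.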
