Fix handling `NaN` values when fitting JS univariate drift #340

michael-nml · 2023-11-19T19:44:31Z

Primary purpose for this PR is to fix fitting Jensen Shannon method when reference data contains NaN values, described in #339.

Changes

Remove NaN's from reference data when fitting JS.
Refactored to use a common data cleaning method, previously there were different variations in use across calculators.
Improved error reporting when fitting univariate calculator methods to specify method and column name that failed.
Removed InvalidArgumentException about NaN arguments from performance calculation so NaN will be returned instead with a warning message. This should be in line with behaviour for other calculators.

Note

I've not added special handling when reference data contains only NaN values. This will still result in the exception mentioned in #339 (with mention of the violating method and column name). This likely applies for most univariate drift methods, if not all of them.

This seems like a different case than analysis data containing only NaN values that we should still evaluate.

codecov · 2023-11-19T19:52:30Z

Codecov Report

Attention: 9 lines in your changes are missing coverage. Please review.

Comparison is base (51105c7) 82.80% compared to head (59ba73c) 82.91%.

Files	Patch %	Lines
nannyml/base.py	78.94%	3 Missing and 1 partial ⚠️
nannyml/drift/univariate/calculator.py	75.00%	4 Missing ⚠️
...e_calculation/metrics/multiclass_classification.py	90.90%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #340      +/-   ##
==========================================
+ Coverage   82.80%   82.91%   +0.11%     
==========================================
  Files         100      100              
  Lines        7554     7540      -14     
  Branches     1357     1351       -6     
==========================================
- Hits         6255     6252       -3     
+ Misses        963      960       -3     
+ Partials      336      328       -8

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Previously the data cleaning method operated by accepting multiple dataframes and inspecting each dataframe separetely for `NaN`'s. Depending on how the data is processed after cleaning, splitting columns into separate dataframes can be rather annoying. To avoid that this commit changes the method to accept a single dataframe and a columns argument. The columns argument specifies which column subsets should be inspected for `NaN`'s, enabling the same behaviour using a more convenient syntax.

The performance calculator for binary classification had checks in place to generate an exception if the prediction column contains nothing but `NaN`'s. This behaviour contradicts the warning functionality that is in the same functions that would should return `NaN` and issue a warning. It is also inconsistent with other calculators which do issue a warning instead of raising an error. This commit removes the errors and relies on the existing warning functionality.

michael-nml added 3 commits November 19, 2023 19:00

Add column & method for univariate fitting errors

55ff6a3

Refactor to use single data cleaning method

0cfcf92

Filter NaN's when fitting JS

369efbf

michael-nml requested review from nnansters and nikml as code owners November 19, 2023 19:44

michael-nml added 3 commits November 19, 2023 23:09

Refactor more data cleaning methods

7131419

michael-nml linked an issue Nov 20, 2023 that may be closed by this pull request

Univariate drift calculation fails: autodetected range of [nan, nan] is not finite #339

Closed

Deal with mypy overload issue

59ba73c

nnansters merged commit 93ac6e7 into main Nov 20, 2023
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix handling `NaN` values when fitting JS univariate drift #340

Fix handling `NaN` values when fitting JS univariate drift #340

michael-nml commented Nov 19, 2023 •

edited

Loading

codecov bot commented Nov 19, 2023 •

edited

Loading

Fix handling NaN values when fitting JS univariate drift #340

Fix handling NaN values when fitting JS univariate drift #340

Conversation

michael-nml commented Nov 19, 2023 • edited Loading

Changes

Note

codecov bot commented Nov 19, 2023 • edited Loading

Codecov Report

Fix handling `NaN` values when fitting JS univariate drift #340

Fix handling `NaN` values when fitting JS univariate drift #340

michael-nml commented Nov 19, 2023 •

edited

Loading

codecov bot commented Nov 19, 2023 •

edited

Loading