Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix handling NaN values when fitting JS univariate drift #340

Merged
merged 7 commits into from
Nov 20, 2023

Conversation

michael-nml
Copy link
Collaborator

@michael-nml michael-nml commented Nov 19, 2023

Primary purpose for this PR is to fix fitting Jensen Shannon method when reference data contains NaN values, described in #339.

Changes

  • Remove NaN's from reference data when fitting JS.
  • Refactored to use a common data cleaning method, previously there were different variations in use across calculators.
  • Improved error reporting when fitting univariate calculator methods to specify method and column name that failed.
  • Removed InvalidArgumentException about NaN arguments from performance calculation so NaN will be returned instead with a warning message. This should be in line with behaviour for other calculators.

Note

I've not added special handling when reference data contains only NaN values. This will still result in the exception mentioned in #339 (with mention of the violating method and column name). This likely applies for most univariate drift methods, if not all of them.

This seems like a different case than analysis data containing only NaN values that we should still evaluate.

Copy link

codecov bot commented Nov 19, 2023

Codecov Report

Attention: 9 lines in your changes are missing coverage. Please review.

Comparison is base (51105c7) 82.80% compared to head (59ba73c) 82.91%.

Files Patch % Lines
nannyml/base.py 78.94% 3 Missing and 1 partial ⚠️
nannyml/drift/univariate/calculator.py 75.00% 4 Missing ⚠️
...e_calculation/metrics/multiclass_classification.py 90.90% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #340      +/-   ##
==========================================
+ Coverage   82.80%   82.91%   +0.11%     
==========================================
  Files         100      100              
  Lines        7554     7540      -14     
  Branches     1357     1351       -6     
==========================================
- Hits         6255     6252       -3     
+ Misses        963      960       -3     
+ Partials      336      328       -8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Previously the data cleaning method operated by accepting multiple
dataframes and inspecting each dataframe separetely for `NaN`'s.
Depending on how the data is processed after cleaning, splitting columns
into separate dataframes can be rather annoying.

To avoid that this commit changes the method to accept a single
dataframe and a columns argument. The columns argument specifies which
column subsets should be inspected for `NaN`'s, enabling the same
behaviour using a more convenient syntax.
The performance calculator for binary classification had checks in place
to generate an exception if the prediction column contains nothing but
`NaN`'s. This behaviour contradicts the warning functionality that is in
the same functions that would should return `NaN` and issue a warning.
It is also inconsistent with other calculators which do issue a warning
instead of raising an error.

This commit removes the errors and relies on the existing warning
functionality.
@nnansters nnansters merged commit 93ac6e7 into main Nov 20, 2023
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Univariate drift calculation fails: autodetected range of [nan, nan] is not finite
2 participants