Adds Null Issue Manager #856

abhijitpal1247 · 2023-10-03T18:32:26Z

closes #810

Made a basic IssueManager for handling null values.

CLAassistant · 2023-10-03T18:32:31Z

All committers have signed the CLA.

codecov · 2023-10-03T18:42:06Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (8b8bf78) 96.70% compared to head (cfa26aa) 96.82%.
Report is 3 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #856      +/-   ##
==========================================
+ Coverage   96.70%   96.82%   +0.12%     
==========================================
  Files          65       66       +1     
  Lines        5091     5229     +138     
  Branches      875      905      +30     
==========================================
+ Hits         4923     5063     +140     
+ Misses         86       85       -1     
+ Partials       82       81       -1

Files	Coverage Δ
cleanlab/datalab/internal/issue_manager/null.py	`100.00% <100.00%> (ø)`

... and 3 files with indirect coverage changes


jwmueller · 2023-10-04T12:25:34Z

Thanks for the contribution @abhijitpal1247!

To make this easier to review, can you:

Add a unit test that, for now, gets this PR to at least 60-70% codecov (i.e. a basic unit test that evaluates the end-to-end usage of your new issue manager from the Datalab level).
Add a comment to your PR description showing example code to test this out and exactly what the output looks like (copy/paste the resulting Datalab report() and its issues, issue_summary, and info attributes).

Thanks, we can get somebody to review this PR once those basics are provided. After your code structure has been verified to be good, we'll ask you to add a few more unit tests in order to get the codecov of this new code up to 100%.

abhijitpal1247 · 2023-10-08T05:53:34Z

@jwmueller
inputs:

embeddings:
[[ nan 0.95071431 0.73199394], [0.59865848 0.15601864 0.15599452], [0.05808361 nan 0.60111501], [0.70807258 0.02058449 0.96990985]]

outputs:

report:

----------------------- null issues ------------------------

Number of examples with this issue: 2
Overall dataset quality in terms of this issue: 0.2500

Examples representing most severe instances of this issue:
   is_null_issue  null_score
1          False    0.000000
3          False    0.000000
0           True    0.333333
2           True    0.666667

Additional Information: 
average_null_score: 0.25

issues:

[{'is_null_issue': True, 'null_score': 0.3333333432674408}, {'is_null_issue': False, 'null_score': 0.0}, {'is_null_issue': True, 'null_score': 0.6666666865348816}, {'is_null_issue': False, 'null_score': 0.0}]

summary:

[{'issue_type': 'null', 'score': 0.25}]

info:

{'average_null_score': 0.25}

P.S.: Converted dataframes to dict and presented here
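
The summary score and average_null_score in the output above are both 0.25, which matches the mean of the four per-row null_score values. A minimal sketch of that arithmetic, assuming (the thread does not confirm this) that the summary is a plain mean of the per-row scores; the scores are the posted values written as exact fractions:

```python
from statistics import fmean

# Per-row null_score values from the posted issues output, in row order 0..3.
# 0.3333... and 0.6666... are written as 1/3 and 2/3 here.
null_scores = [1 / 3, 0.0, 2 / 3, 0.0]

# Assumption: the dataset-level summary is the plain mean of the row scores.
average_null_score = fmean(null_scores)

print(f"average_null_score: {average_null_score:.4f}")
```

The four-decimal formatting mirrors the "0.2500" shown in the report.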

jwmueller

The quality score for each row should be the fraction of features which are not null in that row.

The rows that are marked as is_null_issue should ONLY be those rows which are 100% null values.
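
The two rules above can be sketched as follows. This is a dependency-free illustration and not the code from this PR; `score_null_rows` is a hypothetical name, and rows are assumed to be non-empty sequences of floats in which NaN (or None) marks a missing value:

```python
import math
from typing import List, Optional, Sequence, Tuple


def score_null_rows(
    rows: Sequence[Sequence[Optional[float]]],
) -> Tuple[List[bool], List[float]]:
    """Return (is_null_issue, quality_score) for each row.

    Hypothetical helper for illustration; assumes every row is non-empty.
    """
    is_null_issue: List[bool] = []
    quality_scores: List[float] = []
    for row in rows:
        null_count = sum(1 for v in row if v is None or math.isnan(v))
        # Rule 1: the quality score is the fraction of features that are NOT null.
        quality_scores.append(1.0 - null_count / len(row))
        # Rule 2: flag a row only when every feature in it is null.
        is_null_issue.append(null_count == len(row))
    return is_null_issue, quality_scores


nan = float("nan")
flags, scores = score_null_rows(
    [
        [nan, 0.95, 0.73],  # one of three features missing: not flagged
        [0.60, 0.16, 0.16],  # nothing missing: score 1.0
        [nan, nan, nan],  # everything missing: flagged, score 0.0
    ]
)
print(flags, scores)
```

Under these rules a partially null row lowers its score but is not flagged, so only the last row above is reported as an issue.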

elisno

Great work @abhijitpal1247!

I suggest you extract the core logic of computing the issue masks and scores into a separate method (see comment).

I also suggest you compute additional information in collect_info.

Finally, I suggest you split the "parametrized" embeddings into separate test classes. One without null values, another with some null values.
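
One way the extraction and the test split could look, as a rough sketch only: `compute_null_mask_and_scores` and both test class names are hypothetical, the scoring follows the criteria jwmueller stated (score is the fraction of non-null features, flag only fully null rows), and plain lists stand in for the embeddings so the example needs no third-party packages:

```python
import math
from typing import List, Sequence, Tuple

NAN = float("nan")


def compute_null_mask_and_scores(
    rows: Sequence[Sequence[float]],
) -> Tuple[List[bool], List[float]]:
    """Hypothetical extracted helper: all-null mask and quality score per row."""
    mask: List[bool] = []
    scores: List[float] = []
    for row in rows:
        nulls = sum(1 for v in row if math.isnan(v))
        mask.append(nulls == len(row))
        scores.append(1.0 - nulls / len(row))
    return mask, scores


class TestWithoutNullValues:
    rows = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

    def test_no_row_is_flagged(self):
        mask, _ = compute_null_mask_and_scores(self.rows)
        assert mask == [False, False]

    def test_all_scores_are_one(self):
        _, scores = compute_null_mask_and_scores(self.rows)
        assert scores == [1.0, 1.0]


class TestWithSomeNullValues:
    rows = [[NAN, 0.2, 0.3], [NAN, NAN, NAN]]

    def test_only_fully_null_row_is_flagged(self):
        mask, _ = compute_null_mask_and_scores(self.rows)
        assert mask == [False, True]

    def test_scores_reflect_fraction_of_non_null_features(self):
        _, scores = compute_null_mask_and_scores(self.rows)
        assert math.isclose(scores[0], 2 / 3)
        assert scores[1] == 0.0
```

Keeping each kind of input in its own class gives every case its own fixture and expectations, instead of branching on the parameter inside a single parametrized test.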

cleanlab/datalab/internal/issue_manager/null.py

tests/datalab/issue_manager/test_null.py

cleanlab/datalab/internal/issue_manager/null.py

jwmueller · 2023-10-14T18:28:13Z

@elisno will leave this for your review. The main criteria to ensure are simply that:

The quality score for each row should be the fraction of features which are not null in that row.
The rows that are marked as is_null_issue should ONLY be those rows which are 100% null values.

abhijitpal1247 · 2023-10-18T05:23:26Z

@elisno I have implemented the suggested changes. Let me know if I am missing out on something.

abhijitpal1247 · 2023-10-25T19:53:31Z

@elisno, I have added the changes. Please review them. As we are nearing the end of Hacktoberfest, I would love to get this merged too.

abhijitpal1247 · 2023-10-27T13:00:27Z

Hey @jwmueller, any updates? 😅 We are very close to the end of Hacktoberfest. Let me know if you need any additional info that can help you out.

elisno · 2023-10-27T14:34:27Z

Hi @abhijitpal1247, I'm pushing a few changes in a bit before we can merge this today!

elisno

Wonderful work @abhijitpal1247!
Thank you for adding this issue manager!

abhijitpal1247 added 3 commits October 3, 2023 23:51

changes for NullIssueManager

baa5021

changing null_scores to average_null_score

10a8bfe

adding comments for readability

46b471d

removing unnecessary dependencies

65fe3f8

abhijitpal1247 added 3 commits October 4, 2023 00:36

autocorrected formatting with black

c3a53f4

correcting some datatypes

aef2c60

autocorrecting the format with black-2

143e289

abhijitpal1247 mentioned this pull request Oct 3, 2023

Datalab issue type for null/missing feature values #810

Closed

Merge branch 'master' into abhijitpal1247-null

e6ca06d

abhijitpal1247 added 3 commits October 8, 2023 11:34

added test and some minor changes

2a8347a

added test for both null and non-null values

b028779

reformatted with black

0fa0710

jwmueller requested review from elisno and jwmueller and removed request for elisno October 10, 2023 13:14

jwmueller requested changes Oct 10, 2023

elisno reviewed Oct 11, 2023

abhijitpal1247 and others added 4 commits October 14, 2023 17:07

Merge branch 'cleanlab:master' into abhijitpal1247-null

4f55765

adding some more metrics to track and test them

afa34e8

removing unused imports

1c96a11

adding some type declarations

348bcad

abhijitpal1247 requested review from elisno and jwmueller October 14, 2023 14:02

abhijitpal1247 added 3 commits October 14, 2023 19:36

adding some type declarations - 2

4efc50c

adding some type declarations - 3

9791913

reformatted using black - 2

9643483

abhijitpal1247 added 4 commits October 14, 2023 19:55

typing changes - 3

931952d

removing unused imports - 2

20da838

typing changes - 4

9622342

typing changes - 5

683ba72

jwmueller removed their request for review October 14, 2023 18:27

elisno added 5 commits October 26, 2023 14:58

add numpy extra for hypothesis dev dependency

c678f1c

fix quality score for null issue manager

2b10067

As a quality score, high values mean there are proportionally fewer NaN values in a given row.

format import

29e1555

remove useless super() delegation in __init__ method

18de7a5

add property based test for scoring null issues

766ff5c

elisno added 5 commits October 28, 2023 05:24

turn info helper methods into internal methods

a87ac46

add basic docstring for method containing main logic of issue manager

f617653

update formatting

e701df0

extract variable in test

a483952

add temporary docs page for null issue manager

cfa26aa

elisno added the hacktoberfest-accepted label Oct 30, 2023

elisno approved these changes Oct 30, 2023

elisno requested review from jwmueller and removed request for jwmueller October 30, 2023 02:41

elisno merged commit 69c8c53 into cleanlab:master Oct 30, 2023
20 checks passed

jwmueller mentioned this pull request Nov 26, 2023

make null issue type more user-facing #903

Closed
