Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds Null Issue Manager #856

Merged
merged 32 commits into from
Oct 30, 2023
Merged

Conversation

abhijitpal1247
Copy link
Contributor

@abhijitpal1247 abhijitpal1247 commented Oct 3, 2023

closes #810

Made a basic IssueManager for handling null values.

@CLAassistant
Copy link

CLAassistant commented Oct 3, 2023

CLA assistant check
All committers have signed the CLA.

@codecov
Copy link

codecov bot commented Oct 3, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (8b8bf78) 96.70% compared to head (cfa26aa) 96.82%.
Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #856      +/-   ##
==========================================
+ Coverage   96.70%   96.82%   +0.12%     
==========================================
  Files          65       66       +1     
  Lines        5091     5229     +138     
  Branches      875      905      +30     
==========================================
+ Hits         4923     5063     +140     
+ Misses         86       85       -1     
+ Partials       82       81       -1     
Files Coverage Δ
cleanlab/datalab/internal/issue_manager/null.py 100.00% <100.00%> (ø)

... and 3 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jwmueller
Copy link
Member

Thanks for the contribution @abhijitpal1247!

To make this easier to review, can you:

  1. Add a unit test that for now gets this PR to at least 60-70% codecov (ie. a basic unit test that evaluates the end-to-end usage of your new issue manager from the Datalab level).

  2. Add a comment to your PR description showing example code to test this out, and what the output looks like exactly (copy/paste the resulting Datalab report() and its issues, issue_summary, and info attributes.

Thanks, we can get somebody to review this PR once those basics are provided. After your code structure has been verified to be good, we'll ask you to add a few more unit tests in order to get the codecov of this new code up to 100%.

@abhijitpal1247
Copy link
Contributor Author

abhijitpal1247 commented Oct 8, 2023

@jwmueller
inputs:

embeddings:
[[ nan 0.95071431 0.73199394], [0.59865848 0.15601864 0.15599452], [0.05808361 nan 0.60111501], [0.70807258 0.02058449 0.96990985]]

outputs:

report:

----------------------- null issues ------------------------

Number of examples with this issue: 2
Overall dataset quality in terms of this issue: 0.2500

Examples representing most severe instances of this issue:
   is_null_issue  null_score
1          False    0.000000
3          False    0.000000
0           True    0.333333
2           True    0.666667

Additional Information: 
average_null_score: 0.25

issues:

[{'is_null_issue': True, 'null_score': 0.3333333432674408}, {'is_null_issue': False, 'null_score': 0.0}, {'is_null_issue': True, 'null_score': 0.6666666865348816}, {'is_null_issue': False, 'null_score': 0.0}]

summary:

[{'issue_type': 'null', 'score': 0.25}]

info:

{'average_null_score': 0.25}

P.S.: Converted dataframes to dict and presented here

@jwmueller jwmueller requested review from elisno and jwmueller and removed request for elisno October 10, 2023 13:14
Copy link
Member

@jwmueller jwmueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The quality score for each row should be the fraction of features which are not null in that row.

The rows that are marked as is_null_issue should ONLY be those rows which are 100% null values.

Copy link
Member

@elisno elisno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @abhijitpal1247!

I suggest you extract the core logic of computing the issue masks and scores into a separate method (see comment).

I also suggest you compute additional information in collect_info.

Finally, I suggest you split the "parametrized" embeddings into separate test classes. One without null values, another with some null values.

cleanlab/datalab/internal/issue_manager/null.py Outdated Show resolved Hide resolved
cleanlab/datalab/internal/issue_manager/null.py Outdated Show resolved Hide resolved
cleanlab/datalab/internal/issue_manager/null.py Outdated Show resolved Hide resolved
tests/datalab/issue_manager/test_null.py Outdated Show resolved Hide resolved
cleanlab/datalab/internal/issue_manager/null.py Outdated Show resolved Hide resolved
@jwmueller jwmueller removed their request for review October 14, 2023 18:27
@jwmueller
Copy link
Member

@elisno will leave this for your review. The main criteria to ensure are simply that:

  • The quality score for each row should be the fraction of features which are not null in that row.
  • The rows that are marked as is_null_issue should ONLY be those rows which are 100% null values.

@abhijitpal1247
Copy link
Contributor Author

@elisno I have implemented the suggested changes. Let me know if I am missing out on something.

@abhijitpal1247
Copy link
Contributor Author

@elisno , I have added the changes. Please review it. As we are nearing to the end of hacktoberfest, I would love to get this merge too.

@abhijitpal1247
Copy link
Contributor Author

abhijitpal1247 commented Oct 27, 2023

Hey @jwmueller any updates 😅 We are very close to the end of hacktoberfest. Let me know if you need any additional info that can help you out.

@elisno
Copy link
Member

elisno commented Oct 27, 2023

Hi @abhijitpal1247, I'm pushing a few changes in a bit before we can merge this today!

Copy link
Member

@elisno elisno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful work @abhijitpal1247!
Thank you for adding this issue manager!

@elisno elisno requested review from jwmueller and removed request for jwmueller October 30, 2023 02:41
@elisno elisno merged commit 69c8c53 into cleanlab:master Oct 30, 2023
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Datalab issue type for null/missing feature values
4 participants