Releases · cleanlab/cleanlab

@allincowell

This release introduces new features and improvements aimed at helping users detect complex dataset issues and improve their ML models' robustness. As always, we maintain backward compatibility, making this release non-breaking when upgrading from v2.6.6. We continue to support Python 3.8-3.11 in this version, but support for Python 3.8 will be dropped in a future minor release.

Introducing Spurious Correlation Detection in Datalab

With this release, Datalab now detects spurious correlations in image datasets by default, helping users identify potentially misleading patterns that may lead to overfitting or reduced model generalization.

Spurious correlations occur when models pick up on patterns in the data that are coincidental rather than meaningful. For example, a model might incorrectly associate the background color with a particular label, leading to poor generalization on new data. Identifying these correlations helps ensure more reliable models by minimizing the risk of learning from irrelevant or misleading features.
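
To make the idea concrete, here is a minimal sketch (not cleanlab's implementation) of one way to check whether a single image property predicts the labels suspiciously well. The property names (brightness, sharpness), the toy values, and both helper functions are hypothetical, and the sketch assumes binary 0/1 labels.

```python
from typing import Sequence


def majority_baseline(labels: Sequence[int]) -> float:
    """Accuracy of always predicting the most common label."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return max(counts.values()) / len(labels)


def best_threshold_accuracy(values: Sequence[float], labels: Sequence[int]) -> float:
    """Best accuracy of the rule 'value > t means class 1' (or its flip)."""
    n = len(labels)
    best = 0.0
    for t in sorted(set(values)):
        correct = sum((v > t) == bool(y) for v, y in zip(values, labels))
        # Also consider the flipped rule 'value > t means class 0'.
        best = max(best, correct / n, (n - correct) / n)
    return best


# Hypothetical per-image properties and binary labels.
labels = [1, 1, 1, 0, 0, 0]
brightness = [0.9, 0.85, 0.8, 0.2, 0.15, 0.1]  # separates the classes perfectly
sharpness = [0.5, 0.1, 0.9, 0.5, 0.1, 0.9]     # unrelated to the labels

baseline = majority_baseline(labels)
brightness_acc = best_threshold_accuracy(brightness, labels)
sharpness_acc = best_threshold_accuracy(sharpness, labels)

print(f"baseline={baseline:.2f} brightness={brightness_acc:.2f} sharpness={sharpness_acc:.2f}")
```

A property that beats the majority baseline by a wide margin on its own (brightness here) is a candidate spurious correlation worth inspecting; one that does no better than the baseline (sharpness here) is not.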

Detecting spurious correlations in image datasets is straightforward:

from cleanlab import Datalab

lab = Datalab(data=image_dataset, label_name="label_column", image_key="image_column")
lab.find_issues()
lab.report()

You can find a more detailed workflow for finding spurious correlations in our documentation.

This new issue type aims to give users deeper insights into their data, enabling more robust model development.

New Tutorial: Improving ML Performance with Train and Test Set Curation

We've introduced a new tutorial that demonstrates how to carefully use cleanlab (via Datalab) for both training and test data. This approach helps ensure reliable ML model training and evaluation, particularly for noisy datasets.

You can find this tutorial in our documentation: Improving ML Performance via Data Curation with Train vs Test Splits.

Other Major Improvements

Optimized Internal Functions: Several internal optimizations have been made, including updates to clip_noise_rates, remove_noise_from_class, and clip_values functions, improving the overall efficiency of cleanlab.
Improved Underperforming Group Detection: Enhanced scoring for all underperforming groups, providing more accurate identification of problematic data subsets.

If you have ideas for new features or notice any bugs, we encourage you to open an Issue or Pull Request on our GitHub repository!

Change Log

Significant changes in this release include:

Added Spurious Correlation feature by @allincowell in #1140, #1171, #1181, #1194; @elisno in #1170, #1192, #1193, #1201; @jwmueller in #1195, #1196
Added new CLOS train test split tutorial notebook by @mturk24 in #1071; @jwmueller in #1178
Update links to Issue Type Guide in workflows tutorials by @elisno in #1168
Optimize internal clip_noise_rates and remove_noise_from_class functions by @gogetron in #1105
Optimize internal clip_values function by @gogetron in #1104
Move models.fasttext wrapper to examples repo by @jwmueller in #1173
Mypy fixes by @elisno in #1174
Improve tests in Datalab Quickstart tutorial by @allincowell in #1166
Improve docs by @mturk24 in #1177; @jwmueller in #1189; @dduong1603 in #1197; @elisno in #1204
Update Studio References by @nelsonauner in #1182
Update README by @nelsonauner in #1188
Improve cluster score for all underperforming groups by @tataganesh in #1180
Improve CI test setup by @dduong1603 in #1198

New Contributors

@dduong1603 made their first contribution in #1197

For a full list of changes, enhancements, and fixes, please refer to the Full Changelog.

@elisno

What's Changed

Improvements in Issue Type guide by @elisno in #1100; @jwmueller in #1136
Improve docstrings in token_classification/summary.py by @gogetron in #1094
Update dictionary for deciding on omitting underperforming_group_check by @elisno in #1135
Add notebook with miscellaneous Datalab workflows by @elisno in #1125, #1138
Update datalab report text by @jwmueller in #1134; @elisno in #1154
Update FAQ sections by @jwmueller in #1139; @elisno in #1152
Pin fasttext in CI by @elisno in #1144
Improve test setup by @elisno in #1146
Update quickstart links that were outdated by @jwmueller in #1148
Update knn Shapley score computation by @elisno in #1142
Refactor KNN graph handling and outlier detection in issue managers by @elisno in #1155, #1163

Full Changelog: v2.6.5...v2.6.6

@allincowell

What's Changed

Add end-to-end tests at the end of Datalab quickstart tutorial by @allincowell in #1118
Centralize existing functionality for constructing and correcting knn graphs in a separate module by @elisno in #1117, #1119, #1129
Optimize multiannotator.py for performance by @gogetron in #1077
Optimize value_counts function for performance improvement with missing classes by @gogetron in #1073
Improve test coverage for setting confident joint in CleanLearning by @elisno in #1123
Switch from np.isnan to pd.isna for null value check by @gogetron in #1096
Update pip install instruction in object detection tutorial by @elisno in #1126
Refine handling of underperforming_group issue type by @gogetron in #1099
Improve compatibility with sklearn 1.5 by removing the deprecated multi_class argument in LogisticRegression by @elisno in #1124
Display exact duplicate sets dynamically in tabular tutorial by @nelsonauner in #1128

New Contributors

@allincowell made their first contribution in #1118
@nelsonauner made their first contribution in #1128

Full Changelog: v2.6.4...v2.6.5

@gogetron

What's Changed

Various performance optimizations and test improvements by @gogetron in #1064, #1067, #1079, #1087, #1095, #1106, #1107
Restructured text and tabular classification tutorials into CleanLear… by @mturk24 in #1066
user-facing cleanlab.data_valuation module by @coding-famer in #1050
fix typo in datalab issue types by @coding-famer in #1085
Add kwargs to functions that call plt.show() by @mturk24 in #1084; by @jwmueller in #1088
update tutorials by @jwmueller in #1089, #1090, #1091
Refine type hints by @desboisGIT in #1101; by @elisno in #1086
Updated datalab issue type description for non iid issue by @mturk24 in #1102
Remove unsqueeze call in image tutorial by @elisno in #1108
Temporarily Revert to macOS 12 in CI due to Incompatibility with Python 3.8 and 3.9 by @elisno in #1110
Fix numerical instability with Euclidean distance metric by @elisno in #1113
avoid sensitive divisions by @jwmueller in #1114; by @elisno in #1116
All identical datasets tests by @elisno in #1115

New Contributors

@gogetron made their first contribution in #1064
@desboisGIT made their first contribution in #1101

Full Changelog: v2.6.3...v2.6.4

@sanjanag

This release is non-breaking when upgrading from v2.6.2.

What's Changed

Updated image_key documentation by @sanjanag in #1048
Refine Scoring and Enhance Stability for Datasets with Identical Examples by @elisno in #1056
Add warning message about TensorFlow compatibility to docs by @elisno in #1057

Full Changelog: v2.6.2...v2.6.3

@elisno

This release is non-breaking when upgrading from v2.6.1.

What's Changed

Convert DataFrame features to numpy arrays in null value check by @elisno in #1045

Full Changelog: v2.6.1...v2.6.2

@jwmueller

This release is non-breaking when upgrading from v2.6.0. Some noteworthy updates include:

The label quality score in the cleanlab.regression module is improved to be more human-readable.
- This only involves rescaling the scores to display a more human-interpretable range of scores, without affecting how your data points are ranked within a dataset according to these scores.
Better address some edge-cases in Datalab.get_issues().
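
The claim that rescaling changes only the displayed range, not the ranking, holds for any strictly increasing transform. The following is a minimal sketch of that property; the square-root transform and the toy scores are purely illustrative assumptions, not the rescaling cleanlab actually applies.

```python
import math


def rank_order(scores):
    """Indices of the data points sorted from lowest to highest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Hypothetical raw scores bunched near zero, which are hard to read.
raw_scores = [0.001, 0.2, 0.0004, 0.05]

# Hypothetical strictly increasing rescaling (NOT cleanlab's actual formula).
rescaled_scores = [math.sqrt(s) for s in raw_scores]

print(rank_order(raw_scores))
print(rank_order(rescaled_scores))
```

Both calls print the same ordering, so any workflow that reviews data points from lowest to highest score is unaffected by such a rescaling.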

What's Changed

Readme updates by @jwmueller in #1030, #1031, #1039; @elisno in #1040
Adjust the range of regression label quality scores by @huiwengoh in #1032
Misc fixes of get_issues method by @elisno in #1025, #1026, #1028
Support features as input for data valuation check in Datalab by @elisno in #1023
Fix/clarify docs by @mturk24 in #1029; @elisno in #1024, #1037
CI/CD changes by @elisno in #1036

New Contributors

@mturk24 made their first contribution in #1029

Full Changelog: v2.6.0...v2.6.1

@smttsp

This release is non-breaking when upgrading from v2.5.0, continuing our commitment to maintaining backward compatibility while introducing new features and improvements.
However, this release drops support for Python 3.7 while adding support for Python 3.11.

Enhancements to Datalab

In this update, Datalab, our dataset analysis platform, enhances its ability to identify various types of issues within your datasets. With this release, Datalab now detects additional types of issues by default, offering users a more comprehensive analysis. Specifically, it can now:

Identify null values in your dataset.
Detect class_imbalance.
Highlight an underperforming_group, which refers to a subset of data points where your model exhibits poorer performance compared to others.
See our FAQ for more information on how to provide pre-defined groups for this issue type.
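
As a rough illustration of what an underperforming group means, the sketch below compares per-group accuracy against overall accuracy for pre-defined groups. The function name, the toy data, and the use of plain accuracy are assumptions for illustration; Datalab's own scoring is more involved.

```python
from collections import defaultdict


def group_accuracies(groups, labels, predictions):
    """Accuracy of the model's predictions within each pre-defined group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, y, p in zip(groups, labels, predictions):
        total[g] += 1
        correct[g] += int(y == p)
    return {g: correct[g] / total[g] for g in total}


# Hypothetical group assignments, given labels, and model predictions.
groups = ["a", "a", "a", "b", "b", "b"]
labels = [0, 1, 1, 0, 1, 0]
predictions = [0, 1, 1, 1, 0, 0]

accuracy_by_group = group_accuracies(groups, labels, predictions)
overall_accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
worst_group = min(accuracy_by_group, key=accuracy_by_group.get)

print(accuracy_by_group, overall_accuracy, worst_group)
```

Here group "b" is the one where the model performs clearly worse than it does overall, which is the kind of subset this issue type is meant to surface.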

Additionally, Datalab can now optionally:

Assess the value of data points in your dataset using KNN-Shapley scores as a measure of data_valuation.
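
For readers unfamiliar with KNN-Shapley, here is a minimal sketch of the exact recursion from the data-valuation literature for an unweighted KNN classifier and a single test point. The function name, arguments, and toy data are hypothetical, and cleanlab's data_valuation check may compute and transform its scores differently.

```python
from typing import List, Sequence


def knn_shapley_single_test(
    distances: Sequence[float],
    train_labels: Sequence[int],
    test_label: int,
    k: int,
) -> List[float]:
    """Shapley value of each training point for one test point under a KNN utility."""
    n = len(train_labels)
    order = sorted(range(n), key=lambda i: distances[i])  # nearest first
    match = [1.0 if train_labels[i] == test_label else 0.0 for i in order]

    values_sorted = [0.0] * n
    # The farthest point's value is computed directly.
    values_sorted[n - 1] = match[n - 1] / n
    # Walk from the second-farthest point towards the nearest one.
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-based rank of this point by distance
        values_sorted[pos] = (
            values_sorted[pos + 1]
            + (match[pos] - match[pos + 1]) / k * min(k, rank) / rank
        )

    # Map values back to the original ordering of the training points.
    values = [0.0] * n
    for pos, idx in enumerate(order):
        values[idx] = values_sorted[pos]
    return values


# Hypothetical distances from three training points to one test point.
shapley_values = knn_shapley_single_test(
    distances=[0.1, 0.5, 0.9], train_labels=[1, 0, 1], test_label=1, k=1
)
print(shapley_values)
```

In this toy case the nearest, correctly labeled point gets the highest value and the nearby point with the other label gets a negative value. The values sum to the utility of the full training set, which is a useful sanity check for any Shapley computation.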

If you have ideas for new features or notice any bugs, we encourage you to open an Issue or Pull Request on our GitHub repository!

Expanded Datalab Support for New ML Tasks

With cleanlab v2.6.0, Datalab extends its support to new machine-learning tasks and introduces enhancements across the board.
This release introduces the task parameter in Datalab's API, enabling users to specify the type of machine learning task they are working on.

from cleanlab import Datalab

lab = Datalab(..., task="regression")

The tasks currently supported are:

classification (default): Includes all previously supported issue-checking capabilities based on pred_probs, features, or a knn_graph, and the new features introduced earlier.
regression (new):
- Run specialized label error detection algorithms on regression datasets. You can see this in action in our updated regression tutorial.
- Find other issues utilizing features or a knn_graph.
multilabel (new):
- Detect label errors in multilabel classification datasets using pred_probs exclusively. Explore the updated capabilities in our multilabel tutorial.
- Find various other types of issues based on features or a knn_graph.

Improved Object Detection Dataset Exploration

New functions have been introduced to enhance the exploration of object detection datasets, simplifying data comprehension and issue detection.
Learn how to leverage some of these functions in our object detection tutorial.

Other Major Improvements

Rescaled Near Duplicate and Outlier Scores:
- Note that what matters for all cleanlab issue scores is not their absolute magnitudes but rather how these scores rank the data points from most to least severe instances of the issue. But based on user feedback, we have updated the near duplicate and outlier scores to display a more human-interpretable range of values. How these scores rank data points within a dataset remains unchanged.
Consistency in counting label issues:
- cleanlab.dataset.health_summary() now returns the same number of issues as cleanlab.classification.find_label_issues() and cleanlab.count.num_label_issues().
Improved handling of non-iid issues:
- The non-iid issue check in Datalab now handles pred_probs as input.
Better reporting in Datalab:
- Simplified Datalab.report() now highlights only detected issue types. To view all checked issue types, use Datalab.report(show_all_issues=True).
Enhanced Handling of Binary Classification Tasks:
- Examples with predicted probabilities close to 0.5 for both classes are no longer flagged as label errors, improving the handling of binary classification tasks.
Experimental Functionality:
- cleanlab now offers experimental functionality for detecting label issues in span categorization tasks with a single class, enhancing its applicability in natural language processing projects.

New Contributors

We're thrilled to welcome new contributors to the cleanlab community! Your contributions help us improve and grow cleanlab:

@smttsp made their first contribution in #867
@abhijitpal1247 made their first contribution in #856
@01PrathamS made their first contribution in #893
@mglowacki100 made their first contribution in #796
@gibsonliketheguitar made their first contribution in #831
@kylegallatin made their first contribution in #885
@ryansingman made their first contribution in #919
@R-Peleg made their first contribution in #948

Thank you for your valuable contributions! If you're interested in contributing, check out our contributing guide for ways to get involved.

Change Log

Significant changes in this release include:

Update FAQ section in docs by @tataganesh in #869; @elisno in #913
Improve Object Detection module by @Steven-Yiran in #840, #877; @aditya1503 in #883, #969, #968
Clearer documentation/tutorials/readme by @jwmueller in #851, #931, #981, #983, #1001, #978, #994, #1010; @01PrathamS in #893; @elisno in #878, #1007, #992, #1015, #1016; @huiwengoh in #984; @sanjanag in #936; @tataganesh in #916; @ulya-tkch in #954
CI updates by @aditya1503 in #864; @elisno in #879, #961, #963, #965, #1008, #975, #1011, #1012, #1013, #1014; @jwmueller in #852, #865; @tataganesh in #900; @anishathalye in #956; @sanjanag in #1009
Docs system updates by @elisno in #880, #881, #958, #959, #960, #964
Add Null Issue Manager by @abhijitpal1247 in #856; @tataganesh in #927, #917
Add Data Valuation Issue Manager by @coding-famer in #850, #925
Extend non-iid issue check to run if only pred_probs are provided by @abhijitpal1247 in #857; @tataganesh in #896, #897
Add Underperforming Group Issue Manager by @tataganesh in #838, #907; @elisno in #990
Add Class Imbalance issue type to Datalab defaults by @tataganesh in #912, #933; @jwmueller in #924, #934; @elisno in #940
Add regression task to Datalab by @mglowacki100 in #796; @elisno in #902
Add multilabel task to Datalab by @tataganesh in #929
702 - Shorten Refs of classes and functions in Docs by @gibsonliketheguitar in #831
Update near duplicate issues and sets by @ryansingman in #919; @elisno in #8...

@gordon-lim

This release is non-breaking when upgrading from v2.4.0 (except for certain methods in cleanlab.experimental that have been moved, especially utility methods related to Datalab).

New ML tasks supported

Cleanlab now supports all of the most common ML tasks! This newest release adds dedicated support for the following types of datasets:

regression (finding errors in numeric data): see cleanlab.regression and the "noisy labels in regression" quickstart tutorial.
object detection: see cleanlab.object_detection and the "Object Detection" quickstart tutorial.
image segmentation: see cleanlab.segmentation and the "Semantic Segmentation" tutorial.

Cleanlab previously already supported: multi-class classification, multi-label classification (image/document tagging), token classification (entity recognition, sequence prediction).

If there is another ML task you'd like to see this package support, please let us know (or even better open a Pull Request)!

Supporting these ML tasks properly required significant research and novel algorithms developed by our scientists. We have published papers on these for transparency and scientific rigor; check out the list in the README or learn more at:
https://cleanlab.ai/research/
https://cleanlab.ai/blog/

Improvements to Datalab

Datalab is a general platform for detecting all sorts of common issues in real-world data, and the best place to get started for running this library on your datasets.

This release introduces major improvements and new functionalities in Datalab that include the ability to:

Detect low-quality images in computer vision data (blurry, over/under-exposed, low-information, ...) via the integration of CleanVision.
Detect label issues even without pred_probs from a ML model (you can instead just provide features).
Flag rare classes in imbalanced classification datasets.
Audit unlabeled datasets.

Other major improvements

50x speedup in the cleanlab.multiannotator code for analyzing data labeled by multiple annotators.
Out-of-Distribution detection based on pred_probs via the GEN algorithm, which is particularly effective for datasets with many classes.
Many of the methods across the package to find label issues now support a low_memory option. When specified, it uses an approximate mini-batching algorithm that returns results much faster and requires much less RAM.
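
For intuition about the GEN approach mentioned above, the following is a minimal sketch of a generalized-entropy score computed from predicted probabilities alone. The function name gen_score, the defaults gamma=0.1 and top_m=100, and the toy probabilities are assumptions for illustration; cleanlab's own implementation may differ, for example in how scores are normalized.

```python
from typing import Sequence


def gen_score(pred_probs: Sequence[float], gamma: float = 0.1, top_m: int = 100) -> float:
    """Negative generalized entropy over the top_m largest predicted probabilities.

    Higher (closer to zero) scores indicate a more confident, in-distribution-like
    prediction; lower scores suggest the example may be out-of-distribution.
    """
    top = sorted(pred_probs, reverse=True)[:top_m]
    generalized_entropy = sum((p ** gamma) * ((1 - p) ** gamma) for p in top)
    return -generalized_entropy


# Hypothetical predicted class probabilities for two examples.
confident_example = [0.98, 0.01, 0.01]
uncertain_example = [1 / 3, 1 / 3, 1 / 3]

confident_score = gen_score(confident_example)
uncertain_score = gen_score(uncertain_example)
print(confident_score, uncertain_score)
```

The confident prediction receives the higher score, so ranking examples from lowest to highest score puts the most uncertain ones first.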

New Contributors

Transforming cleanlab into the first universal data-centric AI platform is a major effort and we need your help! Many easy ways to contribute are listed on our GitHub, or you can jump into the discussions on Slack. We immensely appreciate all of the contributors who've helped build this package into what it is today, especially:

@gordon-lim made their first contribution in #746
@tataganesh made their first contribution in #751
@vdlad made their first contribution in #677
@axl1313 made their first contribution in #798
@coding-famer made their first contribution in #800

Change Log

New feature: Label error detection in regression datasets by @krmayankb in #572; by @huiwengoh in #830
New feature: ObjectLab for detecting mislabeled images in object detection datasets by @ulya-tkch in #676, #739, #745, #770, #779, #807, #833; by @aditya1503 in #750, #804
New feature: Label error detection in segmentation datasets by @vdlad in #677; by @ulya-tkch in #754, #756, #759, #772; by @elisno in #775
New feature: CleanVision to detect low-quality images by @sanjanag in #679, #797
New image quickstart tutorial that uses Datalab by @sanjanag in #795
Datalab code refactoring by @elisno in #803, #783, #793, #729
Make labels optional in Datalab by @elisno in #730
Update near-duplicate sets in Datalab by @elisno in #781
Include non-IID detection in set of default Datalab issue types by @elisno in #723
Extend Datalab to be able to detect label issues based on features by @Steven-Yiran in #760
Add imbalance issue type to Datalab by @tataganesh in #758, #828
Catch specific exception for knn in Datalab issue managers by @tataganesh in #825
Make plots smaller for datalab tutorials by @tataganesh in #751
50x speedup and other improvements in multiannotator module by @huiwengoh in #821, #784; by @ulya-tkch in #827
ENH: make clipping unnecessary for entropy by @DerWeh in #703
Extend default CleanLearning classifier to work for more datasets by @Steven-Yiran in #749
CleanLearning code improvements by @huiwengoh in #724; by @jwmueller in #744
Change CleanLearning inspect.getfullargspec to signature for sklearn v1.3 compatibility by @huiwengoh in #761
Expose low memory option for finding label issues by @tataganesh in #791, #822
Add GEN OOD-detection algorithm by @coding-famer in #800
Unify softmax implementations throughout package by @elisno in #826
Better warning handling for off_calibrated_custom in confident joint by @gordon-lim in #746
Clearer explanations in documentation/tutorials/readme by @cgnorthcutt in #725; by @jwmueller in #726, #734, #741, #743, #766, #832, #799, #752, #841, #816, #755, #731, #753, #845, #835, #847
CI and documentation system updates by @anishathalye in #742, #768, #769; by @jwmueller in #837; by @huiwengoh in #788, #757, #738, #794; by @sanjanag in #843; by @ulya-tkch in #777; by @elisno in #802; by @axl1313 in #798
Improved tests by @huiwengoh in #778, #763

Full Changelog: v2.4.0...v2.5.0

@jwmueller

Cleanlab has grown into a popular package used by thousands of data scientists to diagnose issues in diverse datasets and improve the data itself in order to fit more robust models. Many new methods/algorithms were added in recent months to increase the capabilities of this data-centric AI library.

Introducing Datalab

Now we've added a unified platform called Datalab for you to apply many of these capabilities in a single line of code!
To audit any classification dataset for issues, first use any trained ML model to produce pred_probs (predicted class probabilities) and/or feature_embeddings (numeric vector representations of each datapoint). Then, these few lines of code can detect many types of real-world issues in your dataset like label errors, outliers, near duplicates, etc:

from cleanlab import Datalab

lab = Datalab(data=dataset, label_name="column_name_for_labels")
lab.find_issues(features=feature_embeddings, pred_probs=pred_probs)
lab.report()  # summarize the issues found, how severe they are, and other useful info about the dataset

Follow our blog to better understand how this works internally; many articles will be published there shortly!
A detailed description of each type of issue Datalab can detect is provided in this guide, but we recommend first starting with the tutorials which show you how easy it is to run on your own dataset.

Datalab can be used to do things like find label issues with string class labels (whereas the prior find_label_issues() method required integer class indices). But you are still free to use all of the prior cleanlab methods you're used to! Datalab is also using these internally to detect data issues.
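
If you keep using the prior methods with string class labels, you need to map the labels to integer class indices yourself. The sketch below shows one simple way to do this; the helper name encode_labels and the alphabetical class ordering are assumptions for illustration, and the ordering must match the column order of your pred_probs.

```python
from typing import List, Sequence, Tuple


def encode_labels(labels: Sequence[str]) -> Tuple[List[int], List[str]]:
    """Map string class labels to integer indices 0..K-1 (classes sorted alphabetically)."""
    classes = sorted(set(labels))
    class_to_index = {name: index for index, name in enumerate(classes)}
    return [class_to_index[name] for name in labels], classes


# Hypothetical string labels.
string_labels = ["cat", "dog", "cat", "bird"]
integer_labels, classes = encode_labels(string_labels)

print(integer_labels)  # indices into `classes`
print(classes)         # classes[i] is the name of class index i
```

Keeping the classes list around lets you translate any flagged class index back to its original name.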

Our goal is for Datalab to be an easy way to run a comprehensive suite of cleanlab capabilities on any dataset. This is an evolving paradigm, so be aware some Datalab APIs may change in subsequent package versions -- as noted in the documentation.
You can easily run the issue checks in Datalab together with a custom issue type you define outside of cleanlab. This customizability also makes it easy to contribute new data quality algorithms into Datalab. Help us build the best open-source platform for data-centric AI by adding your ideas or those from recent publications! Feel free to reach out via Slack.

Revamped Tutorials

We've updated some of our existing tutorials with more interesting datasets and ML models. For the basic tutorials on identifying label issues in classification data from various modalities (image, text, audio, tables), we have also created analogous versions that detect issues in these same datasets with Datalab instead (see Datalab Tutorials). This should help existing users quickly ramp up on using Datalab to see how much more powerful this comprehensive data audit can be.

Improvements for Multi-label Classification

To provide a better experience for users with multi-label classification datasets, we have explicitly separated the functionality to work with these into the cleanlab.multilabel_classification module. So please start there rather than specifying the multi_label=True flag in certain methods outside of this module, as that option will be deprecated in the future.

Particularly noteworthy are the new dataset-level issue summaries for multi-label classification datasets, available in the cleanlab.multilabel_classification.dataset module.

While moving methods to the cleanlab.multilabel_classification module, we noticed some bugs in existing methods. We got rid of these methods entirely (replacing them with new ones in the cleanlab.multilabel_classification module), so some changes may appear to be backwards incompatible, even though the original code didn't function as intended in the first place.

Backwards incompatible changes

Your existing code will break if you do not upgrade to the new versions of these methods (the existing cleanlab v2.3.1 code was probably producing bad results anyway, due to bugs that have since been fixed). Here are the changes you must make in your code for it to work with newer cleanlab versions:

cleanlab.dataset.rank_classes_by_label_quality(..., multi_label=True)
→
cleanlab.multilabel_classification.dataset.rank_classes_by_label_quality(...)

The multi_label=False/True argument will be removed in the future from the former method.

cleanlab.dataset.find_overlapping_classes(..., multi_label=True)
→
cleanlab.multilabel_classification.dataset.common_multilabel_issues(...)

The multi_label=False/True argument will be removed in the future from the former method. The returned DataFrame is slightly different, please refer to the new method's documentation.

cleanlab.dataset.overall_label_health_score(..., multi_label=True)
→
cleanlab.multilabel_classification.dataset.overall_label_health_score(...)

The multi_label=False/True argument will be removed in the future from the former method.

cleanlab.dataset.health_summary(..., multi_label=True)
→
cleanlab.multilabel_classification.dataset.multilabel_health_summary(...)

The multi_label=False/True argument will be removed in the future from the former method.

There are no other backwards incompatible changes in the package with this release.

Deprecated workflows

We recommend updating your existing code to the new versions of these methods (existing cleanlab v2.3.1 code will still work though, for now). Here are changes we recommend:

cleanlab.filter.find_label_issues(..., multi_label=True)
→
cleanlab.multilabel_classification.filter.find_label_issues(...)

The multi_label=False/True argument will be removed in the future from the former method.

from cleanlab.multilabel_classification import get_label_quality_scores
→
from cleanlab.multilabel_classification.rank import get_label_quality_scores

Remember: All of the code to work with multi-label data now lives in the cleanlab.multilabel_classification module.

Change Log

readme updates by @jwmueller in #659, #660, #713
CI updates (by @sanjanag in #701; by @huiwengoh in #671; by @elisno in #695, #706)
Documentation updates (by @jwmueller in #669, #710, #711, #716, #719, #720; by @huiwengoh in #714, #717; by @elisno in #678, #684)
Documentation: use default rules for shorter, more readable links by @DerWeh in #700
Added installation instructions for package extras by @sanjanag in #697
Pass confident joint computed in CleanLearning to filter.find_label_issues by @huiwengoh in #661
Add Example codeblock to the docstrings of important functions in the dataset module by @Steven-Yiran in #662, #663, #668
Remove batch size check in label_issues_batched by @huiwengoh in #665
adding multilabel dataset issue summaries by @aditya1503 in #657
move int2onehot, onehot2int to top of multilabel tutorial by @jwmueller in #666
Update softmax to more stable variant by @ulya-tkch in #667
Revamp text and tabular tutorial by @huiwengoh in #673, #693
allow for kwargs in token find_label_issues by @jwmueller in #686
Update numpy.typing import and annotations by @elisno in #688
Standardize documentation and simplify code for outliers by @DerWeh in #689
Extract function for computing OOD scores from distances by @elisno in #664
Introduce Datalab by @elisno in #614
Introduce NonIID issue type by @jecummin in #614
Further Datalab updates by @elisno in #680, #683, #687, #690, #691, #699, #705, #709, #712
Add descriptions of issues that Datalab can detect by @elisno in #682
Datalab IssueManager.get_summary() -> make_summary() in custom issue manager example by @jwmueller in #692
Improve NonIID issue checks by @elisno in #694, #707

New Contributors

@Steven-Yiran made th...

Releases: cleanlab/cleanlab

v2.7.0 -- Broadening Data Quality Checks and ML Workflows

Introducing Spurious Correlation Detection in Datalab

New Tutorial: Improving ML Performance with Train and Test Set Curation

Other Major Improvements

Change Log

New Contributors

Contributors

v2.6.6

What's Changed

Contributors

v2.6.5

What's Changed

New Contributors

Contributors

v2.6.4

What's Changed

New Contributors

Contributors

v2.6.3 - Enhanced scores for outliers and near-duplicates

What's Changed

Contributors

v2.6.2

What's Changed

Contributors

v2.6.1 -- Refined Regression Score and Fixes

What's Changed

New Contributors

Contributors

v2.6.0 -- Elevating Data Insights: Comprehensive Issue Checks & Expanded ML Task Compatibility

Enhancements to Datalab

Expanded Datalab Support for New ML Tasks

Improved Object Detection Dataset Exploration

Other Major Improvements

New Contributors

Change Log

Contributors

v2.5.0 -- All major ML tasks now supported

New ML tasks supported

Improvements to Datalab

Other major improvements

New Contributors

Change Log

Contributors

v2.4.0 -- One line of code to detect all sorts of dataset issues

Introducing Datalab

Revamped Tutorials

Improvements for Multi-label Classification

Backwards incompatible changes

Deprecated workflows

Change Log

New Contributors

Contributors