🧼🔎 SelfClean

A holistic self-supervised data cleaning strategy to detect off-topic samples, near duplicates, and label errors.

Publications: SelfClean Paper (NeurIPS24) | Data Cleaning Protocol Paper (ML4H23@NeurIPS)

NOTE: Make sure to have git-lfs installed before pulling the repository to ensure the pre-trained models are pulled correctly (git-lfs install instructions).
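If the repository was cloned without git-lfs, large files are left as small text pointers instead of the real content. The sketch below shows one way to spot such files. It relies on the fact that a Git LFS pointer file starts with the line `version https://git-lfs.github.com/spec/v1`. The helper names `is_lfs_pointer` and `find_unresolved_pointers` are illustrative and not part of SelfClean.

```python
from pathlib import Path

# First line of every Git LFS pointer file, as defined by the Git LFS spec.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` looks like an unresolved Git LFS pointer file."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_unresolved_pointers(root, pattern="*"):
    """List files under `root` (hypothetical helper) that are still LFS pointers."""
    return sorted(
        p for p in Path(root).rglob(pattern)
        if p.is_file() and is_lfs_pointer(p)
    )
```

An empty result from `find_unresolved_pointers("path/to/SelfClean")` suggests the LFS files were pulled; any listed file can usually be fetched with `git lfs pull`.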

This project is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International license.

Installation

Install SelfClean via PyPI:

# upgrade pip to its latest version
pip install -U pip

# install selfclean
pip install selfclean

# alternatively, use an explicit Python version (replace XX)
python3.XX -m pip install selfclean
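After installing, a quick check that the package is visible to the interpreter you intend to use can save some confusion, especially when several Python versions are present. This is a minimal sketch using only the standard library; `installed_version` and `is_importable` are illustrative helper names.

```python
import importlib.util
from importlib import metadata


def installed_version(package):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_importable(module):
    """Return True if a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module) is not None


# Reports None / False when run with an interpreter that lacks the package.
print("selfclean version:", installed_version("selfclean"))
print("selfclean importable:", is_importable("selfclean"))
```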

Getting Started

You can run SelfClean in a few lines of code:

import copy

from selfclean import SelfClean

selfclean = SelfClean(
    # displays the top-7 images from each error type
    # this option is disabled by default
    plot_top_N=7,
)

# option 1: run on a PyTorch dataset
# (`dataset` is your own, already constructed dataset object)
issues = selfclean.run_on_dataset(
    dataset=copy.copy(dataset),
)
# option 2: run on an image folder
issues = selfclean.run_on_image_folder(
    input_path="path/to/images",
)

# get the data quality issue rankings
df_near_duplicates = issues.get_issues("near_duplicates", return_as_df=True)
df_off_topic_samples = issues.get_issues("off_topic_samples", return_as_df=True)
df_label_errors = issues.get_issues("label_errors", return_as_df=True)
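A ranking like the ones above is typically used to pick a small set of candidates for manual review. The following is a minimal sketch of that step in plain Python, not SelfClean's own API. It assumes each entry is an `(index, score)` pair and that a lower score marks a more likely issue; both are assumptions, so check the columns and score direction of the returned DataFrame before relying on it. The values in `example_ranking` are made up for illustration.

```python
def shortlist(ranking, top_k=5):
    """Return the `top_k` entries with the lowest score.

    ASSUMPTION: each entry is an (index, score) pair and a lower score
    marks a more likely issue; verify this against the actual output.
    """
    return sorted(ranking, key=lambda pair: pair[1])[:top_k]


# hypothetical ranking, not real SelfClean output
example_ranking = [(0, 0.91), (1, 0.07), (2, 0.55), (3, 0.02), (4, 0.78)]
to_review = shortlist(example_ranking, top_k=2)
print(to_review)  # [(3, 0.02), (1, 0.07)]
```

Reviewing a short list first keeps the manual effort small and gives a feel for where the ranking stops surfacing real issues.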

Examples: The examples/ folder contains notebooks that show how to analyze and clean datasets with SelfClean. They cover benchmark datasets such as:

Imagenette 🖼️ (Open in NBViewer | GitHub | Colab)
Oxford-IIIT Pet 🐶 (Open in NBViewer | GitHub | Colab)

Also, check out our Kaggle notebook to see an illustration of how to get a gold medal for cleaning a competition dataset.

Development Environment

Run make for a list of possible targets.

Run these commands to install the requirements for the development environment:

make init
make install

To run linters on all files:

pre-commit run --all-files

We use the following packages for code and test conventions:

black for code style
isort for import sorting
pytest for running tests
