
Optimize internal multilabel score #1106

Merged · 2 commits · Apr 23, 2024

Conversation

gogetron
Contributor

Summary

This PR partially addresses #862

🎯 Purpose: Improve performance of internal multilabel_py and _get_split_generator functions.

The significant improvement comes from using numpy operations where possible.
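To illustrate the pattern (this is a hypothetical sketch of the kind of change, not the actual cleanlab internals), replacing a per-class Python loop with a single NumPy reduction over the class axis typically looks like this:

```python
import numpy as np

def multilabel_py_loop(y):
    # Baseline style: iterate over the K classes in Python,
    # computing the empirical probability of each binary label.
    K = y.shape[1]
    py = np.zeros((K, 2))
    for k in range(K):
        py[k, 1] = np.mean(y[:, k])
        py[k, 0] = 1.0 - py[k, 1]
    return py

def multilabel_py_vectorized(y):
    # Vectorized style: one mean over axis 0 covers all K classes at once.
    p1 = y.mean(axis=0)
    return np.stack([1.0 - p1, p1], axis=1)
```

Both return the same (K, 2) array; the vectorized version avoids K separate passes over the data, which is where most of the speedup in this kind of refactor comes from.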

For memory benchmarking I used the memory-profiler library. The code I used for benchmarking is copied below. I also sorted the imports in the modified files.

Code Setup

import numpy as np
from sklearn.model_selection import ShuffleSplit

from cleanlab.internal.multilabel_scorer import _get_split_generator, multilabel_py

np.random.seed(0)

N = 1_000_000
K = 20
labels = np.random.randint(K, size=(N, K))
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# We will use a larger input for multilabel_py function
N = 10_000_000
y = np.random.randint(2, size=(N, K))

Current version

%%timeit
%memit _get_split_generator(labels, cv)
# peak memory: 1457.49 MiB, increment: 1036.85 MiB
# peak memory: 1483.90 MiB, increment: 1058.73 MiB
# peak memory: 1484.05 MiB, increment: 1034.66 MiB
# peak memory: 1484.31 MiB, increment: 1034.92 MiB
# peak memory: 1484.32 MiB, increment: 1034.67 MiB
# peak memory: 1484.30 MiB, increment: 1034.66 MiB
# peak memory: 1484.32 MiB, increment: 1034.68 MiB
# peak memory: 1484.31 MiB, increment: 1034.67 MiB
# 5 s ± 304 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit 
%memit multilabel_py(y)
# peak memory: 2091.26 MiB, increment: 121.00 MiB
# peak memory: 2087.01 MiB, increment: 152.54 MiB
# peak memory: 2102.09 MiB, increment: 167.62 MiB
# peak memory: 2087.30 MiB, increment: 152.83 MiB
# peak memory: 2106.12 MiB, increment: 171.39 MiB
# peak memory: 2106.13 MiB, increment: 171.41 MiB
# peak memory: 2106.05 MiB, increment: 171.32 MiB
# peak memory: 2106.13 MiB, increment: 171.41 MiB
# 1min 27s ± 651 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR

%%timeit
%memit _get_split_generator(labels, cv)
# peak memory: 830.01 MiB, increment: 385.66 MiB
# peak memory: 766.49 MiB, increment: 350.77 MiB
# peak memory: 726.51 MiB, increment: 319.19 MiB
# peak memory: 722.98 MiB, increment: 315.66 MiB
# peak memory: 732.79 MiB, increment: 325.22 MiB
# peak memory: 872.84 MiB, increment: 465.27 MiB
# peak memory: 812.81 MiB, increment: 405.24 MiB
# peak memory: 732.82 MiB, increment: 325.25 MiB
# 2.11 s ± 67.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit 
%memit multilabel_py(y)
# peak memory: 2119.58 MiB, increment: 165.53 MiB
# peak memory: 2119.68 MiB, increment: 190.74 MiB
# peak memory: 2119.68 MiB, increment: 190.74 MiB
# peak memory: 2119.84 MiB, increment: 190.91 MiB
# peak memory: 2119.84 MiB, increment: 190.71 MiB
# peak memory: 2119.84 MiB, increment: 190.71 MiB
# peak memory: 2119.87 MiB, increment: 190.74 MiB
# peak memory: 2119.87 MiB, increment: 190.74 MiB
# 497 ms ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
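Comparing the mean timings reported above, the speedups work out to roughly 2.4x for _get_split_generator (5 s → 2.11 s) and 175x for multilabel_py (1 min 27 s → 497 ms):

```python
# Mean timings reported in the benchmark output above, in seconds.
before = {"_get_split_generator": 5.0, "multilabel_py": 87.0}
after = {"_get_split_generator": 2.11, "multilabel_py": 0.497}

for name in before:
    print(f"{name}: {before[name] / after[name]:.1f}x faster")
```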

Testing

🔍 Testing Done: Ran the existing test suite, and additionally verified that the outputs are identical before and after the refactoring.


@elisno (Member) left a comment:
Great work @gogetron!

@elisno elisno merged commit f493b99 into cleanlab:master Apr 23, 2024
19 checks passed
@elisno elisno added the performance Must go faster label Apr 23, 2024