
Optimize internal multilabel score #1106

Merged · 2 commits · Apr 23, 2024

Conversation

gogetron
Contributor

Summary

This PR partially addresses #862

🎯 Purpose: Improve performance of internal multilabel_py and _get_split_generator functions.

The significant improvement comes from using numpy operations where possible.
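To illustrate the pattern (this is a hypothetical sketch of the kind of change, not the actual cleanlab internals), replacing a per-class Python loop with a single NumPy reduction over the class axis typically looks like this:

```python
import numpy as np

def multilabel_py_loop(y):
    # Baseline style: iterate over the K classes in Python,
    # computing the empirical probability of each binary label.
    K = y.shape[1]
    py = np.zeros((K, 2))
    for k in range(K):
        py[k, 1] = np.mean(y[:, k])
        py[k, 0] = 1.0 - py[k, 1]
    return py

def multilabel_py_vectorized(y):
    # Vectorized style: one mean over axis 0 covers all K classes at once.
    p1 = y.mean(axis=0)
    return np.stack([1.0 - p1, p1], axis=1)
```

Both return the same (K, 2) array; the vectorized version avoids K separate passes over the data, which is where most of the speedup in this kind of refactor comes from.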

For memory benchmarking I used the memory-profiler library. The code I used for benchmarking is copied below. I also sorted the imports in the modified files.

Code Setup

import numpy as np
from sklearn.model_selection import ShuffleSplit

from cleanlab.internal.multilabel_scorer import _get_split_generator, multilabel_py

np.random.seed(0)

N = 1_000_000
K = 20
labels = np.random.randint(K, size=(N, K))
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# We will use a larger input for multilabel_py function
N = 10_000_000
y = np.random.randint(2, size=(N, K))

Current version

%%timeit
%memit _get_split_generator(labels, cv)
# peak memory: 1457.49 MiB, increment: 1036.85 MiB
# peak memory: 1483.90 MiB, increment: 1058.73 MiB
# peak memory: 1484.05 MiB, increment: 1034.66 MiB
# peak memory: 1484.31 MiB, increment: 1034.92 MiB
# peak memory: 1484.32 MiB, increment: 1034.67 MiB
# peak memory: 1484.30 MiB, increment: 1034.66 MiB
# peak memory: 1484.32 MiB, increment: 1034.68 MiB
# peak memory: 1484.31 MiB, increment: 1034.67 MiB
# 5 s ± 304 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit 
%memit multilabel_py(y)
# peak memory: 2091.26 MiB, increment: 121.00 MiB
# peak memory: 2087.01 MiB, increment: 152.54 MiB
# peak memory: 2102.09 MiB, increment: 167.62 MiB
# peak memory: 2087.30 MiB, increment: 152.83 MiB
# peak memory: 2106.12 MiB, increment: 171.39 MiB
# peak memory: 2106.13 MiB, increment: 171.41 MiB
# peak memory: 2106.05 MiB, increment: 171.32 MiB
# peak memory: 2106.13 MiB, increment: 171.41 MiB
# 1min 27s ± 651 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR

%%timeit
%memit _get_split_generator(labels, cv)
# peak memory: 830.01 MiB, increment: 385.66 MiB
# peak memory: 766.49 MiB, increment: 350.77 MiB
# peak memory: 726.51 MiB, increment: 319.19 MiB
# peak memory: 722.98 MiB, increment: 315.66 MiB
# peak memory: 732.79 MiB, increment: 325.22 MiB
# peak memory: 872.84 MiB, increment: 465.27 MiB
# peak memory: 812.81 MiB, increment: 405.24 MiB
# peak memory: 732.82 MiB, increment: 325.25 MiB
# 2.11 s ± 67.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit 
%memit multilabel_py(y)
# peak memory: 2119.58 MiB, increment: 165.53 MiB
# peak memory: 2119.68 MiB, increment: 190.74 MiB
# peak memory: 2119.68 MiB, increment: 190.74 MiB
# peak memory: 2119.84 MiB, increment: 190.91 MiB
# peak memory: 2119.84 MiB, increment: 190.71 MiB
# peak memory: 2119.84 MiB, increment: 190.71 MiB
# peak memory: 2119.87 MiB, increment: 190.74 MiB
# peak memory: 2119.87 MiB, increment: 190.74 MiB
# 497 ms ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
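Comparing the mean timings reported above, the speedups work out to roughly 2.4x for _get_split_generator (5 s → 2.11 s) and 175x for multilabel_py (1 min 27 s → 497 ms):

```python
# Mean timings reported in the benchmark output above, in seconds.
before = {"_get_split_generator": 5.0, "multilabel_py": 87.0}
after = {"_get_split_generator": 2.11, "multilabel_py": 0.497}

for name in before:
    print(f"{name}: {before[name] / after[name]:.1f}x faster")
```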

Testing

🔍 Testing Done: Ran the existing test suite, and additionally verified that the outputs are identical before and after the refactoring.


@elisno (Member) left a comment:
Great work @gogetron!

@elisno elisno merged commit f493b99 into cleanlab:master Apr 23, 2024
19 checks passed
@elisno elisno added the performance Must go faster label Apr 23, 2024