FEA Fused sparse-dense support for PairwiseDistancesReduction
#23585
Conversation
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
This is kind of a hack for now. IMO, it would be better to use a flatiter on a view if possible. See discussions on: https://groups.google.com/g/cython-users/c/MR4xWCvUKHU Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
…diusNeighbors* methods
I have just added minimal changes to the tests and documentation for the user API. I think we could test combinations of sparse and dense datasets more thoroughly and systematically; this necessitates refactoring the tests, which we might prefer to do in another PR. I think this PR is mergeable.
A few comments below but otherwise the diff looks good to me.
However, when experimenting with this PR locally, I found a quite significant performance regression in `LocalOutlierFactor` and `Birch` on sparse float32 data (I have not yet systematically tested the others), typically around 1.2x to 1.5x slower.
For instance, for `Birch`:
- on `main`:

```python
In [1]: from sklearn.cluster import Birch
   ...: from scipy import sparse as sp
   ...: import numpy as np
   ...:
   ...: X = sp.random(5000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
   ...: %time Birch().fit(X)
CPU times: user 41.6 s, sys: 1.15 s, total: 42.7 s
Wall time: 10.9 s
```
- on this branch:

```python
In [1]: from sklearn.cluster import Birch
   ...: from scipy import sparse as sp
   ...: import numpy as np
   ...:
   ...: X = sp.random(5000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
   ...: %time Birch().fit(X)
CPU times: user 1min 24s, sys: 619 ms, total: 1min 25s
Wall time: 12.6 s
```
and for LOF:
- on `main`:

```python
In [1]: from sklearn.neighbors import LocalOutlierFactor
   ...: from scipy import sparse as sp
   ...: import numpy as np
   ...:
   ...: X = sp.random(50000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
   ...: %time lof = LocalOutlierFactor().fit(X)
   ...: lof.negative_outlier_factor_
CPU times: user 25.3 s, sys: 3.21 s, total: 28.5 s
Wall time: 28.5 s
Out[1]:
array([-2.25244  , -3.163138 , -3.0220103, ..., -3.5005074, -3.367448 ,
       -2.2913156], dtype=float32)
```
- on this branch:

```python
In [1]: from sklearn.neighbors import LocalOutlierFactor
   ...: from scipy import sparse as sp
   ...: import numpy as np
   ...:
   ...: X = sp.random(50000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
   ...: %time lof = LocalOutlierFactor().fit(X)
   ...: lof.negative_outlier_factor_
CPU times: user 4min 48s, sys: 698 ms, total: 4min 49s
Wall time: 37.5 s
Out[1]:
array([-2.25244047, -3.16313819, -3.02201071, ..., -3.50050721,
       -3.36744845, -2.29131578])
```
Also note that the `negative_outlier_factor_` fitted attribute used to be float32 on `main`, while it is upcast to float64 on this branch.
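To make the dtype point concrete, here is a minimal reproduction sketch (same setup as above with smaller shapes; the assertion reflects the `main` behaviour reported here and would fail on this branch):

```python
import numpy as np
from scipy import sparse as sp
from sklearn.neighbors import LocalOutlierFactor

X = sp.random(1000, 100, density=0.01, format="csr", random_state=0).astype(np.float32)
lof = LocalOutlierFactor().fit(X)
# On main, the fitted attribute preserves the float32 input dtype.
assert lof.negative_outlier_factor_.dtype == np.float32
```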
The previous back-end is sometimes more performant because it manages to use, for chunks of the squared Euclidean distance matrix, the same GEMM-based decomposition as in the dense case, namely:

$$\lVert x - y \rVert_2^2 = \lVert x \rVert_2^2 + \lVert y \rVert_2^2 - 2\, x^\top y$$

where the first two terms are (likely) computed once and reused, and the last one is computed via GEMM in scikit-learn/sklearn/metrics/pairwise.py, lines 568 to 570 in 9723115.
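For reference, a minimal NumPy sketch of this decomposition (illustrative only; the actual implementation chunks the computation and calls BLAS from Cython):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Y = rng.standard_normal((4, 3))

XX = (X ** 2).sum(axis=1)[:, np.newaxis]  # squared row norms of X, computed once
YY = (Y ** 2).sum(axis=1)[np.newaxis, :]  # squared row norms of Y, computed once
D2 = XX + YY - 2 * (X @ Y.T)              # the cross term is a single GEMM

# Sanity check against the naive pairwise computation.
ref = ((X[:, np.newaxis, :] - Y[np.newaxis, :, :]) ** 2).sum(axis=2)
assert np.allclose(D2, ref)
```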
In this case, I think we might be able to use the same decomposition. I would prefer exploring this in another PR and, in the meantime, just deactivate the new implementations proposed in this PR for the user-facing APIs that are subject to regressions. What do you think?
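As a hedged illustration of that deactivation (the gating condition below is an assumption for the sketch, not this PR's actual code; the dispatcher does expose an `is_usable_for` check):

```python
from scipy import sparse as sp

def is_usable_for(X, Y, metric):
    # Illustrative assumption: opt out of the new fused sparse-dense
    # back-end for the metrics that showed regressions, so callers fall
    # back to the previous implementation.
    if (sp.issparse(X) or sp.issparse(Y)) and metric in ("euclidean", "sqeuclidean"):
        return False
    return True
```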
`py-spy` profile:

```python
# script.py
from sklearn.cluster import Birch
from scipy import sparse as sp
import numpy as np

X = sp.random(5000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
Birch().fit(X)
```

```bash
py-spy record --native -o py-spy-`git rev-parse HEAD`.svg -f speedscope -- python ./script.py
```

To browse on https://www.speedscope.app/
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Thanks for your investigations. I agree with your plan. Before merging this, what do you think of #23585 (comment)?
Oh, I have already accepted this sensible suggestion in fcf15b6. 🙂
LGTM. I quickly ran some manual tests with the updated PR and did not observe any regression.
Sorry for taking so long to look at this. Overall looks great! I just left a minor suggestion for simplifying the tests a bit, and a couple of optional nits.
Edit: Should this also remove the comments here:

scikit-learn/sklearn/metrics/pairwise.py, lines 691 to 692 in fec55bf:

```python
# TODO: once ArgKmin supports sparse input matrices and 32 bit,
# we won't need to fallback to pairwise_distances_chunked anymore.
```

and here:

scikit-learn/sklearn/metrics/pairwise.py, lines 802 to 803 in fec55bf:

```python
# TODO: once ArgKmin supports sparse input matrices and 32 bit,
# we won't need to fallback to pairwise_distances_chunked anymore.
```
…and add a space somewhere for proper formatting. Co-authored-by: Meekail Zain <Micky774@users.noreply.github.com>
See: `BaseDistanceReductionDispatcher.valid_metrics` Co-authored-by: Meekail Zain <Micky774@users.noreply.github.com>
Merged! Thank you @jjerphan and everybody else!
Thank you for the reviews, @ogrisel, @thomasjpfan and @Micky774!
Reference Issues/PRs
Comes after #23515.
Relates to #22587.
What does this implement/fix? Explain your changes.
Add `SparseSparseDatasetsPair`, `SparseDenseDatasetsPair`, and `DenseSparseDatasetsPair` to bridge distance computations for pairs of fused sparse-dense datasets. This allows support for dense-sparse, sparse-dense, and sparse-sparse dataset pairs for a variety of estimators, while at the same time improving performance for the existing sparse-sparse case.
Note that this does not implement optimisations for the Euclidean and Squared Euclidean distances, for reasons explained in #23585 (comment). A rough dispatch sketch follows.
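To illustrate the idea, here is a hedged Python sketch of dispatching on operand sparsity (the pair class names come from this PR, but the stub below stands in for the actual Cython hierarchy and its `DistanceMetric` plumbing):

```python
from dataclasses import dataclass
from scipy import sparse as sp

@dataclass
class DatasetsPair:
    # Stub standing in for the {Dense,Sparse} x {Dense,Sparse} DatasetsPair
    # classes; only the dispatch logic is sketched here.
    X: object
    Y: object
    kind: str

def make_datasets_pair(X, Y):
    kind = (
        ("Sparse" if sp.issparse(X) else "Dense")
        + ("Sparse" if sp.issparse(Y) else "Dense")
        + "DatasetsPair"
    )
    return DatasetsPair(X, Y, kind)

X = sp.random(10, 5, density=0.5, format="csr", random_state=0)
print(make_datasets_pair(X, X.toarray()).kind)  # SparseDenseDatasetsPair
```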
TODO
- `DistanceMetric`: #23604
- `main` via `NearestNeighbors`