Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEA Fused sparse-dense support for PairwiseDistancesReduction #23585

Merged
merged 83 commits into from
Sep 20, 2022

Conversation

jjerphan
Copy link
Member

@jjerphan jjerphan commented Jun 10, 2022

Reference Issues/PRs

Comes after #23515.

Relates to #22587.

What does this implement/fix? Explain your changes.

Add SparseSparseDatasetsPair, SparseDenseDatasetsPair, DenseSparseDatasetsPair to bridge distances computations for pairs of fused dense-sparse datasets.

This allows support for dense-sparse, sparse-dense and sparse-sparse datasets' pairs for a variety of estimators while at the same time improving performance for the existing sparse-sparse case.

Note that this does not implement optimisation for the Euclidean and Squared Euclidean distances for reasons explained in #23585 (comment).

TODO

@jjerphan jjerphan changed the title POC Sparse support for PairwiseDistancesReduction MAINT Sparse support for PairwiseDistancesReduction Jun 15, 2022
@jjerphan jjerphan changed the title MAINT Sparse support for PairwiseDistancesReduction MAINT Fused sparse-dense support for PairwiseDistancesReduction Jun 15, 2022
jjerphan and others added 18 commits June 16, 2022 16:49
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
This is kind of an hack for now.

IMO, it would be better to use a flatiter on a view if possible.
See discussions on: https://groups.google.com/g/cython-users/c/MR4xWCvUKHU

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
doc/whats_new/v1.2.rst Outdated Show resolved Hide resolved
@jjerphan
Copy link
Member Author

jjerphan commented Sep 12, 2022

I just have added minimal changes to tests and documentation for user API.

I think we can tests combinations of sparse and dense datasets more thoroughly and systematically. This necessitates refactoring the tests, which I think we might prefer doing in another PR.

I think this PR is mergeable.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments below but otherwise the diff looks good to me.

However when experimenting with this PR locally I found a quite significant performance regression in LocalOutlierFactor and Birch on sparse float32 data (I am not yet systematically tested the others), typically around 1.2x to 1.5x slower.

For instance for Birch:

  • on main:
In [1]: from sklearn.cluster import Birch
   ...: from scipy import sparse as sp
   ...: import numpy as np
   ...: 
   ...: X = sp.random(5000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
   ...: %time Birch().fit(X)
CPU times: user 41.6 s, sys: 1.15 s, total: 42.7 s
Wall time: 10.9 s
  • on this branch:
In [1]: from sklearn.cluster import Birch
   ...: from scipy import sparse as sp
   ...: import numpy as np
   ...: 
   ...: X = sp.random(5000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
   ...: %time Birch().fit(X)
CPU times: user 1min 24s, sys: 619 ms, total: 1min 25s
Wall time: 12.6 s

and for LOF:

  • on main:
In [1]: from sklearn.neighbors import LocalOutlierFactor
   ...: from scipy import sparse as sp
   ...: import numpy as np
   ...: 
   ...: X = sp.random(50000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
   ...: %time lof = LocalOutlierFactor().fit(X)
   ...: lof.negative_outlier_factor_
CPU times: user 25.3 s, sys: 3.21 s, total: 28.5 s
Wall time: 28.5 s
Out[1]: 
array([-2.25244  , -3.163138 , -3.0220103, ..., -3.5005074, -3.367448 ,
       -2.2913156], dtype=float32)
  • on this branch:
In [1]: from sklearn.neighbors import LocalOutlierFactor
   ...: from scipy import sparse as sp
   ...: import numpy as np
   ...: 
   ...: X = sp.random(50000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
   ...: %time lof = LocalOutlierFactor().fit(X)
   ...: lof.negative_outlier_factor_
CPU times: user 4min 48s, sys: 698 ms, total: 4min 49s
Wall time: 37.5 s
Out[1]: 
array([-2.25244047, -3.16313819, -3.02201071, ..., -3.50050721,
       -3.36744845, -2.29131578])

Also not that the .negative_outlier_factor_ fitted attribute used to be float32 on main while it is upcasted to float64 in this branch.

sklearn/neighbors/tests/test_lof.py Show resolved Hide resolved
sklearn/metrics/tests/test_pairwise_distances_reduction.py Outdated Show resolved Hide resolved
@jjerphan
Copy link
Member Author

jjerphan commented Sep 15, 2022

The previous back-end is sometimes more performant because it manages to use the same decomposition for chunks of the Squared Euclidean distance matrix used GRMM in the dense case, namely:

$$
\mathbf D^{\odot 2}d \left(\mathbf{X}{c}^{(l)}, \mathbf{Y}{c}^{(k)}\right) \overset{\text{euclidean case}}{=} \left[\left\Vert \left({\mathbf{X}c^{(l)}}\right){i\cdot}\right\Vert^2_2 \right]{(i,j)\in [c]\times[c]} + \left[\left\Vert \left({\mathbf{Y}c^{(k)}}\right){j\cdot }\right\Vert^2_2 \right]_{(i, j)\in [c]\times[c]} - 2 \mathbf{X}_c^{(l)} {\mathbf{Y}_c^{(k)}}^\top
$$

where the two first terms (likely) are computed once and are reused and the last is computed via safe_sparse_dot, e.g. euclidean_distances for the sparse float32 case:

d = -2 * safe_sparse_dot(X_chunk, Y_chunk.T, dense_output=True)
d += XX_chunk
d += YY_chunk

In this case, safe_sparse_dot calls scipy._csr_array.__matmul__ which ultimately routes to the following C++ routines:

  • csr_matmat for the CSR × CSR case
  • csr_matvecs for the CSR × Dense case
  • csc_matvecs for the Dense × CSR case (seen as transposed result of the CSR × Dense case)

I think we might be able to use the Euclidean specialisation for the fused {sparse, dense}² by extending the GEMMTermComputer case to call relevant SciPy's C++ routines (in this case GEMMTermComputer will be renamed because GEMM won't be called since sparse data will be supported used).

I would prefer exploring this in another PR and in the meantime just deactivate the new implementations proposed in this PR for user-facing APIs that are subjects to regressions (by using a sklearn.config_context-context manager or by overriding is_usable_for to return False if metric="*euclidean" and at least one sparse dataset is passed as done in 1d7bcc7).

What do you think?

`py-spy` profile
# script.py
from sklearn.cluster import Birch
from scipy import sparse as sp
import numpy as np

X = sp.random(5000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
Birch().fit(X)
py-spy record --native -o py-spy-`git rev-parse HEAD`.svg -f speedscope -- python ./script.py

On main: py-spy-profile-851d02f2ae0dbf258f7cd2f1214ba47c79d1737a

On this branch: py-spy-profile-1eb5b2c3ce31f96eff0922adbdd236c5f0a5c879

To browse on https://www.speedscope.app/

@ogrisel
Copy link
Member

ogrisel commented Sep 16, 2022

What do you think?

Thanks for your investigations. I agree with your plan.

Before merging this, what do you think of #23585 (comment) ?

@jjerphan
Copy link
Member Author

Before merging this, what do you think of #23585 (comment) ?

Oh, I already have accepted this sensible suggestion in fcf15b6. 🙂

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I did some manual tests quickly with the updated PR and did not observe any regression.

Copy link
Contributor

@Micky774 Micky774 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for taking so long to look at this. Overall looks great! I just left a minor suggestion for simplifying the tests a bit, and a couple of optional nits.

Edit: Should this also remove the comments here:

# TODO: once ArgKmin supports sparse input matrices and 32 bit,
# we won't need to fallback to pairwise_distances_chunked anymore.

and here:

# TODO: once ArgKmin supports sparse input matrices and 32 bit,
# we won't need to fallback to pairwise_distances_chunked anymore.

jjerphan and others added 2 commits September 20, 2022 09:56
…and add a space somewhere for proper formatting.

Co-authored-by: Meekail Zain <Micky774@users.noreply.github.com>
See: `BaseDistanceReductionDispatcher.valid_metrics`

Co-authored-by: Meekail Zain <Micky774@users.noreply.github.com>
@ogrisel ogrisel merged commit 60cc5b5 into scikit-learn:main Sep 20, 2022
@ogrisel
Copy link
Member

ogrisel commented Sep 20, 2022

Merged! Thank you @jjerphan and everybody else!

@jjerphan
Copy link
Member Author

Thank you for the reviews, @ogrisel, @thomasjpfan and @Micky774!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants