PERF Implement PairwiseDistancesReduction backend for RadiusNeighbors.predict_proba #26828
Conversation
* Add a test to check results are consistent with both strategies
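A hedged sketch of what such a test could look like (the dispatcher import path and the `compute` keyword names follow this PR as shown later in the thread; this is not the actual test code):

```python
import numpy as np
from numpy.testing import assert_allclose

from sklearn.metrics._pairwise_distances_reduction import RadiusNeighborsClassMode


def test_radius_neighbors_classmode_strategy_consistency():
    rng = np.random.RandomState(0)
    X = rng.rand(40, 5)
    Y = rng.rand(60, 5)
    labels = rng.randint(0, 3, size=60)

    # Compute the class probabilities once per parallelization strategy.
    results = {
        strategy: RadiusNeighborsClassMode.compute(
            X=X,
            Y=Y,
            radius=1.0,
            weights="uniform",
            labels=labels,
            unique_labels=np.unique(labels),
            outlier_label=None,
            metric="manhattan",
            strategy=strategy,
        )
        for strategy in ("parallel_on_X", "parallel_on_Y")
    }
    # Both strategies must yield identical probabilities.
    assert_allclose(results["parallel_on_X"], results["parallel_on_Y"])
```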
Some benchmarks using ASV:

```
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For scikit-learn commit 65d8b966 <radius_neighbors_predict_pwdb> (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[50.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba   ok
[50.00%] ··· ========= =============
             --          n_test / n_features
             --------- -------------
              n_train    10000 / 300
             ========= =============
              100000     41.6±0.2s
             ========= =============
[50.00%] · For scikit-learn commit 335b5556 <main> (round 1/1):
[50.00%] ·· Building for conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl..
[50.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[100.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba   ok
[100.00%] ··· ========= =============
              --          n_test / n_features
              --------- -------------
               n_train    10000 / 300
              ========= =============
               100000     1.33±0.04m
              ========= =============
```

| Change | Before [335b5556] <main> | After [65d8b966] <radius_neighbors_predict_pwdb> | Ratio | Benchmark (Parameter) |
|--------|--------------------------|--------------------------------------------------|-------|-----------------------|
| - | 1.33±0.04m | 41.6±0.2s | 0.52 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(100000, 10000, 300) |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
```
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For scikit-learn commit 65d8b966 <radius_neighbors_predict_pwdb> (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[50.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba   ok
[50.00%] ··· ========= ============
             --          n_test / n_features
             --------- ------------
              n_train    1000 / 100
             ========= ============
              100000     1.76±0.01s
             ========= ============
[50.00%] · For scikit-learn commit 335b5556 <main> (round 1/1):
[50.00%] ·· Building for conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl.
[50.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[100.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba   ok
[100.00%] ··· ========= ============
              --          n_test / n_features
              --------- ------------
               n_train    1000 / 100
              ========= ============
               100000     2.01±0.1s
              ========= ============
```

| Change | Before [335b5556] <main> | After [65d8b966] <radius_neighbors_predict_pwdb> | Ratio | Benchmark (Parameter) |
|--------|--------------------------|--------------------------------------------------|-------|-----------------------|
| - | 2.01±0.1s | 1.76±0.01s | 0.88 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(100000, 1000, 100) |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
Thanks! A few remarks:
How many parameter variations should I include? The more there are, the longer the runs will take on my local system.
On a laptop I would first use the following parameters:

```python
param_names = [
    "n_train",
    "n_test",
    "n_features",
    "metric",
    "strategy",
    "dtype",
    "X_train",
    "X_test",
]
params = [
    [1000, 10_000, int(1e7)],
    [1000, 10_000, 100_000],
    [100],
    ["manhattan"],
    ["auto"],
    [np.float64],
    ["dense"],
    ["dense"],
]
```
Thanks for specifying the values. Note that for `strategy` we currently fix it to `"parallel_on_X"`, not `"auto"`.
@jjerphan For larger values it is timing out. I used this command to run:

```
asv continuous --quick -b BruteRadiusNeighborsClass upstream/main HEAD
```

```
For parameters: 10000000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense'
asv: benchmark timed out (timeout 500s)
For parameters: 10000000, 100000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense'
asv: benchmark timed out (timeout 500s)
```

For the relatively smaller train sizes, I computed the radius the same way as in the ASV benchmark suite for pairwise distances. But I think this radius turns out to be smaller than required, so we get outliers, and the error associated with outliers, since we aren't specifying an outlier label.

```python
dist_mat = cdist(
    (self.X_train if X_train == "dense" else self.X_train.toarray())[:1000],
    (self.X_test if X_test == "dense" else self.X_test.toarray())[:10],
)
self.radius = np.quantile(a=dist_mat.ravel(), q=0.001)
```

```
ValueError: No neighbors found for test samples array(
    [  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12, .....,
     ....., 990, 991, 992, 993, 994, 995, 996, 997, 998, 999]),
you can try using larger radius, giving a label for outliers, or considering removing them from your dataset.
```
Just do not use the largest value for the datasets' size, and use a larger quantile (e.g. …).
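For reference, the benchmark script posted later in this thread settles on a higher quantile scaled up; a minimal standalone sketch of that radius selection (array sizes here are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
X_train = rng.rand(1000, 100)
X_test = rng.rand(100, 100)

# Estimate the scale of pairwise distances on a small subsample, then inflate
# it so that virtually every query has neighbors (hence no outlier errors).
dist_mat = cdist(X_train[:1000], X_test[:10])
radius = np.quantile(a=dist_mat.ravel(), q=0.05) * 20
```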
```
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
·· Installing 65d8b966 <radius_neighbors_predict_pwdb> into conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl.
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For scikit-learn commit 65d8b966 <radius_neighbors_predict_pwdb> (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[50.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba   ok
[50.00%] ··· ========= ======================================================== =========================================================
             --                        n_test / n_features / metric / dtype / X_train / X_test
             --------- --------------------------------------------------------------------------------------------------------------
              n_train   1000 / 100 / manhattan / numpy.float64 / dense / dense   10000 / 100 / manhattan / numpy.float64 / dense / dense
             ========= ======================================================== =========================================================
               1000                         33.3±0ms                                                  159±0ms
               10000                        194±0ms                                                   1.82±0s
             ========= ======================================================== =========================================================
[50.00%] · For scikit-learn commit 335b5556 <main> (round 1/1):
[50.00%] ·· Building for conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl..
[50.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[100.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba   ok
[100.00%] ··· ========= ======================================================== =========================================================
              --                        n_test / n_features / metric / dtype / X_train / X_test
              --------- --------------------------------------------------------------------------------------------------------------
               n_train   1000 / 100 / manhattan / numpy.float64 / dense / dense   10000 / 100 / manhattan / numpy.float64 / dense / dense
              ========= ======================================================== =========================================================
                1000                         69.8±0ms                                                  212±0ms
                10000                        233±0ms                                                   2.01±0s
              ========= ======================================================== =========================================================
```

| Change | Before [335b5556] <main> | After [65d8b966] <radius_neighbors_predict_pwdb> | Ratio | Benchmark (Parameter) |
|--------|--------------------------|--------------------------------------------------|-------|-----------------------|
| - | 2.01±0s | 1.82±0s | 0.91 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 233±0ms | 194±0ms | 0.83 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(10000, 1000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 212±0ms | 159±0ms | 0.75 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 69.8±0ms | 33.3±0ms | 0.48 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
Thanks! Could you check that there is no memory leak, using a profiler such as …?
Should I just run a simple Python script under the memory profiler that fits the RadiusNeighborsClassifier and then runs predict_proba?
Yes, we want to see whether memory leaks in this new implementation.
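For reference, a minimal sketch of such a script (sizes and radius are illustrative; `memory_profiler` prints its line-by-line table when the script is run with `python script.py`):

```python
import numpy as np
from memory_profiler import profile
from scipy.spatial.distance import cdist

from sklearn.neighbors import RadiusNeighborsClassifier

rng = np.random.RandomState(0)
X_train = rng.rand(10_000, 100)
y_train = rng.randint(0, 2, size=10_000)
X_test = rng.rand(1_000, 100)

# Choose a radius large enough that no query ends up as an outlier, following
# the quantile-based recipe used in the benchmarks above.
radius = np.quantile(cdist(X_train[:1000], X_test[:10]).ravel(), q=0.05) * 20


@profile
def fit_and_predict():
    clf = RadiusNeighborsClassifier(
        radius=radius, algorithm="brute", weights="uniform"
    ).fit(X_train, y_train)
    return clf.predict_proba(X_test)


if __name__ == "__main__":
    fit_and_predict()
```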
Output of memory profiler:

```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
672 127.344 MiB 127.344 MiB 1 @classmethod
673 @profile
674 def compute(
675 cls,
676 X,
677 Y,
678 radius,
679 weights,
680 labels,
681 unique_labels,
682 outlier_label,
683 metric="euclidean",
684 chunk_size=None,
685 metric_kwargs=None,
686 strategy=None,
687 ):
688 """Return the results of the reduction for the given arguments.
689 Parameters
690 ----------
691 X : ndarray of shape (n_samples_X, n_features)
692 The input array to be labelled.
693 Y : ndarray of shape (n_samples_Y, n_features)
694 The input array whose labels are provided through the `labels`
695 parameter.
696 radius : float
697 The radius defining the neighborhood.
698 weights : ndarray
699 The weights applied over the `labels` of `Y` when computing the
700 weighted mode of the labels.
701 labels : ndarray
702 An array containing the index of the class membership of the
703 associated samples in `Y`. This is used in labeling `X`.
704 unique_classes : ndarray
705 An array containing all unique class labels.
706 outlier_label : int, default=None
707 Label for outlier samples (samples with no neighbors in given
708 radius).
709 metric : str, default='euclidean'
710 The distance metric to use. For a list of available metrics, see
711 the documentation of :class:`~sklearn.metrics.DistanceMetric`.
712 Currently does not support `'precomputed'`.
713 chunk_size : int, default=None,
714 The number of vectors per chunk. If None (default) looks-up in
715 scikit-learn configuration for `pairwise_dist_chunk_size`,
716 and use 256 if it is not set.
717 metric_kwargs : dict, default=None
718 Keyword arguments to pass to specified metric function.
719 strategy : str, {'auto', 'parallel_on_X', 'parallel_on_Y'}, default=None
720 The chunking strategy defining which dataset parallelization are made on.
721 For both strategies the computations happens with two nested loops,
722 respectively on chunks of X and chunks of Y.
723 Strategies differs on which loop (outer or inner) is made to run
724 in parallel with the Cython `prange` construct:
725 - 'parallel_on_X' dispatches chunks of X uniformly on threads.
726 Each thread then iterates on all the chunks of Y. This strategy is
727 embarrassingly parallel and comes with no datastructures
728 synchronisation.
729 - 'parallel_on_Y' dispatches chunks of Y uniformly on threads.
730 Each thread processes all the chunks of X in turn. This strategy is
731 a sequence of embarrassingly parallel subtasks (the inner loop on Y
732 chunks) with intermediate datastructures synchronisation at each
733 iteration of the sequential outer loop on X chunks.
734 - 'auto' relies on a simple heuristic to choose between
735 'parallel_on_X' and 'parallel_on_Y': when `X.shape[0]` is large enough,
736 'parallel_on_X' is usually the most efficient strategy.
737 When `X.shape[0]` is small but `Y.shape[0]` is large, 'parallel_on_Y'
738 brings more opportunity for parallelism and is therefore more efficient
739 despite the synchronization step at each iteration of the outer loop
740 on chunks of `X`.
741 - None (default) looks-up in scikit-learn configuration for
742 `pairwise_dist_parallel_strategy`, and use 'auto' if it is not set.
743 Returns
744 -------
745 probabilities : ndarray of shape (n_samples_X, n_classes)
746 An array containing the class probabilities for each sample.
747 """
748 127.344 MiB 0.000 MiB 1 if weights not in {"uniform", "distance"}:
749 raise ValueError(
750 "Only the 'uniform' or 'distance' weights options are supported"
751 f" at this time. Got: {weights=}."
752 )
753 127.344 MiB 0.000 MiB 1 if X.dtype == Y.dtype == np.float64:
754 901.719 MiB 774.375 MiB 2 return RadiusNeighborsClassMode64.compute(
755 127.344 MiB 0.000 MiB 1 X=X,
756 127.344 MiB 0.000 MiB 1 Y=Y,
757 127.344 MiB 0.000 MiB 1 radius=radius,
758 127.344 MiB 0.000 MiB 1 weights=weights,
759 127.344 MiB 0.000 MiB 1 class_membership=np.array(labels, dtype=np.intp),
760 127.344 MiB 0.000 MiB 1 unique_labels=np.array(unique_labels, dtype=np.intp),
761 127.344 MiB 0.000 MiB 1 outlier_label=outlier_label,
762 127.344 MiB 0.000 MiB 1 metric=metric,
763 127.344 MiB 0.000 MiB 1 chunk_size=chunk_size,
764 127.344 MiB 0.000 MiB 1 metric_kwargs=metric_kwargs,
765 127.344 MiB 0.000 MiB 1 strategy=strategy,
766 )
767
768 if X.dtype == Y.dtype == np.float32:
769 return RadiusNeighborsClassMode32.compute(
770 X=X,
771 Y=Y,
772 radius=radius,
773 weights=weights,
774 class_membership=np.array(labels, dtype=np.intp),
775 unique_labels=np.array(unique_labels, dtype=np.intp),
776 outlier_label=outlier_label,
777 metric=metric,
778 chunk_size=chunk_size,
779 metric_kwargs=metric_kwargs,
780 strategy=strategy,
781 )
782
783 raise ValueError(
784 "Only float64 or float32 datasets pairs are supported at this time, "
785 f"got: X.dtype={X.dtype} and Y.dtype={Y.dtype}."
786 )
Filename: /Users/omar.salman/Projects/scikit-learn/sklearn/neighbors/_classification.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
697 127.328 MiB 127.328 MiB 1 @profile
698 def predict_proba(self, X):
699 """Return probability estimates for the test data X.
700
701 Parameters
702 ----------
703 X : {array-like, sparse matrix} of shape (n_queries, n_features), \
704 or (n_queries, n_indexed) if metric == 'precomputed'
705 Test samples.
706
707 Returns
708 -------
709 p : ndarray of shape (n_queries, n_classes), or a list of \
710 n_outputs of such arrays if n_outputs > 1.
711 The class probabilities of the input samples. Classes are ordered
712 by lexicographic order.
713 """
714 127.328 MiB 0.000 MiB 1 check_is_fitted(self, "_fit_method")
715 127.328 MiB 0.000 MiB 1 n_queries = _num_samples(X)
716
717 127.328 MiB 0.000 MiB 2 metric, metric_kwargs = _adjusted_metric(
718 127.328 MiB 0.000 MiB 1 metric=self.metric, metric_kwargs=self.metric_params, p=self.p
719 )
720
721 if (
722 127.328 MiB 0.000 MiB 1 self.weights == "uniform"
723 127.344 MiB 0.000 MiB 3 and self._fit_method == "brute"
724 127.328 MiB 0.000 MiB 1 and not self.outputs_2d_
725 127.344 MiB 0.016 MiB 1 and RadiusNeighborsClassMode.is_usable_for(X, self._fit_X, metric)
726 ):
727 901.719 MiB 901.719 MiB 2 probabilities = RadiusNeighborsClassMode.compute(
728 127.344 MiB 0.000 MiB 1 X=X,
729 127.344 MiB 0.000 MiB 1 Y=self._fit_X,
730 127.344 MiB 0.000 MiB 1 radius=self.radius,
731 127.344 MiB 0.000 MiB 1 weights=self.weights,
732 127.344 MiB 0.000 MiB 1 labels=self._y,
733 127.344 MiB 0.000 MiB 1 unique_labels=self.classes_,
734 127.344 MiB 0.000 MiB 1 outlier_label=self.outlier_label,
735 127.344 MiB 0.000 MiB 1 metric=metric,
736 127.344 MiB 0.000 MiB 1 metric_kwargs=metric_kwargs,
737 127.344 MiB 0.000 MiB 1 strategy="parallel_on_X",
738 )
739 else:
740 neigh_dist, neigh_ind = self.radius_neighbors(X)
741 outlier_mask = np.zeros(n_queries, dtype=bool)
742 outlier_mask[:] = [len(nind) == 0 for nind in neigh_ind]
743 outliers = np.flatnonzero(outlier_mask)
744 inliers = np.flatnonzero(~outlier_mask)
745
746 classes_ = self.classes_
747 _y = self._y
748 if not self.outputs_2d_:
749 _y = self._y.reshape((-1, 1))
750 classes_ = [self.classes_]
751
752 if self.outlier_label_ is None and outliers.size > 0:
753 raise ValueError(
754 "No neighbors found for test samples %r, "
755 "you can try using larger radius, "
756 "giving a label for outliers, "
757 "or considering removing them from your dataset." % outliers
758 )
759
760 weights = _get_weights(neigh_dist, self.weights)
761 if weights is not None:
762 weights = weights[inliers]
763
764 probabilities = []
765 # iterate over multi-output, measure probabilities of the k-th output.
766 for k, classes_k in enumerate(classes_):
767 pred_labels = np.zeros(len(neigh_ind), dtype=object)
768 pred_labels[:] = [_y[ind, k] for ind in neigh_ind]
769
770 proba_k = np.zeros((n_queries, classes_k.size))
771 proba_inl = np.zeros((len(inliers), classes_k.size))
772
773 # samples have different size of neighbors within the same radius
774 if weights is None:
775 for i, idx in enumerate(pred_labels[inliers]):
776 proba_inl[i, :] = np.bincount(idx, minlength=classes_k.size)
777 else:
778 for i, idx in enumerate(pred_labels[inliers]):
779 proba_inl[i, :] = np.bincount(
780 idx, weights[i], minlength=classes_k.size
781 )
782 proba_k[inliers, :] = proba_inl
783
784 if outliers.size > 0:
785 _outlier_label = self.outlier_label_[k]
786 label_index = np.flatnonzero(classes_k == _outlier_label)
787 if label_index.size == 1:
788 proba_k[outliers, label_index[0]] = 1.0
789 else:
790 warnings.warn(
791 "Outlier label {} is not in training "
792 "classes. All class probabilities of "
793 "outliers will be assigned with 0."
794 "".format(self.outlier_label_[k])
795 )
796
797 # normalize 'votes' into real [0,1] probabilities
798 normalizer = proba_k.sum(axis=1)[:, np.newaxis]
799 normalizer[normalizer == 0.0] = 1.0
800 proba_k /= normalizer
801
802 probabilities.append(proba_k)
803
804 if not self.outputs_2d_:
805 probabilities = probabilities[0]
806
   807   901.719 MiB    0.000 MiB           1           return probabilities
```
Thanks for the report. I do not know if the memory is being released after returning. I think I should have been clearer: I should have asked for plots from a memory profiler such as … I have low bandwidth to guide you at the moment, but I will try to provide more precise suggestions than the previous ones.
No problem! Thanks for the guidance.
Yup! Just to ensure that the memory use isn't significantly greater, so as not to introduce a regression.
Nice! We actually have some good savings on memory overhead. I still need to make a pass through the code, but I will do so soon. Thank you for your work @OmarManzoor.
Thanks for the work done so far! Sorry for the delay in review. Let me know if you have any questions/concerns.
```
and not issparse(X)
and not issparse(Y)
# TODO: implement Euclidean specialization with GEMM.
and metric not in ("euclidean", "sqeuclidean")
```
For consistency, we should probably specify this in `cls.valid_metrics` instead.
```
outlier_label : int, default=None
    Label for outlier samples (samples with no neighbors in given
    radius).
```
May we add to the description the default behavior when `outlier_label=None`?
```
if outlier_label is not None:
    self.outlier_label_exists = True
    for idx in range(self.unique_labels.shape[0]):
        if self.unique_labels[idx] == outlier_label:
            self.outlier_index = idx
```
Is it not possible to use an `outlier_label` value that is not included in the list of unique labels? Mainly, the unique labels are inferred from the provided labels at the call site, so if a user wanted to mark all outliers with an out-of-distribution label (e.g. `-1`) then they'd need to know a priori to add it to `unique_labels`.
Either we allow them to provide it independently of `unique_labels`, or clearly document that it must be present in `unique_labels`. If we want to keep identical semantics to `RadiusNeighborsClassifier`, we should do the latter.
From a practitioner perspective, I think this is less than ideal. I imagine, in practice, it is helpful to be able to do this ad hoc rather than a priori. That is to say, if you have `unique_labels=[1, 2, 3]` then allow `outlier_label=5` without issue. Even so, that would be a different PR I suppose, since we don't intend for this to really be publicly used...
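For context, a minimal sketch of the public-estimator semantics under discussion (values are illustrative; the warning behavior matches the fallback branch shown in the profiler output above):

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

X = np.array([[0.0], [0.1], [0.2]])
y = np.array([1, 2, 3])

# An outlier_label absent from the training classes is accepted at fit time...
clf = RadiusNeighborsClassifier(radius=0.5, outlier_label=5).fit(X, y)

# ...but an outlier query then gets all-zero probabilities, with a warning
# that the outlier label is not among the training classes.
print(clf.predict_proba(np.array([[10.0]])))  # [[0. 0. 0.]]
```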
```
outlier_label=self.outlier_label,
metric=metric,
metric_kwargs=metric_kwargs,
strategy="parallel_on_X",
```
Are there benchmarks to support a fixed dispatch on `X` rather than an arbitrary heuristic?
I think this is based on the current call to `ArgKminClassMode`:

scikit-learn/sklearn/neighbors/_classification.py, lines 327 to 345 in 59048f9:

```python
probabilities = ArgKminClassMode.compute(
    X,
    self._fit_X,
    k=self.n_neighbors,
    weights=self.weights,
    Y_labels=self._y,
    unique_Y_labels=self.classes_,
    metric=metric,
    metric_kwargs=metric_kwargs,
    # `strategy="parallel_on_X"` has in practice be shown
    # to be more efficient than `strategy="parallel_on_Y``
    # on many combination of datasets.
    # Hence, we choose to enforce it here.
    # For more information, see:
    # https://github.com/scikit-learn/scikit-learn/pull/24076#issuecomment-1445258342  # noqa
    # TODO: adapt the heuristic for `strategy="auto"` for
    # `ArgKminClassMode` and use `strategy="auto"`.
    strategy="parallel_on_X",
)
```
Benchmarks you previously made are reported in #24076 (comment).
I think it is worth replicating the comment here as well.
Yes, I did not run any specific benchmarks for this, but it does seem that `parallel_on_X` generally runs better.
Here is a first review of the code, completing @Micky774's. Looking back, I think a few adaptations could be made for `ArgKminClassMode`. Also, we might be able to factor out some logic, but we had better do it after this PR.
```
class_membership=np.array(labels, dtype=np.intp),
unique_labels=np.array(unique_labels, dtype=np.intp),
```
@Micky774: This is taken from `ArgKminClassMode`. Do you remember why the conversions to NumPy arrays are needed here? Is it to support Python lists?
If so, could we perform the conversion on dedicated lines and use `np.asarray` instead?
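For illustration, a small standalone sketch of the difference (not scikit-learn code): `np.asarray` skips the copy when the input is already an ndarray of the requested dtype, whereas `np.array` copies by default.

```python
import numpy as np

labels = np.arange(5, dtype=np.intp)

copied = np.array(labels, dtype=np.intp)     # always allocates a new array
aliased = np.asarray(labels, dtype=np.intp)  # no copy: dtype already matches

print(copied is labels)   # False
print(aliased is labels)  # True

# A plain Python list is converted either way.
print(np.asarray([0, 1, 2], dtype=np.intp))
```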
```
class_membership=np.array(labels, dtype=np.intp),
unique_labels=np.array(unique_labels, dtype=np.intp),
```
Same question and comment as the previous one.
```
RadiusNeighborsClassMode is typically used to perform bruteforce
radius neighbors queries when the weighted mode of the labels for
the nearest neighbors within the specified radius are required,
such as in `predict` methods.
```
Not related to this PR: I think such parts of the doc-string of `{ArgKmin,RadiusNeighbors}ClassMode` can be made clearer in another pull request.
```
)


cdef inline void weighted_histogram_mode(
```
Side-note: `{ArgKmin,RadiusNeighbors}.weighted_histogram_mode` are notably similar. Potentially this and the `enum` might be factorizable elements for those implementations. I would study this in another PR.
Here is another review.
Looking back, I am a bit surprised by the ASV result reported in #26828 (comment): the performance of those implementations approaches the one on `main` when datasets get taller (i.e. when there are more samples).
I would rerun benchmarks on more configurations and also on a machine with more cores than laptops' to better understand the dynamics.
```
weights=weights,
Y_labels=Y_labels,
unique_Y_labels=unique_Y_labels,
outlier_label=None,
```
Can we parametrize this test to also test cases where `outlier_label` is not `None`?
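A hedged sketch of such a parametrization (keyword names follow the `compute` signature shown in the profiler output above; the actual test helper may differ):

```python
import numpy as np
import pytest

from sklearn.metrics._pairwise_distances_reduction import RadiusNeighborsClassMode


@pytest.mark.parametrize("outlier_label", [None, 0])
def test_radius_neighbors_classmode_outlier_label(outlier_label):
    rng = np.random.RandomState(1)
    X = rng.rand(20, 4)
    Y = rng.rand(50, 4)
    labels = rng.randint(0, 2, size=50)

    probabilities = RadiusNeighborsClassMode.compute(
        X=X,
        Y=Y,
        radius=5.0,  # larger than any Manhattan distance here: no outliers
        weights="uniform",
        labels=labels,
        unique_labels=np.unique(labels),
        outlier_label=outlier_label,
        metric="manhattan",
        strategy="parallel_on_X",
    )
    # With no outliers, every row is a proper probability distribution and
    # the result should not depend on outlier_label.
    assert np.allclose(probabilities.sum(axis=1), 1.0)
```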
```
else:
    self.weight_type = WeightingStrategy.callable
```
Not related to this PR: given the presence of the GIL, I do not think we should support Python callables in Cython, because this won't be efficient (callables are Python objects which need to be accessed while holding the GIL, which means no effectively parallelized execution).
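As an aside, a hedged sketch of the consequence at the estimator level (sizes and radius are illustrative): string weights can dispatch to the new Cython backend, while a Python callable keeps the pure-Python path, since the fast branch in the profiled `predict_proba` above requires `weights == "uniform"`.

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(500, 10)
y = rng.randint(0, 2, size=500)

# Eligible for the new backend: string weights, brute force, dense data,
# and a non-(sq)euclidean metric.
fast = RadiusNeighborsClassifier(
    radius=4.0, algorithm="brute", metric="manhattan", weights="uniform"
).fit(X, y)

# A callable must be evaluated while holding the GIL, so this configuration
# falls back to the pure-Python implementation.
slow = RadiusNeighborsClassifier(
    radius=4.0, algorithm="brute", metric="manhattan",
    weights=lambda dist: 1.0 / (1.0 + dist),
).fit(X, y)

fast.predict_proba(X[:5])
slow.predict_proba(X[:5])
```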
I hardly have any bandwidth for scikit-learn right now, but I might in the future.
Hi @OmarManzoor, I reran your ASV benchmark; note that I have adapted …
I am currently rerunning them without the …
Thank you for running the benchmarks @jjerphan
I have adapted the benchmark script to test on the sparse-dense case as well.
The performance improvements are compelling to me, reaching 3× speed-ups over the current implementation.
There's only one regression out of all the runs; might it be alleviated by #27145?
The implementation looks good to me; from a maintenance perspective it adds a lot of lines, but I think this is acceptable.
Yet I would wait for one or two more approvals, to get other points of view on maintenance and on the benchmarked cases, before merging this PR.
asv benchmark script:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import rand as sparse_rand

from .common import Benchmark
from sklearn.neighbors import RadiusNeighborsClassifier


class BruteRadiusNeighborsClass(Benchmark):
    param_names = [
        "n_train",
        "n_test",
        "n_features",
        "metric",
        "dtype",
        "X_train",
        "X_test",
    ]
    params = [
        [1000, 10_000],
        [1000, 10_000],
        [10, 50, 100],
        ["manhattan"],
        [np.float64],
        ["dense", "sparse"],
        ["dense"],
    ]

    def setup(self, *params):
        n_train, n_test, n_features, metric, dtype, X_train, X_test = params
        rng = np.random.RandomState(0)
        self.X_train = (
            rng.rand(n_train, n_features).astype(dtype)
            if X_train == "dense"
            else sparse_rand(
                n_train,
                n_features,
                density=0.05,
                format="csr",
                dtype=dtype,
                random_state=rng,
            )
        )
        self.X_test = (
            rng.rand(n_test, n_features).astype(dtype)
            if X_test == "dense"
            else sparse_rand(
                n_test,
                n_features,
                density=0.05,
                format="csr",
                dtype=dtype,
                random_state=rng,
            )
        )
        self.y_train = rng.randint(low=-1, high=1, size=(n_train,))
        self.metric = metric
        dist_mat = cdist(
            (self.X_train if X_train == "dense" else self.X_train.toarray())[:1000],
            (self.X_test if X_test == "dense" else self.X_test.toarray())[:10],
        )
        self.radius = np.quantile(a=dist_mat.ravel(), q=0.05) * 20
        self.rc = RadiusNeighborsClassifier(
            radius=self.radius,
            algorithm="brute",
            metric=self.metric,
        )
        self.rc.fit(X=self.X_train, y=self.y_train)

    def time_predict_proba(self, *params):
        self.rc.predict_proba(self.X_test)
```
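For reference, the script above can be compared between branches with the same ASV invocation quoted earlier in the thread:

```
asv continuous --quick -b BruteRadiusNeighborsClass upstream/main HEAD
```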
asv benchmark results:
| Change | Before [c634b8ab] <main> | After [a9733250] <radius_neighbors_predict_pwdb> | Ratio | Benchmark (Parameter) |
|----------|----------------------------|----------------------------------------------------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| + | 379±5ms | 466±4ms | 1.23 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 1000, 100, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 201±2ms | 180±3ms | 0.9 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 1000, 50, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 378±6ms | 311±2ms | 0.82 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 287±4ms | 223±3ms | 0.78 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 141±3ms | 111±2ms | 0.78 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 1000, 10, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 60.4±0.8ms | 45.3±0.2ms | 0.75 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 100, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 4.19±0.1s | 3.13±0.01s | 0.75 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 238±10ms | 168±4ms | 0.71 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 50, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 129±0.5ms | 90.0±3ms | 0.7 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 1000, 10, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 36.1±0.7ms | 24.8±0.6ms | 0.69 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 50, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 48.4±0.4ms | 32.5±0.3ms | 0.67 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 199±3ms | 134±3ms | 0.67 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 50, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 3.28±0.01s | 2.11±0.01s | 0.64 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 2.64±0.01s | 1.62±0s | 0.61 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 50, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 29.7±0.9ms | 17.5±0.09ms | 0.59 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 50, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 134±0.6ms | 78.9±4ms | 0.59 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 10, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 2.28±0.01s | 1.22±0s | 0.53 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 50, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 20.2±0.8ms | 10.1±0.09ms | 0.5 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 10, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 124±0.7ms | 48.6±1ms | 0.39 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 10, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 1.64±0.02s | 644±5ms | 0.39 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 10, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 1.52±0.01s | 548±8ms | 0.36 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 10, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 23.2±3ms | 7.76±0.07ms | 0.33 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 10, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
Looks good, thanks @OmarManzoor! I left a couple comments, one nit and one recommendation for enforcing contiguous memory. Aside from that, +1 from me
```
const intp_t[:] Y_labels
const intp_t[:] unique_Y_labels
intp_t outlier_label_index
bint outlier_label_exists
bint outliers_exist
unsigned char[::1] outliers
object outlier_label
float64_t[:, :] class_scores
```
I think we can expect/enforce contiguous memory here.
Suggested change:

```
const intp_t[::1] Y_labels
const intp_t[::1] unique_Y_labels
intp_t outlier_label_index
bint outlier_label_exists
bint outliers_exist
unsigned char[::1] outliers
object outlier_label
float64_t[:, ::1] class_scores
```
```
self.outlier_label_index = -1
self.outliers_exist = False
self.outlier_label = outlier_label
self.outlier_label_exists = outlier_label is not None
```
NIT: I think we can replace `self.outlier_label_exists` with its literal `self.outlier_label is not None`. IMO it is actually a bit more readable, but this is personal preference, so feel free to dismiss this suggestion.
Thank you very much for the work @OmarManzoor, and thank you @jjerphan for keeping up with review (and helping remind me). Congrats!
…ors.predict_proba` (scikit-learn#26828) Co-authored-by: Omar Salman <omar.salman@arbisoft> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Reference Issues/PRs
Towards: #25888
What does this implement/fix? Explain your changes.
Up to 3× speed-ups, see: #26828 (review)
Any other comments?
cc: @jjerphan