Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF Implement PairwiseDistancesReduction backend for RadiusNeighbors.predict_proba #26828

Merged
merged 17 commits into from
Sep 25, 2023

Conversation

OmarManzoor
Copy link
Contributor

@OmarManzoor OmarManzoor commented Jul 13, 2023

Reference Issues/PRs

Towards: #25888

What does this implement/fix? Explain your changes.

  • Adds a dedicated pairwise distance backend for the predict probabilities functionality of radius neighbors.
  • Adds a dispatcher for this new backend
  • Uses this backend called RadiusNeighborsClassMode in the predict_proba method of RadiusNeighborsClassifier for specific conditions.

Up to 3× speed-ups, see: #26828 (review)

Any other comments?

cc: @jjerphan

@github-actions
Copy link

github-actions bot commented Jul 13, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: a1e722d. Link to the linter CI: here

@OmarManzoor OmarManzoor changed the title PERF Implement PairwiseDistancesReduction backend for RadiusNeighbors PERF Implement PairwiseDistancesReduction backend for RadiusNeighbors predict_proba Jul 13, 2023
@OmarManzoor
Copy link
Contributor Author

OmarManzoor commented Jul 20, 2023

Some Benchmarks using ASV

· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For scikit-learn commit 65d8b966 <radius_neighbors_predict_pwdb> (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[50.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba                                                                                                                                              ok
[50.00%] ··· ========= =============
             --        n_test / n_features
             --------- -------------
              n_train   10000 / 300 
             ========= =============
               100000    41.6±0.2s  
             ========= =============

[50.00%] · For scikit-learn commit 335b5556 <main> (round 1/1):
[50.00%] ·· Building for conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl..
[50.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[100.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba                                                                                                                                              ok
[100.00%] ··· ========= =============
              --        n_test / n_features
              --------- -------------
               n_train   10000 / 300 
              ========= =============
                100000    1.33±0.04m 
              ========= =============

| Change   | Before [335b5556] <main>   | After [65d8b966] <radius_neighbors_predict_pwdb>   |   Ratio | Benchmark (Parameter)                                                                     |
|----------|----------------------------|----------------------------------------------------|---------|-------------------------------------------------------------------------------------------|
| -        | 1.33±0.04m                 | 41.6±0.2s                                          |    0.52 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(100000, 10000, 300) |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Code used for the ASV benchmarks

@OmarManzoor
Copy link
Contributor Author

· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For scikit-learn commit 65d8b966 <radius_neighbors_predict_pwdb> (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[50.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba                                                                                                                                              ok
[50.00%] ··· ========= ============
             --        n_test / n_features
             --------- ------------
              n_train   1000 / 100 
             ========= ============
               100000   1.76±0.01s 
             ========= ============

[50.00%] · For scikit-learn commit 335b5556 <main> (round 1/1):
[50.00%] ·· Building for conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl.
[50.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[100.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba                                                                                                                                              ok
[100.00%] ··· ========= ============
              --        n_test / n_features
              --------- ------------
               n_train   1000 / 100 
              ========= ============
                100000   2.01±0.1s  
              ========= ============

| Change   | Before [335b5556] <main>   | After [65d8b966] <radius_neighbors_predict_pwdb>   |   Ratio | Benchmark (Parameter)                                                                    |
|----------|----------------------------|----------------------------------------------------|---------|------------------------------------------------------------------------------------------|
| -        | 2.01±0.1s                  | 1.76±0.01s                                         |    0.88 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(100000, 1000, 100) |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

ASV benchmark code

@jjerphan
Copy link
Member

Thanks!

A few remarks:

@OmarManzoor OmarManzoor marked this pull request as ready for review July 20, 2023 11:52
@OmarManzoor
Copy link
Contributor Author

Thanks!

A few remarks:

How many parameter variations should I put? Because the more there are the more time it will taken on my local system.

@jjerphan
Copy link
Member

How many parameter variations should I put? Because the more there are the more time it will taken on my local system.

On a laptop I would first use the --quick options from asv continuous and those values:

    param_names = [
        "n_train",
        "n_test",
        "n_features",
        "metric",
        "strategy",
        "dtype",
        "X_train",
        "X_test",
    ]
    params = [
        [1000, 10_000, int(1e7)],
        [1000, 10_000, 100_000],
        [100],
        ["manhattan"],
        ["auto"],
        [np.float64],
        ["dense"],
        ["dense"],
    ]

@OmarManzoor
Copy link
Contributor Author

How many parameter variations should I put? Because the more there are the more time it will taken on my local system.

On a laptop I would first use the --quick options from asv continuous and those values:

    param_names = [
        "n_train",
        "n_test",
        "n_features",
        "metric",
        "strategy",
        "dtype",
        "X_train",
        "X_test",
    ]
    params = [
        [1000, 10_000, int(1e7)],
        [1000, 10_000, 100_000],
        [100],
        ["manhattan"],
        ["auto"],
        [np.float64],
        ["dense"],
        ["dense"],
    ]

Thanks for specifying the values. Currently for strategy we have fixed it to "parallel_on_X" and not "auto".

@OmarManzoor
Copy link
Contributor Author

@jjerphan For larger values it is timing out. I used this command to run

asv continuous --quick -b BruteRadiusNeighborsClass upstream/main HEAD

For parameters: 10000000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense'

asv: benchmark timed out (timeout 500s)

For parameters: 10000000, 100000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense'

asv: benchmark timed out (timeout 500s)

For the relatively smaller values for train, I basically used the same way to calculate radius as done in the asv benchmark suite for pairwise distances. But I think this radius turns out to be smaller than required and as a result we get outliers and the error associated with outliers since we aren't specifying an outlier label.

dist_mat = cdist(
            (self.X_train if X_train == "dense" else self.X_train.toarray())[:1000],
            (self.X_test if X_test == "dense" else self.X_test.toarray())[:10],
        )

self.radius = np.quantile(a=dist_mat.ravel(), q=0.001)

ValueError: No neighbors found for test samples array(
[  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12, ....., 
   ....., 990, 991, 992, 993, 994, 995, 996, 997, 998, 999
]), 
you can try using larger radius, giving a label for outliers, or considering removing them from your dataset.

@jjerphan
Copy link
Member

Just do not use the largest value for datasets' size and use a larger quantile (e.g. q=0.01).

@OmarManzoor
Copy link
Contributor Author

OmarManzoor commented Jul 24, 2023

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
·· Installing 65d8b966 <radius_neighbors_predict_pwdb> into conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl.
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For scikit-learn commit 65d8b966 <radius_neighbors_predict_pwdb> (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[50.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba                                                                                                                                              ok
[50.00%] ··· ========= ======================================================== =========================================================
             --                                     n_test / n_features / metric / dtype / X_train / X_test                              
             --------- ------------------------------------------------------------------------------------------------------------------
              n_train   1000 / 100 / manhattan / numpy.float64 / dense / dense   10000 / 100 / manhattan / numpy.float64 / dense / dense 
             ========= ======================================================== =========================================================
                1000                           33.3±0ms                                                  159±0ms                         
               10000                           194±0ms                                                   1.82±0s                         
             ========= ======================================================== =========================================================

[50.00%] · For scikit-learn commit 335b5556 <main> (round 1/1):
[50.00%] ·· Building for conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl..
[50.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[100.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba                                                                                                                                              ok
[100.00%] ··· ========= ======================================================== =========================================================
              --                                     n_test / n_features / metric / dtype / X_train / X_test                              
              --------- ------------------------------------------------------------------------------------------------------------------
               n_train   1000 / 100 / manhattan / numpy.float64 / dense / dense   10000 / 100 / manhattan / numpy.float64 / dense / dense 
              ========= ======================================================== =========================================================
                 1000                           69.8±0ms                                                  212±0ms                         
                10000                           233±0ms                                                   2.01±0s                         
              ========= ======================================================== =========================================================

| Change   | Before [335b5556] <main>   | After [65d8b966] <radius_neighbors_predict_pwdb>   |   Ratio | Benchmark (Parameter)                                                                                                                            |
|----------|----------------------------|----------------------------------------------------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| -        | 2.01±0s                    | 1.82±0s                                            |    0.91 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| -        | 233±0ms                    | 194±0ms                                            |    0.83 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(10000, 1000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense')  |
| -        | 212±0ms                    | 159±0ms                                            |    0.75 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense')  |
| -        | 69.8±0ms                   | 33.3±0ms                                           |    0.48 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense')   |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Code used for ASV benchmark

@jjerphan
Copy link
Member

Thanks !

Could you check that there is no memory leak using a profiler such as memory_profiler
?

@OmarManzoor
Copy link
Contributor Author

Thanks !

Could you check that there is no memory leak using a profiler such as memory_profiler ?

Should I just run a simply python script under the memory profiler, that fits the RadiusNeighborsClassifier and then runs predict_proba?

@jjerphan
Copy link
Member

Yes, we want to see if memory leaks in this new implementation.

@OmarManzoor
Copy link
Contributor Author

OmarManzoor commented Jul 26, 2023

Output of memory profiler python -m memory_profiler test_radius_neighbors.py on this script Script

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   672  127.344 MiB  127.344 MiB           1       @classmethod
   673                                             @profile
   674                                             def compute(
   675                                                 cls,
   676                                                 X,
   677                                                 Y,
   678                                                 radius,
   679                                                 weights,
   680                                                 labels,
   681                                                 unique_labels,
   682                                                 outlier_label,
   683                                                 metric="euclidean",
   684                                                 chunk_size=None,
   685                                                 metric_kwargs=None,
   686                                                 strategy=None,
   687                                             ):
   688                                                 """Return the results of the reduction for the given arguments.
   689                                                 Parameters
   690                                                 ----------
   691                                                 X : ndarray of shape (n_samples_X, n_features)
   692                                                     The input array to be labelled.
   693                                                 Y : ndarray of shape (n_samples_Y, n_features)
   694                                                     The input array whose labels are provided through the `labels`
   695                                                     parameter.
   696                                                 radius : float
   697                                                     The radius defining the neighborhood.
   698                                                 weights : ndarray
   699                                                     The weights applied over the `labels` of `Y` when computing the
   700                                                     weighted mode of the labels.
   701                                                 labels : ndarray
   702                                                     An array containing the index of the class membership of the
   703                                                     associated samples in `Y`. This is used in labeling `X`.
   704                                                 unique_classes : ndarray
   705                                                     An array containing all unique class labels.
   706                                                 outlier_label : int, default=None
   707                                                     Label for outlier samples (samples with no neighbors in given
   708                                                     radius).
   709                                                 metric : str, default='euclidean'
   710                                                     The distance metric to use. For a list of available metrics, see
   711                                                     the documentation of :class:`~sklearn.metrics.DistanceMetric`.
   712                                                     Currently does not support `'precomputed'`.
   713                                                 chunk_size : int, default=None,
   714                                                     The number of vectors per chunk. If None (default) looks-up in
   715                                                     scikit-learn configuration for `pairwise_dist_chunk_size`,
   716                                                     and use 256 if it is not set.
   717                                                 metric_kwargs : dict, default=None
   718                                                     Keyword arguments to pass to specified metric function.
   719                                                 strategy : str, {'auto', 'parallel_on_X', 'parallel_on_Y'}, default=None
   720                                                     The chunking strategy defining which dataset parallelization are made on.
   721                                                     For both strategies the computations happens with two nested loops,
   722                                                     respectively on chunks of X and chunks of Y.
   723                                                     Strategies differs on which loop (outer or inner) is made to run
   724                                                     in parallel with the Cython `prange` construct:
   725                                                       - 'parallel_on_X' dispatches chunks of X uniformly on threads.
   726                                                         Each thread then iterates on all the chunks of Y. This strategy is
   727                                                         embarrassingly parallel and comes with no datastructures
   728                                                         synchronisation.
   729                                                       - 'parallel_on_Y' dispatches chunks of Y uniformly on threads.
   730                                                         Each thread processes all the chunks of X in turn. This strategy is
   731                                                         a sequence of embarrassingly parallel subtasks (the inner loop on Y
   732                                                         chunks) with intermediate datastructures synchronisation at each
   733                                                         iteration of the sequential outer loop on X chunks.
   734                                                       - 'auto' relies on a simple heuristic to choose between
   735                                                         'parallel_on_X' and 'parallel_on_Y': when `X.shape[0]` is large enough,
   736                                                         'parallel_on_X' is usually the most efficient strategy.
   737                                                         When `X.shape[0]` is small but `Y.shape[0]` is large, 'parallel_on_Y'
   738                                                         brings more opportunity for parallelism and is therefore more efficient
   739                                                         despite the synchronization step at each iteration of the outer loop
   740                                                         on chunks of `X`.
   741                                                       - None (default) looks-up in scikit-learn configuration for
   742                                                         `pairwise_dist_parallel_strategy`, and use 'auto' if it is not set.
   743                                                 Returns
   744                                                 -------
   745                                                 probabilities : ndarray of shape (n_samples_X, n_classes)
   746                                                     An array containing the class probabilities for each sample.
   747                                                 """
   748  127.344 MiB    0.000 MiB           1           if weights not in {"uniform", "distance"}:
   749                                                     raise ValueError(
   750                                                         "Only the 'uniform' or 'distance' weights options are supported"
   751                                                         f" at this time. Got: {weights=}."
   752                                                     )
   753  127.344 MiB    0.000 MiB           1           if X.dtype == Y.dtype == np.float64:
   754  901.719 MiB  774.375 MiB           2               return RadiusNeighborsClassMode64.compute(
   755  127.344 MiB    0.000 MiB           1                   X=X,
   756  127.344 MiB    0.000 MiB           1                   Y=Y,
   757  127.344 MiB    0.000 MiB           1                   radius=radius,
   758  127.344 MiB    0.000 MiB           1                   weights=weights,
   759  127.344 MiB    0.000 MiB           1                   class_membership=np.array(labels, dtype=np.intp),
   760  127.344 MiB    0.000 MiB           1                   unique_labels=np.array(unique_labels, dtype=np.intp),
   761  127.344 MiB    0.000 MiB           1                   outlier_label=outlier_label,
   762  127.344 MiB    0.000 MiB           1                   metric=metric,
   763  127.344 MiB    0.000 MiB           1                   chunk_size=chunk_size,
   764  127.344 MiB    0.000 MiB           1                   metric_kwargs=metric_kwargs,
   765  127.344 MiB    0.000 MiB           1                   strategy=strategy,
   766                                                     )
   767                                         
   768                                                 if X.dtype == Y.dtype == np.float32:
   769                                                     return RadiusNeighborsClassMode32.compute(
   770                                                         X=X,
   771                                                         Y=Y,
   772                                                         radius=radius,
   773                                                         weights=weights,
   774                                                         class_membership=np.array(labels, dtype=np.intp),
   775                                                         unique_labels=np.array(unique_labels, dtype=np.intp),
   776                                                         outlier_label=outlier_label,
   777                                                         metric=metric,
   778                                                         chunk_size=chunk_size,
   779                                                         metric_kwargs=metric_kwargs,
   780                                                         strategy=strategy,
   781                                                     )
   782                                         
   783                                                 raise ValueError(
   784                                                     "Only float64 or float32 datasets pairs are supported at this time, "
   785                                                     f"got: X.dtype={X.dtype} and Y.dtype={Y.dtype}."
   786                                                 )


Filename: /Users/omar.salman/Projects/scikit-learn/sklearn/neighbors/_classification.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   697  127.328 MiB  127.328 MiB           1       @profile
   698                                             def predict_proba(self, X):
   699                                                 """Return probability estimates for the test data X.
   700                                         
   701                                                 Parameters
   702                                                 ----------
   703                                                 X : {array-like, sparse matrix} of shape (n_queries, n_features), \
   704                                                         or (n_queries, n_indexed) if metric == 'precomputed'
   705                                                     Test samples.
   706                                         
   707                                                 Returns
   708                                                 -------
   709                                                 p : ndarray of shape (n_queries, n_classes), or a list of \
   710                                                         n_outputs of such arrays if n_outputs > 1.
   711                                                     The class probabilities of the input samples. Classes are ordered
   712                                                     by lexicographic order.
   713                                                 """
   714  127.328 MiB    0.000 MiB           1           check_is_fitted(self, "_fit_method")
   715  127.328 MiB    0.000 MiB           1           n_queries = _num_samples(X)
   716                                         
   717  127.328 MiB    0.000 MiB           2           metric, metric_kwargs = _adjusted_metric(
   718  127.328 MiB    0.000 MiB           1               metric=self.metric, metric_kwargs=self.metric_params, p=self.p
   719                                                 )
   720                                         
   721                                                 if (
   722  127.328 MiB    0.000 MiB           1               self.weights == "uniform"
   723  127.344 MiB    0.000 MiB           3               and self._fit_method == "brute"
   724  127.328 MiB    0.000 MiB           1               and not self.outputs_2d_
   725  127.344 MiB    0.016 MiB           1               and RadiusNeighborsClassMode.is_usable_for(X, self._fit_X, metric)
   726                                                 ):
   727  901.719 MiB  901.719 MiB           2               probabilities = RadiusNeighborsClassMode.compute(
   728  127.344 MiB    0.000 MiB           1                   X=X,
   729  127.344 MiB    0.000 MiB           1                   Y=self._fit_X,
   730  127.344 MiB    0.000 MiB           1                   radius=self.radius,
   731  127.344 MiB    0.000 MiB           1                   weights=self.weights,
   732  127.344 MiB    0.000 MiB           1                   labels=self._y,
   733  127.344 MiB    0.000 MiB           1                   unique_labels=self.classes_,
   734  127.344 MiB    0.000 MiB           1                   outlier_label=self.outlier_label,
   735  127.344 MiB    0.000 MiB           1                   metric=metric,
   736  127.344 MiB    0.000 MiB           1                   metric_kwargs=metric_kwargs,
   737  127.344 MiB    0.000 MiB           1                   strategy="parallel_on_X",
   738                                                     )
   739                                                 else:
   740                                                     neigh_dist, neigh_ind = self.radius_neighbors(X)
   741                                                     outlier_mask = np.zeros(n_queries, dtype=bool)
   742                                                     outlier_mask[:] = [len(nind) == 0 for nind in neigh_ind]
   743                                                     outliers = np.flatnonzero(outlier_mask)
   744                                                     inliers = np.flatnonzero(~outlier_mask)
   745                                         
   746                                                     classes_ = self.classes_
   747                                                     _y = self._y
   748                                                     if not self.outputs_2d_:
   749                                                         _y = self._y.reshape((-1, 1))
   750                                                         classes_ = [self.classes_]
   751                                         
   752                                                     if self.outlier_label_ is None and outliers.size > 0:
   753                                                         raise ValueError(
   754                                                             "No neighbors found for test samples %r, "
   755                                                             "you can try using larger radius, "
   756                                                             "giving a label for outliers, "
   757                                                             "or considering removing them from your dataset." % outliers
   758                                                         )
   759                                         
   760                                                     weights = _get_weights(neigh_dist, self.weights)
   761                                                     if weights is not None:
   762                                                         weights = weights[inliers]
   763                                         
   764                                                     probabilities = []
   765                                                     # iterate over multi-output, measure probabilities of the k-th output.
   766                                                     for k, classes_k in enumerate(classes_):
   767                                                         pred_labels = np.zeros(len(neigh_ind), dtype=object)
   768                                                         pred_labels[:] = [_y[ind, k] for ind in neigh_ind]
   769                                         
   770                                                         proba_k = np.zeros((n_queries, classes_k.size))
   771                                                         proba_inl = np.zeros((len(inliers), classes_k.size))
   772                                         
   773                                                         # samples have different size of neighbors within the same radius
   774                                                         if weights is None:
   775                                                             for i, idx in enumerate(pred_labels[inliers]):
   776                                                                 proba_inl[i, :] = np.bincount(idx, minlength=classes_k.size)
   777                                                         else:
   778                                                             for i, idx in enumerate(pred_labels[inliers]):
   779                                                                 proba_inl[i, :] = np.bincount(
   780                                                                     idx, weights[i], minlength=classes_k.size
   781                                                                 )
   782                                                         proba_k[inliers, :] = proba_inl
   783                                         
   784                                                         if outliers.size > 0:
   785                                                             _outlier_label = self.outlier_label_[k]
   786                                                             label_index = np.flatnonzero(classes_k == _outlier_label)
   787                                                             if label_index.size == 1:
   788                                                                 proba_k[outliers, label_index[0]] = 1.0
   789                                                             else:
   790                                                                 warnings.warn(
   791                                                                     "Outlier label {} is not in training "
   792                                                                     "classes. All class probabilities of "
   793                                                                     "outliers will be assigned with 0."
   794                                                                     "".format(self.outlier_label_[k])
   795                                                                 )
   796                                         
   797                                                         # normalize 'votes' into real [0,1] probabilities
   798                                                         normalizer = proba_k.sum(axis=1)[:, np.newaxis]
   799                                                         normalizer[normalizer == 0.0] = 1.0
   800                                                         proba_k /= normalizer
   801                                         
   802                                                         probabilities.append(proba_k)
   803                                         
   804                                                     if not self.outputs_2d_:
   805                                                         probabilities = probabilities[0]
   806                                         
   807  901.719 MiB    0.000 MiB           1           return probabilities

@jjerphan
Copy link
Member

Thanks for the report. I do not know if the memory is being released after returning.

I think I should have been clearer and I should have asked to report plots using mprof to know if memory usage is back to the normal after the call to RadiusNeighbors, such as similarly done by Olivier in #22320 (review).

I have low bandwidth to guide you at the moment but will try providing more precise suggestions the the previous ones.

@OmarManzoor
Copy link
Contributor Author

Thanks for the report. I do not know if the memory is being released after returning.

I think I should have been clearer and I should have asked to report plots using mprof to know if memory usage is back to the normal after the call to RadiusNeighbors, such as similarly done by Olivier in #22320 (review).

I have low bandwidth to guide you at the moment but will try providing more precise suggestions the the previous ones.

No problem! Thanks for the guidance.

@Micky774
Copy link
Contributor

You mean I should run mprof with this same script on main and share the graph?

Yup! Just to ensure that the memory use isn't significantly greater so as to introduce a regression.

@OmarManzoor
Copy link
Contributor Author

Memory profile on main
Figure_1

@Micky774
Copy link
Contributor

Nice! We actually have some good savings on memory overhead. I still need to make a pass through the code, but will do soon. Thank you for your work @OmarManzoor

Copy link
Contributor

@Micky774 Micky774 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work done so far! Sorry for the delay in review. Let me know if you have any questions/concerns.

doc/whats_new/v1.4.rst Outdated Show resolved Hide resolved
doc/whats_new/v1.4.rst Outdated Show resolved Hide resolved
and not issparse(X)
and not issparse(Y)
# TODO: implement Euclidean specialization with GEMM.
and metric not in ("euclidean", "sqeuclidean")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, we should probably specify this in cls.valid_metrics instead

Comment on lines 705 to 707
outlier_label : int, default=None
Label for outlier samples (samples with no neighbors in given
radius).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May we add to the description the default behavior when outlier_label=None?

Comment on lines 110 to 114
if outlier_label is not None:
self.outlier_label_exists = True
for idx in range(self.unique_labels.shape[0]):
if self.unique_labels[idx] == outlier_label:
self.outlier_index = idx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not possible to use an outlier_label value that is not included in the list of unique labels? Mainly, the unique labels are inferred from the provided labels at the call-site, so if a user wanted to mark all outliers with an out-of-distribution label (e.g. -1) then they'd need to know a-priori to add it to unique_labels.

Either we allow them to provide it independent of unique_labels, or clearly document that it must be present in unique_labels. If we want to keep identical semantics to RadiusNeighborsClassifier, we should do the latter.

From a practitioner perspective, I think this is less than ideal. I imagine, in practice, it is helpful to be able to do this ad-hoc rather than a-priori. That is to say, if you have unique_labels=[1, 2, 3] then allow outlier_label=5 without issue. Even so, that would be a different PR I suppose, since we don't intend for this to really be publicly used...

outlier_label=self.outlier_label,
metric=metric,
metric_kwargs=metric_kwargs,
strategy="parallel_on_X",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there benchmarks to support a fixed dispatch on X rather than an arbitrary heuristic?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is based on the current call to ArgKminClassMode:

probabilities = ArgKminClassMode.compute(
X,
self._fit_X,
k=self.n_neighbors,
weights=self.weights,
Y_labels=self._y,
unique_Y_labels=self.classes_,
metric=metric,
metric_kwargs=metric_kwargs,
# `strategy="parallel_on_X"` has in practice be shown
# to be more efficient than `strategy="parallel_on_Y``
# on many combination of datasets.
# Hence, we choose to enforce it here.
# For more information, see:
# https://github.com/scikit-learn/scikit-learn/pull/24076#issuecomment-1445258342 # noqa
# TODO: adapt the heuristic for `strategy="auto"` for
# `ArgKminClassMode` and use `strategy="auto"`.
strategy="parallel_on_X",
)

Benchmarks you previously made are reported in #24076 (comment).

I think it is worth replicating the comment here as well.

Copy link
Contributor Author

@OmarManzoor OmarManzoor Aug 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I did not run any specific benchmarks with regards to this but it does seem that parallel_on_X runs generally better.

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a first review of the code completing @Micky774's.

Looking back, I think a few adaptations could be made for ArgKminClassMode. Also we might be able to favor some logic, but we better do it after this PR.

Comment on lines 758 to 759
class_membership=np.array(labels, dtype=np.intp),
unique_labels=np.array(unique_labels, dtype=np.intp),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Micky774: This is taken from ArgKminClassMode. Do you remember why the conversion to NumPy arrays are needed here? Is it to support Python lists?

If so, could we perform the conversion on dedicated lines and use np.asarray instead?

Comment on lines 773 to 774
class_membership=np.array(labels, dtype=np.intp),
unique_labels=np.array(unique_labels, dtype=np.intp),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question and comment as the previous one.

sklearn/neighbors/_classification.py Show resolved Hide resolved
sklearn/metrics/_pairwise_distances_reduction/__init__.py Outdated Show resolved Hide resolved
Comment on lines +631 to +634
RadiusNeighborsClassMode is typically used to perform bruteforce
radius neighbors queries when the weighted mode of the labels for
the nearest neighbors within the specified radius are required,
such as in `predict` methods.
Copy link
Member

@jjerphan jjerphan Aug 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR: I think such parts of the doc-string of {ArgKmin,RadiusNeighbors}ClassMode can be made clearer in another pull request.

)


cdef inline void weighted_histogram_mode(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side-note: {ArgKmin,RadiusNeighbors}.weighted_histogram_mode are notably similar. Potentially this and the enum might be factorizable elements for those implementations. I would study this in another PR.

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is another review.

Looking back, I am a bit surprised by the result of asv reported in #26828 (comment): the performance of those implementations reaches the one on main when datasets get taller (i.e. when there are more samples).

I would rerun benchmarks on more configurations and also on a machine with more cores than laptops' to better understand the dynamics.

sklearn/neighbors/_classification.py Outdated Show resolved Hide resolved
weights=weights,
Y_labels=Y_labels,
unique_Y_labels=unique_Y_labels,
outlier_label=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we parametrize this test to also test cases where outlier_label is not None?

Comment on lines +94 to +95
else:
self.weight_type = WeightingStrategy.callable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note related to this PR: As of the presence of the GIL, I do not think we should support Python callable in Cython because this won't be efficient (callable are object which needs to be access holding the GIL, which meant no effective parallelized execution).

doc/whats_new/v1.4.rst Show resolved Hide resolved
sklearn/neighbors/_classification.py Show resolved Hide resolved
@OmarManzoor
Copy link
Contributor Author

@jjerphan @Micky774 Do you have time to revisit this PR?

@jjerphan
Copy link
Member

jjerphan commented Aug 30, 2023

I hardly have bandwidth for scikit-learn for now, but I might in the future.

@jjerphan
Copy link
Member

jjerphan commented Sep 9, 2023

Hi @OmarManzoor,

I reran your ASV benchmark with the -q flags on more cases locally and it seems that your implementation (as of a973325) perform better than scikit-learn's (as of c634b8a). They also support large problems without OOM errors contrarily to the current implementation.

Details

Note that I have adapted n_test, n_train and n_features.

$ asv continuous main radius_neighbors_predict_pwdb -b BruteRadiusNeighborsClass -v -v -v -q
Couldn't load asv.plugins.mamba because
No module named 'mamba'
· Running '/usr/bin/git config init.defaultBranch'
  OUTPUT -------->
  main
· Running '/usr/bin/git rev-parse radius_neighbors_predict_pwdb^{commit}'
  OUTPUT -------->
  a9733250b863e495745bc1672dc58e928e4300b1
· Running '/usr/bin/git rev-parse main^{commit}'
  OUTPUT -------->
  c634b8abbb5d96e0089b593aa04fb5ac80a047ec
· Running '/usr/bin/git config init.defaultBranch'
  OUTPUT -------->
  main
· Creating environments
·· Running '/home/jjerphan/dev/scikit-learn/asv_benchmarks/env/8f3cc7f890c0ac4cefda50d795ce83c3/bin/python -c pass'
· Discovering benchmarks
·· Running '/usr/bin/git rev-parse main^{commit}'
   OUTPUT -------->
   c634b8abbb5d96e0089b593aa04fb5ac80a047ec
·· Running '/home/jjerphan/.local/share/mambaforge/envs/sk/lib/python3.11/site-packages/asv/benchmark.py discover /home/jjerphan/dev/scikit-learn/asv_benchmarks/benchmarks /tmp/tmpk4umbuq5/result.json' in conda-py3.11-cython0.29.36-joblib1.3.2-numpy1.25.2-pandas2.1.0-scipy1.11.2-threadpoolctl3.2.0
·· Running '/home/jjerphan/dev/scikit-learn/asv_benchmarks/env/8f3cc7f890c0ac4cefda50d795ce83c3/bin/python /home/jjerphan/.local/share/mambaforge/envs/sk/lib/python3.11/site-packages/asv/benchmark.py discover /home/jjerphan/dev/scikit-learn/asv_benchmarks/benchmarks /tmp/tmpk4umbuq5/result.json'
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · Running '/usr/bin/git name-rev --name-only --exclude=remotes/* --no-undefined a9733250b863e495745bc1672dc58e928e4300b1'
           OUTPUT -------->
           radius_neighbors_predict_pwdb
[ 0.00%] · For scikit-learn commit a9733250 <radius_neighbors_predict_pwdb> (round 1/1):
[ 0.00%] ·· Running '/usr/bin/git rev-list -n 1 --format=%at a9733250b863e495745bc1672dc58e928e4300b1'
            OUTPUT -------->
            commit a9733250b863e495745bc1672dc58e928e4300b1
            1694283582
[ 0.00%] ·· Benchmarking conda-py3.11-cython0.29.36-joblib1.3.2-numpy1.25.2-pandas2.1.0-scipy1.11.2-threadpoolctl3.2.0
[ 0.00%] ··· Running '/home/jjerphan/.local/share/mambaforge/envs/sk/lib/python3.11/site-packages/asv/benchmark.py run_server /home/jjerphan/dev/scikit-learn/asv_benchmarks/benchmarks /tmp/asv-forkserver-c7h27c4l/socket' in conda-py3.11-cython0.29.36-joblib1.3.2-numpy1.25.2-pandas2.1.0-scipy1.11.2-threadpoolctl3.2.0
[ 0.00%] ··· Running '/home/jjerphan/dev/scikit-learn/asv_benchmarks/env/8f3cc7f890c0ac4cefda50d795ce83c3/bin/python /home/jjerphan/.local/share/mambaforge/envs/sk/lib/python3.11/site-packages/asv/benchmark.py run_server /home/jjerphan/dev/scikit-learn/asv_benchmarks/benchmarks /tmp/asv-forkserver-c7h27c4l/socket'
[50.00%] ··· radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba                                                              ok
[50.00%] ··· ========= ======== ============ ===========================================
             --                                   metric / dtype / X_train / X_test
             ------------------------------- -------------------------------------------
              n_train   n_test   n_features   manhattan / numpy.float64 / dense / dense
             ========= ======== ============ ===========================================
                1000     1000        50                        113±0ms
                1000     1000       100                        131±0ms
                1000     1000       500                        234±0ms
                1000    10000        50                        227±0ms
                1000    10000       100                        260±0ms
                1000    10000       500                        732±0ms
               10000     1000        50                        245±0ms
               10000     1000       100                        329±0ms
               10000     1000       500                        1.18±0s
               10000    10000        50                        879±0ms
               10000    10000       100                        1.39±0s
               10000    10000       500                        7.88±0s
               100000    1000        50                        1.93±0s
               100000    1000       100                        3.05±0s
               100000    1000       500                        13.1±0s
               100000   10000        50                        13.6±0s
               100000   10000       100                        21.8±0s
               100000   10000       500                        1.61±0m
             ========= ======== ============ ===========================================

[50.00%] · Running '/usr/bin/git name-rev --name-only --exclude=remotes/* --no-undefined c634b8abbb5d96e0089b593aa04fb5ac80a047ec'
           OUTPUT -------->
           main
[50.00%] · For scikit-learn commit c634b8ab <main> (round 1/1):
[50.00%] ·· Building for conda-py3.11-cython0.29.36-joblib1.3.2-numpy1.25.2-pandas2.1.0-scipy1.11.2-threadpoolctl3.2.0
[50.00%] ··· Running '/usr/bin/git -c protocol.file.allow=always submodule deinit -f .'
[50.00%] ··· Running '/usr/bin/git checkout -f c634b8abbb5d96e0089b593aa04fb5ac80a047ec'
             OUTPUT -------->
             Previous HEAD position was a9733250b Merge branch 'main' into radius_neighbors_predict_pwdb
             HEAD is now at c634b8abb MNT Deprecate SAMME.R algorithm from AdaBoostClassifier (#26830)
[50.00%] ··· Running '/usr/bin/git clean -fdx'
[50.00%] ··· Running '/usr/bin/git -c protocol.file.allow=always submodule update --init --recursive'
[50.00%] ··· Uninstalling from conda-py3.11-cython0.29.36-joblib1.3.2-numpy1.25.2-pandas2.1.0-scipy1.11.2-threadpoolctl3.2.0
[50.00%] ··· Running '/home/jjerphan/dev/scikit-learn/asv_benchmarks/env/8f3cc7f890c0ac4cefda50d795ce83c3/bin/python -mpip uninstall -y scikit-learn'
             OUTPUT -------->
             Found existing installation: scikit-learn 1.4.dev0
             Uninstalling scikit-learn-1.4.dev0:
               Successfully uninstalled scikit-learn-1.4.dev0
[50.00%] ··· Running '/usr/bin/git name-rev --name-only --exclude=remotes/* --no-undefined c634b8abbb5d96e0089b593aa04fb5ac80a047ec'
             OUTPUT -------->
             main
[50.00%] ··· Installing c634b8ab <main> into conda-py3.11-cython0.29.36-joblib1.3.2-numpy1.25.2-pandas2.1.0-scipy1.11.2-threadpoolctl3.2.0
[50.00%] ··· Running '/home/jjerphan/dev/scikit-learn/asv_benchmarks/env/8f3cc7f890c0ac4cefda50d795ce83c3/bin/python -mpip install /home/jjerphan/dev/scikit-learn/asv_benchmarks/env/8f3cc7f890c0ac4cefda50d795ce83c3/asv-build-cache/c634b8abbb5d96e0089b593aa04fb5ac80a047ec/scikit_learn-1.4.dev0-cp311-cp311-linux_x86_64.whl'
             OUTPUT -------->
             Processing ./asv-build-cache/c634b8abbb5d96e0089b593aa04fb5ac80a047ec/scikit_learn-1.4.dev0-cp311-cp311-linux_x86_64.whl
             Requirement already satisfied: numpy>=1.17.3 in ./lib/python3.11/site-packages (from scikit-learn==1.4.dev0) (1.25.2)
             Requirement already satisfied: scipy>=1.5.0 in ./lib/python3.11/site-packages (from scikit-learn==1.4.dev0) (1.11.2)
             Requirement already satisfied: joblib>=1.1.1 in ./lib/python3.11/site-packages (from scikit-learn==1.4.dev0) (1.3.2)
             Requirement already satisfied: threadpoolctl>=2.0.0 in ./lib/python3.11/site-packages (from scikit-learn==1.4.dev0) (3.2.0)
             Installing collected packages: scikit-learn
             Successfully installed scikit-learn-1.4.dev0
[50.00%] ·· Running '/usr/bin/git rev-list -n 1 --format=%at c634b8abbb5d96e0089b593aa04fb5ac80a047ec'
            OUTPUT -------->
            commit c634b8abbb5d96e0089b593aa04fb5ac80a047ec
            1694116492
[50.00%] ·· Benchmarking conda-py3.11-cython0.29.36-joblib1.3.2-numpy1.25.2-pandas2.1.0-scipy1.11.2-threadpoolctl3.2.0
[50.00%] ··· Running '/home/jjerphan/.local/share/mambaforge/envs/sk/lib/python3.11/site-packages/asv/benchmark.py run_server /home/jjerphan/dev/scikit-learn/asv_benchmarks/benchmarks /tmp/asv-forkserver-9l1sgjfv/socket' in conda-py3.11-cython0.29.36-joblib1.3.2-numpy1.25.2-pandas2.1.0-scipy1.11.2-threadpoolctl3.2.0
[50.00%] ··· Running '/home/jjerphan/dev/scikit-learn/asv_benchmarks/env/8f3cc7f890c0ac4cefda50d795ce83c3/bin/python /home/jjerphan/.local/share/mambaforge/envs/sk/lib/python3.11/site-packages/asv/benchmark.py run_server /home/jjerphan/dev/scikit-learn/asv_benchmarks/benchmarks /tmp/asv-forkserver-9l1sgjfv/socket'
[100.00%] ··· radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba                                                     3/18 failed
[100.00%] ··· ========= ======== ============ ===========================================
              --                                   metric / dtype / X_train / X_test
              ------------------------------- -------------------------------------------
               n_train   n_test   n_features   manhattan / numpy.float64 / dense / dense
              ========= ======== ============ ===========================================
                 1000     1000        50                        150±0ms
                 1000     1000       100                        164±0ms
                 1000     1000       500                        311±0ms
                 1000    10000        50                        385±0ms
                 1000    10000       100                        453±0ms
                 1000    10000       500                        1.23±0s
                10000     1000        50                        450±0ms
                10000     1000       100                        484±0ms
                10000     1000       500                        1.30±0s
                10000    10000        50                        3.34±0s
                10000    10000       100                        3.95±0s
                10000    10000       500                        11.9±0s
                100000    1000        50                        2.38±0s
                100000    1000       100                        3.26±0s
                100000    1000       500                        11.0±0s
                100000   10000        50                         failed
                100000   10000       100                         failed
                100000   10000       500                         failed
              ========= ======== ============ ===========================================

I am currently rerunning them without the -q flags for greater accuracy.

@OmarManzoor
Copy link
Contributor Author

Hi @OmarManzoor,

I reran your ASV benchmark with the -q flags on more cases locally and it seems that your implementation (as of a973325) perform better than scikit-learn's (as of c634b8a). They also support large problems without OOM errors contrarily to the current implementation.

Details
I am currently rerunning them without the -q flags for greater accuracy.

Thank you for running the benchmarks @jjerphan

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have adapted the benchmark script to test on the sparse-dense case as well.

The performance improvements are compelling to me and can get to 3× speed-ups as fast as the current implementation.

There's only one regression out of all the runs; might it be alleviated by #27145?

The implementation looks good to me, from a maintenance perspective it adds a lot of lines but I think this is acceptable.

Yet, I would wait for one or two more approvals to get other point of views on maintenance and on the benchmarked cases before merging this PR.

asv benchmark script
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import rand as sparse_rand
from .common import Benchmark
from sklearn.neighbors import RadiusNeighborsClassifier


class BruteRadiusNeighborsClass(Benchmark):
    param_names = [
        "n_train",
        "n_test",
        "n_features",
        "metric",
        "dtype",
        "X_train",
        "X_test",
    ]
    params = [
        [1000, 10_000],
        [1000, 10_000],
        [10, 50, 100],
        ["manhattan"],
        [np.float64],
        ["dense", "sparse"],
        ["dense"],
    ]

    def setup(self, *params):
        n_train, n_test, n_features, metric, dtype, X_train, X_test = params
        rng = np.random.RandomState(0)
        self.X_train = (
            rng.rand(n_train, n_features).astype(dtype)
            if X_train == "dense"
            else sparse_rand(
                n_train,
                n_features,
                density=0.05,
                format="csr",
                dtype=dtype,
                random_state=rng,
            )
        )
        self.X_test = (
            rng.rand(n_test, n_features).astype(dtype)
            if X_test == "dense"
            else sparse_rand(
                n_test,
                n_features,
                density=0.05,
                format="csr",
                dtype=dtype,
                random_state=rng,
            )
        )
        self.y_train = rng.randint(low=-1, high=1, size=(n_train,))
        self.metric = metric

        dist_mat = cdist(
            (self.X_train if X_train == "dense" else self.X_train.toarray())[:1000],
            (self.X_test if X_test == "dense" else self.X_test.toarray())[:10],
        )

        self.radius = np.quantile(a=dist_mat.ravel(), q=0.05) * 20

        self.rc = RadiusNeighborsClassifier(
            radius=self.radius, algorithm="brute", metric=self.metric,
        )

        self.rc.fit(X=self.X_train, y=self.y_train)

    def time_predict_proba(self, *params):
        self.rc.predict_proba(self.X_test)
asv benchmark results
| Change   | Before [c634b8ab] <main>   | After [a9733250] <radius_neighbors_predict_pwdb>   |   Ratio | Benchmark (Parameter)                                                                                                                                                  |
|----------|----------------------------|----------------------------------------------------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| +        | 379±5ms                    | 466±4ms                                            |    1.23 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 1000, 100, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense')  |
| -        | 201±2ms                    | 180±3ms                                            |    0.9  | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 1000, 50, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense')    |
| -        | 378±6ms                    | 311±2ms                                            |    0.82 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense')  |
| -        | 287±4ms                    | 223±3ms                                            |    0.78 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense')   |
| -        | 141±3ms                    | 111±2ms                                            |    0.78 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 1000, 10, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense')   |
| -        | 60.4±0.8ms                 | 45.3±0.2ms                                         |    0.75 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 100, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense')   |
| -        | 4.19±0.1s                  | 3.13±0.01s                                         |    0.75 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| -        | 238±10ms                   | 168±4ms                                            |    0.71 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 50, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense')   |
| -        | 129±0.5ms                  | 90.0±3ms                                           |    0.7  | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 1000, 10, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense')    |
| -        | 36.1±0.7ms                 | 24.8±0.6ms                                         |    0.69 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 50, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense')    |
| -        | 48.4±0.4ms                 | 32.5±0.3ms                                         |    0.67 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense')    |
| -        | 199±3ms                    | 134±3ms                                            |    0.67 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 50, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense')    |
| -        | 3.28±0.01s                 | 2.11±0.01s                                         |    0.64 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense')  |
| -        | 2.64±0.01s                 | 1.62±0s                                            |    0.61 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 50, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense')  |
| -        | 29.7±0.9ms                 | 17.5±0.09ms                                        |    0.59 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 50, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense')     |
| -        | 134±0.6ms                  | 78.9±4ms                                           |    0.59 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 10, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense')   |
| -        | 2.28±0.01s                 | 1.22±0s                                            |    0.53 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 50, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense')   |
| -        | 20.2±0.8ms                 | 10.1±0.09ms                                        |    0.5  | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 10, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense')    |
| -        | 124±0.7ms                  | 48.6±1ms                                           |    0.39 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 10, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense')    |
| -        | 1.64±0.02s                 | 644±5ms                                            |    0.39 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 10, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense')  |
| -        | 1.52±0.01s                 | 548±8ms                                            |    0.36 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 10, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense')   |
| -        | 23.2±3ms                   | 7.76±0.07ms                                        |    0.33 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 10, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense')     |

@jjerphan jjerphan changed the title PERF Implement PairwiseDistancesReduction backend for RadiusNeighbors predict_proba PERF Implement PairwiseDistancesReduction backend for RadiusNeighbors.predict_proba Sep 10, 2023
Copy link
Contributor

@Micky774 Micky774 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @OmarManzoor! I left a couple comments, one nit and one recommendation for enforcing contiguous memory. Aside from that, +1 from me

Comment on lines 23 to 30
const intp_t[:] Y_labels
const intp_t[:] unique_Y_labels
intp_t outlier_label_index
bint outlier_label_exists
bint outliers_exist
unsigned char[::1] outliers
object outlier_label
float64_t[:, :] class_scores
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can expect/enforce contiguous memory here.

Suggested change
const intp_t[:] Y_labels
const intp_t[:] unique_Y_labels
intp_t outlier_label_index
bint outlier_label_exists
bint outliers_exist
unsigned char[::1] outliers
object outlier_label
float64_t[:, :] class_scores
const intp_t[::1] Y_labels
const intp_t[::1] unique_Y_labels
intp_t outlier_label_index
bint outlier_label_exists
bint outliers_exist
unsigned char[::1] outliers
object outlier_label
float64_t[:, ::1] class_scores

self.outlier_label_index = -1
self.outliers_exist = False
self.outlier_label = outlier_label
self.outlier_label_exists = outlier_label is not None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: I think we can replace self.outlier_label_exists with its literal self.outlier_label is not None. IMO it is actually a bit more readable, but this is personal preference so feel free to dismiss this suggestion

@Micky774 Micky774 merged commit f86f41d into scikit-learn:main Sep 25, 2023
@Micky774
Copy link
Contributor

Thank you very much for the work @OmarManzoor, and thank you @jjerphan for keeping up with review (and helping remind me). Congrats!

@OmarManzoor OmarManzoor deleted the radius_neighbors_predict_pwdb branch September 25, 2023 15:08
lesteve pushed a commit to lesteve/scikit-learn that referenced this pull request Sep 28, 2023
…ors.predict_proba` (scikit-learn#26828)

Co-authored-by: Omar Salman <omar.salman@arbisoft>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
…ors.predict_proba` (scikit-learn#26828)

Co-authored-by: Omar Salman <omar.salman@arbisoft>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants