PERF Implement PairwiseDistancesReduction backend for RadiusNeighbors.predict_proba #26828
Conversation
* Add a test to check results are consistent with both strategies
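A hedged sketch of what such a test could look like (the dispatcher import path and the `compute` keyword names follow this PR as shown later in the thread; this is not the actual test code):

```python
import numpy as np
from numpy.testing import assert_allclose

from sklearn.metrics._pairwise_distances_reduction import RadiusNeighborsClassMode


def test_radius_neighbors_classmode_strategy_consistency():
    rng = np.random.RandomState(0)
    X = rng.rand(40, 5)
    Y = rng.rand(60, 5)
    labels = rng.randint(0, 3, size=60)

    # Compute the class probabilities once per parallelization strategy.
    results = {
        strategy: RadiusNeighborsClassMode.compute(
            X=X,
            Y=Y,
            radius=1.0,
            weights="uniform",
            labels=labels,
            unique_labels=np.unique(labels),
            outlier_label=None,
            metric="manhattan",
            strategy=strategy,
        )
        for strategy in ("parallel_on_X", "parallel_on_Y")
    }
    # Both strategies must yield identical probabilities.
    assert_allclose(results["parallel_on_X"], results["parallel_on_Y"])
```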
Some benchmarks using ASV:

```
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For scikit-learn commit 65d8b966 <radius_neighbors_predict_pwdb> (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[50.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba   ok
[50.00%] ··· ========= =============
             --          n_test / n_features
             --------- -------------
              n_train    10000 / 300
             ========= =============
              100000     41.6±0.2s
             ========= =============
[50.00%] · For scikit-learn commit 335b5556 <main> (round 1/1):
[50.00%] ·· Building for conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl..
[50.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[100.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba   ok
[100.00%] ··· ========= =============
              --          n_test / n_features
              --------- -------------
               n_train    10000 / 300
              ========= =============
               100000     1.33±0.04m
              ========= =============
```

| Change | Before [335b5556] <main> | After [65d8b966] <radius_neighbors_predict_pwdb> | Ratio | Benchmark (Parameter) |
|--------|--------------------------|--------------------------------------------------|-------|-----------------------|
| - | 1.33±0.04m | 41.6±0.2s | 0.52 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(100000, 10000, 300) |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
```
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For scikit-learn commit 65d8b966 <radius_neighbors_predict_pwdb> (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[50.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba   ok
[50.00%] ··· ========= ============
             --          n_test / n_features
             --------- ------------
              n_train    1000 / 100
             ========= ============
              100000     1.76±0.01s
             ========= ============
[50.00%] · For scikit-learn commit 335b5556 <main> (round 1/1):
[50.00%] ·· Building for conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl.
[50.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[100.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba   ok
[100.00%] ··· ========= ============
              --          n_test / n_features
              --------- ------------
               n_train    1000 / 100
              ========= ============
               100000     2.01±0.1s
              ========= ============
```

| Change | Before [335b5556] <main> | After [65d8b966] <radius_neighbors_predict_pwdb> | Ratio | Benchmark (Parameter) |
|--------|--------------------------|--------------------------------------------------|-------|-----------------------|
| - | 2.01±0.1s | 1.76±0.01s | 0.88 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(100000, 1000, 100) |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
Thanks! A few remarks:
How many parameter variations should I include? The more there are, the longer the runs will take on my local system.
On a laptop I would first use the following parameters:

```python
param_names = [
    "n_train",
    "n_test",
    "n_features",
    "metric",
    "strategy",
    "dtype",
    "X_train",
    "X_test",
]
params = [
    [1000, 10_000, int(1e7)],
    [1000, 10_000, 100_000],
    [100],
    ["manhattan"],
    ["auto"],
    [np.float64],
    ["dense"],
    ["dense"],
]
```
Thanks for specifying the values. Note that for `strategy` we currently fix it to `"parallel_on_X"`, not `"auto"`.
@jjerphan For larger values it is timing out. I used this command to run:

```
asv continuous --quick -b BruteRadiusNeighborsClass upstream/main HEAD
```

```
For parameters: 10000000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense'
asv: benchmark timed out (timeout 500s)
For parameters: 10000000, 100000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense'
asv: benchmark timed out (timeout 500s)
```

For the relatively smaller train sizes, I computed the radius the same way as in the ASV benchmark suite for pairwise distances. But I think this radius turns out to be smaller than required, so we get outliers, and the error associated with outliers, since we aren't specifying an outlier label.

```python
dist_mat = cdist(
    (self.X_train if X_train == "dense" else self.X_train.toarray())[:1000],
    (self.X_test if X_test == "dense" else self.X_test.toarray())[:10],
)
self.radius = np.quantile(a=dist_mat.ravel(), q=0.001)
```

```
ValueError: No neighbors found for test samples array(
    [  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12, .....,
     ....., 990, 991, 992, 993, 994, 995, 996, 997, 998, 999]),
you can try using larger radius, giving a label for outliers, or considering removing them from your dataset.
```
Just do not use the largest value for the datasets' size, and use a larger quantile (e.g. …).
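For reference, the benchmark script posted later in this thread settles on a higher quantile scaled up; a minimal standalone sketch of that radius selection (array sizes here are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
X_train = rng.rand(1000, 100)
X_test = rng.rand(100, 100)

# Estimate the scale of pairwise distances on a small subsample, then inflate
# it so that virtually every query has neighbors (hence no outlier errors).
dist_mat = cdist(X_train[:1000], X_test[:10])
radius = np.quantile(a=dist_mat.ravel(), q=0.05) * 20
```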
```
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
·· Installing 65d8b966 <radius_neighbors_predict_pwdb> into conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl.
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For scikit-learn commit 65d8b966 <radius_neighbors_predict_pwdb> (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[50.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba   ok
[50.00%] ··· ========= ======================================================== =========================================================
             --                        n_test / n_features / metric / dtype / X_train / X_test
             --------- --------------------------------------------------------------------------------------------------------------
              n_train   1000 / 100 / manhattan / numpy.float64 / dense / dense   10000 / 100 / manhattan / numpy.float64 / dense / dense
             ========= ======================================================== =========================================================
               1000                         33.3±0ms                                                  159±0ms
               10000                        194±0ms                                                   1.82±0s
             ========= ======================================================== =========================================================
[50.00%] · For scikit-learn commit 335b5556 <main> (round 1/1):
[50.00%] ·· Building for conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl..
[50.00%] ·· Benchmarking conda-py3.11-cython-joblib-numpy-pandas-scipy-threadpoolctl
[100.00%] ··· radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba   ok
[100.00%] ··· ========= ======================================================== =========================================================
              --                        n_test / n_features / metric / dtype / X_train / X_test
              --------- --------------------------------------------------------------------------------------------------------------
               n_train   1000 / 100 / manhattan / numpy.float64 / dense / dense   10000 / 100 / manhattan / numpy.float64 / dense / dense
              ========= ======================================================== =========================================================
                1000                         69.8±0ms                                                  212±0ms
                10000                        233±0ms                                                   2.01±0s
              ========= ======================================================== =========================================================
```

| Change | Before [335b5556] <main> | After [65d8b966] <radius_neighbors_predict_pwdb> | Ratio | Benchmark (Parameter) |
|--------|--------------------------|--------------------------------------------------|-------|-----------------------|
| - | 2.01±0s | 1.82±0s | 0.91 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 233±0ms | 194±0ms | 0.83 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(10000, 1000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 212±0ms | 159±0ms | 0.75 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 69.8±0ms | 33.3±0ms | 0.48 | radius_neighbors_predict.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
Thanks! Could you check that there is no memory leak, using a profiler such as …?
Should I just run a simple Python script under the memory profiler that fits the RadiusNeighborsClassifier and then runs predict_proba?
Yes, we want to see whether memory leaks in this new implementation.
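For reference, a minimal sketch of such a script (sizes and radius are illustrative; `memory_profiler` prints its line-by-line table when the script is run with `python script.py`):

```python
import numpy as np
from memory_profiler import profile
from scipy.spatial.distance import cdist

from sklearn.neighbors import RadiusNeighborsClassifier

rng = np.random.RandomState(0)
X_train = rng.rand(10_000, 100)
y_train = rng.randint(0, 2, size=10_000)
X_test = rng.rand(1_000, 100)

# Choose a radius large enough that no query ends up as an outlier, following
# the quantile-based recipe used in the benchmarks above.
radius = np.quantile(cdist(X_train[:1000], X_test[:10]).ravel(), q=0.05) * 20


@profile
def fit_and_predict():
    clf = RadiusNeighborsClassifier(
        radius=radius, algorithm="brute", weights="uniform"
    ).fit(X_train, y_train)
    return clf.predict_proba(X_test)


if __name__ == "__main__":
    fit_and_predict()
```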
Output of memory profiler:

```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
672 127.344 MiB 127.344 MiB 1 @classmethod
673 @profile
674 def compute(
675 cls,
676 X,
677 Y,
678 radius,
679 weights,
680 labels,
681 unique_labels,
682 outlier_label,
683 metric="euclidean",
684 chunk_size=None,
685 metric_kwargs=None,
686 strategy=None,
687 ):
688 """Return the results of the reduction for the given arguments.
689 Parameters
690 ----------
691 X : ndarray of shape (n_samples_X, n_features)
692 The input array to be labelled.
693 Y : ndarray of shape (n_samples_Y, n_features)
694 The input array whose labels are provided through the `labels`
695 parameter.
696 radius : float
697 The radius defining the neighborhood.
698 weights : ndarray
699 The weights applied over the `labels` of `Y` when computing the
700 weighted mode of the labels.
701 labels : ndarray
702 An array containing the index of the class membership of the
703 associated samples in `Y`. This is used in labeling `X`.
704 unique_classes : ndarray
705 An array containing all unique class labels.
706 outlier_label : int, default=None
707 Label for outlier samples (samples with no neighbors in given
708 radius).
709 metric : str, default='euclidean'
710 The distance metric to use. For a list of available metrics, see
711 the documentation of :class:`~sklearn.metrics.DistanceMetric`.
712 Currently does not support `'precomputed'`.
713 chunk_size : int, default=None,
714 The number of vectors per chunk. If None (default) looks-up in
715 scikit-learn configuration for `pairwise_dist_chunk_size`,
716 and use 256 if it is not set.
717 metric_kwargs : dict, default=None
718 Keyword arguments to pass to specified metric function.
719 strategy : str, {'auto', 'parallel_on_X', 'parallel_on_Y'}, default=None
720 The chunking strategy defining which dataset parallelization are made on.
721 For both strategies the computations happens with two nested loops,
722 respectively on chunks of X and chunks of Y.
723 Strategies differs on which loop (outer or inner) is made to run
724 in parallel with the Cython `prange` construct:
725 - 'parallel_on_X' dispatches chunks of X uniformly on threads.
726 Each thread then iterates on all the chunks of Y. This strategy is
727 embarrassingly parallel and comes with no datastructures
728 synchronisation.
729 - 'parallel_on_Y' dispatches chunks of Y uniformly on threads.
730 Each thread processes all the chunks of X in turn. This strategy is
731 a sequence of embarrassingly parallel subtasks (the inner loop on Y
732 chunks) with intermediate datastructures synchronisation at each
733 iteration of the sequential outer loop on X chunks.
734 - 'auto' relies on a simple heuristic to choose between
735 'parallel_on_X' and 'parallel_on_Y': when `X.shape[0]` is large enough,
736 'parallel_on_X' is usually the most efficient strategy.
737 When `X.shape[0]` is small but `Y.shape[0]` is large, 'parallel_on_Y'
738 brings more opportunity for parallelism and is therefore more efficient
739 despite the synchronization step at each iteration of the outer loop
740 on chunks of `X`.
741 - None (default) looks-up in scikit-learn configuration for
742 `pairwise_dist_parallel_strategy`, and use 'auto' if it is not set.
743 Returns
744 -------
745 probabilities : ndarray of shape (n_samples_X, n_classes)
746 An array containing the class probabilities for each sample.
747 """
748 127.344 MiB 0.000 MiB 1 if weights not in {"uniform", "distance"}:
749 raise ValueError(
750 "Only the 'uniform' or 'distance' weights options are supported"
751 f" at this time. Got: {weights=}."
752 )
753 127.344 MiB 0.000 MiB 1 if X.dtype == Y.dtype == np.float64:
754 901.719 MiB 774.375 MiB 2 return RadiusNeighborsClassMode64.compute(
755 127.344 MiB 0.000 MiB 1 X=X,
756 127.344 MiB 0.000 MiB 1 Y=Y,
757 127.344 MiB 0.000 MiB 1 radius=radius,
758 127.344 MiB 0.000 MiB 1 weights=weights,
759 127.344 MiB 0.000 MiB 1 class_membership=np.array(labels, dtype=np.intp),
760 127.344 MiB 0.000 MiB 1 unique_labels=np.array(unique_labels, dtype=np.intp),
761 127.344 MiB 0.000 MiB 1 outlier_label=outlier_label,
762 127.344 MiB 0.000 MiB 1 metric=metric,
763 127.344 MiB 0.000 MiB 1 chunk_size=chunk_size,
764 127.344 MiB 0.000 MiB 1 metric_kwargs=metric_kwargs,
765 127.344 MiB 0.000 MiB 1 strategy=strategy,
766 )
767
768 if X.dtype == Y.dtype == np.float32:
769 return RadiusNeighborsClassMode32.compute(
770 X=X,
771 Y=Y,
772 radius=radius,
773 weights=weights,
774 class_membership=np.array(labels, dtype=np.intp),
775 unique_labels=np.array(unique_labels, dtype=np.intp),
776 outlier_label=outlier_label,
777 metric=metric,
778 chunk_size=chunk_size,
779 metric_kwargs=metric_kwargs,
780 strategy=strategy,
781 )
782
783 raise ValueError(
784 "Only float64 or float32 datasets pairs are supported at this time, "
785 f"got: X.dtype={X.dtype} and Y.dtype={Y.dtype}."
786 )
Filename: /Users/omar.salman/Projects/scikit-learn/sklearn/neighbors/_classification.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
697 127.328 MiB 127.328 MiB 1 @profile
698 def predict_proba(self, X):
699 """Return probability estimates for the test data X.
700
701 Parameters
702 ----------
703 X : {array-like, sparse matrix} of shape (n_queries, n_features), \
704 or (n_queries, n_indexed) if metric == 'precomputed'
705 Test samples.
706
707 Returns
708 -------
709 p : ndarray of shape (n_queries, n_classes), or a list of \
710 n_outputs of such arrays if n_outputs > 1.
711 The class probabilities of the input samples. Classes are ordered
712 by lexicographic order.
713 """
714 127.328 MiB 0.000 MiB 1 check_is_fitted(self, "_fit_method")
715 127.328 MiB 0.000 MiB 1 n_queries = _num_samples(X)
716
717 127.328 MiB 0.000 MiB 2 metric, metric_kwargs = _adjusted_metric(
718 127.328 MiB 0.000 MiB 1 metric=self.metric, metric_kwargs=self.metric_params, p=self.p
719 )
720
721 if (
722 127.328 MiB 0.000 MiB 1 self.weights == "uniform"
723 127.344 MiB 0.000 MiB 3 and self._fit_method == "brute"
724 127.328 MiB 0.000 MiB 1 and not self.outputs_2d_
725 127.344 MiB 0.016 MiB 1 and RadiusNeighborsClassMode.is_usable_for(X, self._fit_X, metric)
726 ):
727 901.719 MiB 901.719 MiB 2 probabilities = RadiusNeighborsClassMode.compute(
728 127.344 MiB 0.000 MiB 1 X=X,
729 127.344 MiB 0.000 MiB 1 Y=self._fit_X,
730 127.344 MiB 0.000 MiB 1 radius=self.radius,
731 127.344 MiB 0.000 MiB 1 weights=self.weights,
732 127.344 MiB 0.000 MiB 1 labels=self._y,
733 127.344 MiB 0.000 MiB 1 unique_labels=self.classes_,
734 127.344 MiB 0.000 MiB 1 outlier_label=self.outlier_label,
735 127.344 MiB 0.000 MiB 1 metric=metric,
736 127.344 MiB 0.000 MiB 1 metric_kwargs=metric_kwargs,
737 127.344 MiB 0.000 MiB 1 strategy="parallel_on_X",
738 )
739 else:
740 neigh_dist, neigh_ind = self.radius_neighbors(X)
741 outlier_mask = np.zeros(n_queries, dtype=bool)
742 outlier_mask[:] = [len(nind) == 0 for nind in neigh_ind]
743 outliers = np.flatnonzero(outlier_mask)
744 inliers = np.flatnonzero(~outlier_mask)
745
746 classes_ = self.classes_
747 _y = self._y
748 if not self.outputs_2d_:
749 _y = self._y.reshape((-1, 1))
750 classes_ = [self.classes_]
751
752 if self.outlier_label_ is None and outliers.size > 0:
753 raise ValueError(
754 "No neighbors found for test samples %r, "
755 "you can try using larger radius, "
756 "giving a label for outliers, "
757 "or considering removing them from your dataset." % outliers
758 )
759
760 weights = _get_weights(neigh_dist, self.weights)
761 if weights is not None:
762 weights = weights[inliers]
763
764 probabilities = []
765 # iterate over multi-output, measure probabilities of the k-th output.
766 for k, classes_k in enumerate(classes_):
767 pred_labels = np.zeros(len(neigh_ind), dtype=object)
768 pred_labels[:] = [_y[ind, k] for ind in neigh_ind]
769
770 proba_k = np.zeros((n_queries, classes_k.size))
771 proba_inl = np.zeros((len(inliers), classes_k.size))
772
773 # samples have different size of neighbors within the same radius
774 if weights is None:
775 for i, idx in enumerate(pred_labels[inliers]):
776 proba_inl[i, :] = np.bincount(idx, minlength=classes_k.size)
777 else:
778 for i, idx in enumerate(pred_labels[inliers]):
779 proba_inl[i, :] = np.bincount(
780 idx, weights[i], minlength=classes_k.size
781 )
782 proba_k[inliers, :] = proba_inl
783
784 if outliers.size > 0:
785 _outlier_label = self.outlier_label_[k]
786 label_index = np.flatnonzero(classes_k == _outlier_label)
787 if label_index.size == 1:
788 proba_k[outliers, label_index[0]] = 1.0
789 else:
790 warnings.warn(
791 "Outlier label {} is not in training "
792 "classes. All class probabilities of "
793 "outliers will be assigned with 0."
794 "".format(self.outlier_label_[k])
795 )
796
797 # normalize 'votes' into real [0,1] probabilities
798 normalizer = proba_k.sum(axis=1)[:, np.newaxis]
799 normalizer[normalizer == 0.0] = 1.0
800 proba_k /= normalizer
801
802 probabilities.append(proba_k)
803
804 if not self.outputs_2d_:
805 probabilities = probabilities[0]
806
   807   901.719 MiB    0.000 MiB           1           return probabilities
```
Thanks for the report. I do not know if the memory is being released after returning. I think I should have been clearer: I should have asked for plots from a memory profiler such as … I have low bandwidth to guide you at the moment, but I will try to provide more precise suggestions than the previous ones.
No problem! Thanks for the guidance.
Yup! Just to ensure that the memory use isn't significantly greater, so as not to introduce a regression.
Nice! We actually have some good savings on memory overhead. I still need to make a pass through the code, but I will do so soon. Thank you for your work @OmarManzoor.
Thanks for the work done so far! Sorry for the delay in review. Let me know if you have any questions/concerns.
```
and not issparse(X)
and not issparse(Y)
# TODO: implement Euclidean specialization with GEMM.
and metric not in ("euclidean", "sqeuclidean")
```
For consistency, we should probably specify this in `cls.valid_metrics` instead.
```
outlier_label : int, default=None
    Label for outlier samples (samples with no neighbors in given
    radius).
```
May we add to the description the default behavior when `outlier_label=None`?
```
if outlier_label is not None:
    self.outlier_label_exists = True
    for idx in range(self.unique_labels.shape[0]):
        if self.unique_labels[idx] == outlier_label:
            self.outlier_index = idx
```
Is it not possible to use an `outlier_label` value that is not included in the list of unique labels? Mainly, the unique labels are inferred from the provided labels at the call site, so if a user wanted to mark all outliers with an out-of-distribution label (e.g. `-1`) then they'd need to know a priori to add it to `unique_labels`.
Either we allow them to provide it independently of `unique_labels`, or clearly document that it must be present in `unique_labels`. If we want to keep identical semantics to `RadiusNeighborsClassifier`, we should do the latter.
From a practitioner perspective, I think this is less than ideal. I imagine, in practice, it is helpful to be able to do this ad hoc rather than a priori. That is to say, if you have `unique_labels=[1, 2, 3]` then allow `outlier_label=5` without issue. Even so, that would be a different PR I suppose, since we don't intend for this to really be publicly used...
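For context, a minimal sketch of the public-estimator semantics under discussion (values are illustrative; the warning behavior matches the fallback branch shown in the profiler output above):

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

X = np.array([[0.0], [0.1], [0.2]])
y = np.array([1, 2, 3])

# An outlier_label absent from the training classes is accepted at fit time...
clf = RadiusNeighborsClassifier(radius=0.5, outlier_label=5).fit(X, y)

# ...but an outlier query then gets all-zero probabilities, with a warning
# that the outlier label is not among the training classes.
print(clf.predict_proba(np.array([[10.0]])))  # [[0. 0. 0.]]
```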
```
outlier_label=self.outlier_label,
metric=metric,
metric_kwargs=metric_kwargs,
strategy="parallel_on_X",
```
Are there benchmarks to support a fixed dispatch on `X` rather than an arbitrary heuristic?
I think this is based on the current call to `ArgKminClassMode`:

scikit-learn/sklearn/neighbors/_classification.py, lines 327 to 345 in 59048f9:

```python
probabilities = ArgKminClassMode.compute(
    X,
    self._fit_X,
    k=self.n_neighbors,
    weights=self.weights,
    Y_labels=self._y,
    unique_Y_labels=self.classes_,
    metric=metric,
    metric_kwargs=metric_kwargs,
    # `strategy="parallel_on_X"` has in practice be shown
    # to be more efficient than `strategy="parallel_on_Y``
    # on many combination of datasets.
    # Hence, we choose to enforce it here.
    # For more information, see:
    # https://github.com/scikit-learn/scikit-learn/pull/24076#issuecomment-1445258342  # noqa
    # TODO: adapt the heuristic for `strategy="auto"` for
    # `ArgKminClassMode` and use `strategy="auto"`.
    strategy="parallel_on_X",
)
```
Benchmarks you previously made are reported in #24076 (comment).
I think it is worth replicating the comment here as well.
Yes, I did not run any specific benchmarks for this, but it does seem that `parallel_on_X` generally runs better.
Here is a first review of the code, completing @Micky774's. Looking back, I think a few adaptations could be made for `ArgKminClassMode`. Also, we might be able to factor out some logic, but we had better do it after this PR.
```
class_membership=np.array(labels, dtype=np.intp),
unique_labels=np.array(unique_labels, dtype=np.intp),
```
@Micky774: This is taken from `ArgKminClassMode`. Do you remember why the conversions to NumPy arrays are needed here? Is it to support Python lists?
If so, could we perform the conversion on dedicated lines and use `np.asarray` instead?
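For illustration, a small standalone sketch of the difference (not scikit-learn code): `np.asarray` skips the copy when the input is already an ndarray of the requested dtype, whereas `np.array` copies by default.

```python
import numpy as np

labels = np.arange(5, dtype=np.intp)

copied = np.array(labels, dtype=np.intp)     # always allocates a new array
aliased = np.asarray(labels, dtype=np.intp)  # no copy: dtype already matches

print(copied is labels)   # False
print(aliased is labels)  # True

# A plain Python list is converted either way.
print(np.asarray([0, 1, 2], dtype=np.intp))
```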
```
class_membership=np.array(labels, dtype=np.intp),
unique_labels=np.array(unique_labels, dtype=np.intp),
```
Same question and comment as the previous one.
```
RadiusNeighborsClassMode is typically used to perform bruteforce
radius neighbors queries when the weighted mode of the labels for
the nearest neighbors within the specified radius are required,
such as in `predict` methods.
```
Not related to this PR: I think such parts of the doc-string of `{ArgKmin,RadiusNeighbors}ClassMode` can be made clearer in another pull request.
```
)


cdef inline void weighted_histogram_mode(
```
Side-note: `{ArgKmin,RadiusNeighbors}.weighted_histogram_mode` are notably similar. Potentially this and the `enum` might be factorizable elements for those implementations. I would study this in another PR.
Here is another review.
Looking back, I am a bit surprised by the ASV result reported in #26828 (comment): the performance of those implementations approaches the one on `main` when datasets get taller (i.e. when there are more samples).
I would rerun benchmarks on more configurations and also on a machine with more cores than laptops' to better understand the dynamics.
```
weights=weights,
Y_labels=Y_labels,
unique_Y_labels=unique_Y_labels,
outlier_label=None,
```
Can we parametrize this test to also test cases where `outlier_label` is not `None`?
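A hedged sketch of such a parametrization (keyword names follow the `compute` signature shown in the profiler output above; the actual test helper may differ):

```python
import numpy as np
import pytest

from sklearn.metrics._pairwise_distances_reduction import RadiusNeighborsClassMode


@pytest.mark.parametrize("outlier_label", [None, 0])
def test_radius_neighbors_classmode_outlier_label(outlier_label):
    rng = np.random.RandomState(1)
    X = rng.rand(20, 4)
    Y = rng.rand(50, 4)
    labels = rng.randint(0, 2, size=50)

    probabilities = RadiusNeighborsClassMode.compute(
        X=X,
        Y=Y,
        radius=5.0,  # larger than any Manhattan distance here: no outliers
        weights="uniform",
        labels=labels,
        unique_labels=np.unique(labels),
        outlier_label=outlier_label,
        metric="manhattan",
        strategy="parallel_on_X",
    )
    # With no outliers, every row is a proper probability distribution and
    # the result should not depend on outlier_label.
    assert np.allclose(probabilities.sum(axis=1), 1.0)
```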
```
else:
    self.weight_type = WeightingStrategy.callable
```
Not related to this PR: given the presence of the GIL, I do not think we should support Python callables in Cython, because this won't be efficient (callables are Python objects which need to be accessed while holding the GIL, which means no effectively parallelized execution).
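As an aside, a hedged sketch of the consequence at the estimator level (sizes and radius are illustrative): string weights can dispatch to the new Cython backend, while a Python callable keeps the pure-Python path, since the fast branch in the profiled `predict_proba` above requires `weights == "uniform"`.

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(500, 10)
y = rng.randint(0, 2, size=500)

# Eligible for the new backend: string weights, brute force, dense data,
# and a non-(sq)euclidean metric.
fast = RadiusNeighborsClassifier(
    radius=4.0, algorithm="brute", metric="manhattan", weights="uniform"
).fit(X, y)

# A callable must be evaluated while holding the GIL, so this configuration
# falls back to the pure-Python implementation.
slow = RadiusNeighborsClassifier(
    radius=4.0, algorithm="brute", metric="manhattan",
    weights=lambda dist: 1.0 / (1.0 + dist),
).fit(X, y)

fast.predict_proba(X[:5])
slow.predict_proba(X[:5])
```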
I hardly have any bandwidth for scikit-learn right now, but I might in the future.
Hi @OmarManzoor, I reran your ASV benchmark; note that I have adapted …
I am currently rerunning them without the …
Thank you for running the benchmarks @jjerphan
I have adapted the benchmark script to test on the sparse-dense case as well.
The performance improvements are compelling to me, reaching 3× speed-ups over the current implementation.
There's only one regression out of all the runs; might it be alleviated by #27145?
The implementation looks good to me; from a maintenance perspective it adds a lot of lines, but I think this is acceptable.
Yet I would wait for one or two more approvals, to get other points of view on maintenance and on the benchmarked cases, before merging this PR.
asv benchmark script:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import rand as sparse_rand

from .common import Benchmark
from sklearn.neighbors import RadiusNeighborsClassifier


class BruteRadiusNeighborsClass(Benchmark):
    param_names = [
        "n_train",
        "n_test",
        "n_features",
        "metric",
        "dtype",
        "X_train",
        "X_test",
    ]
    params = [
        [1000, 10_000],
        [1000, 10_000],
        [10, 50, 100],
        ["manhattan"],
        [np.float64],
        ["dense", "sparse"],
        ["dense"],
    ]

    def setup(self, *params):
        n_train, n_test, n_features, metric, dtype, X_train, X_test = params
        rng = np.random.RandomState(0)
        self.X_train = (
            rng.rand(n_train, n_features).astype(dtype)
            if X_train == "dense"
            else sparse_rand(
                n_train,
                n_features,
                density=0.05,
                format="csr",
                dtype=dtype,
                random_state=rng,
            )
        )
        self.X_test = (
            rng.rand(n_test, n_features).astype(dtype)
            if X_test == "dense"
            else sparse_rand(
                n_test,
                n_features,
                density=0.05,
                format="csr",
                dtype=dtype,
                random_state=rng,
            )
        )
        self.y_train = rng.randint(low=-1, high=1, size=(n_train,))
        self.metric = metric
        dist_mat = cdist(
            (self.X_train if X_train == "dense" else self.X_train.toarray())[:1000],
            (self.X_test if X_test == "dense" else self.X_test.toarray())[:10],
        )
        self.radius = np.quantile(a=dist_mat.ravel(), q=0.05) * 20
        self.rc = RadiusNeighborsClassifier(
            radius=self.radius,
            algorithm="brute",
            metric=self.metric,
        )
        self.rc.fit(X=self.X_train, y=self.y_train)

    def time_predict_proba(self, *params):
        self.rc.predict_proba(self.X_test)
```
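For reference, the script above can be compared between branches with the same ASV invocation quoted earlier in the thread:

```
asv continuous --quick -b BruteRadiusNeighborsClass upstream/main HEAD
```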
asv benchmark results:
| Change | Before [c634b8ab] <main> | After [a9733250] <radius_neighbors_predict_pwdb> | Ratio | Benchmark (Parameter) |
|----------|----------------------------|----------------------------------------------------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| + | 379±5ms | 466±4ms | 1.23 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 1000, 100, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 201±2ms | 180±3ms | 0.9 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 1000, 50, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 378±6ms | 311±2ms | 0.82 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 287±4ms | 223±3ms | 0.78 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 141±3ms | 111±2ms | 0.78 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 1000, 10, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 60.4±0.8ms | 45.3±0.2ms | 0.75 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 100, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 4.19±0.1s | 3.13±0.01s | 0.75 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 238±10ms | 168±4ms | 0.71 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 50, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 129±0.5ms | 90.0±3ms | 0.7 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 1000, 10, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 36.1±0.7ms | 24.8±0.6ms | 0.69 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 50, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 48.4±0.4ms | 32.5±0.3ms | 0.67 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 199±3ms | 134±3ms | 0.67 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 50, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 3.28±0.01s | 2.11±0.01s | 0.64 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 100, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 2.64±0.01s | 1.62±0s | 0.61 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 50, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 29.7±0.9ms | 17.5±0.09ms | 0.59 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 50, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 134±0.6ms | 78.9±4ms | 0.59 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 10, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 2.28±0.01s | 1.22±0s | 0.53 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 50, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 20.2±0.8ms | 10.1±0.09ms | 0.5 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 10, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 124±0.7ms | 48.6±1ms | 0.39 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 10000, 10, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 1.64±0.02s | 644±5ms | 0.39 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 10, 'manhattan', <class 'numpy.float64'>, 'sparse', 'dense') |
| - | 1.52±0.01s | 548±8ms | 0.36 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(10000, 10000, 10, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
| - | 23.2±3ms | 7.76±0.07ms | 0.33 | radius_neighbors_classifier_predict_proba_asv.BruteRadiusNeighborsClass.time_predict_proba(1000, 1000, 10, 'manhattan', <class 'numpy.float64'>, 'dense', 'dense') |
Looks good, thanks @OmarManzoor! I left a couple comments, one nit and one recommendation for enforcing contiguous memory. Aside from that, +1 from me
```
const intp_t[:] Y_labels
const intp_t[:] unique_Y_labels
intp_t outlier_label_index
bint outlier_label_exists
bint outliers_exist
unsigned char[::1] outliers
object outlier_label
float64_t[:, :] class_scores
```
I think we can expect/enforce contiguous memory here.
Suggested change:

```
const intp_t[::1] Y_labels
const intp_t[::1] unique_Y_labels
intp_t outlier_label_index
bint outlier_label_exists
bint outliers_exist
unsigned char[::1] outliers
object outlier_label
float64_t[:, ::1] class_scores
```
```
self.outlier_label_index = -1
self.outliers_exist = False
self.outlier_label = outlier_label
self.outlier_label_exists = outlier_label is not None
```
NIT: I think we can replace `self.outlier_label_exists` with its literal `self.outlier_label is not None`. IMO it is actually a bit more readable, but this is personal preference, so feel free to dismiss this suggestion.
Thank you very much for the work @OmarManzoor, and thank you @jjerphan for keeping up with review (and helping remind me). Congrats!
…ors.predict_proba` (scikit-learn#26828) Co-authored-by: Omar Salman <omar.salman@arbisoft> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Reference Issues/PRs
Towards: #25888
What does this implement/fix? Explain your changes.
Up to 3× speed-ups, see: #26828 (review)
Any other comments?
cc: @jjerphan