ENH: RadiusNeighborRegressor speedup #24053
Conversation
Hi @JoOkuma.
Thank you for this contribution. I appreciate the reported performance improvements; unfortunately, relying on this trick using a CSR matrix might in fact be detrimental in some configurations.
Also, it is more appropriate for the long term to introduce a specialised back-end for RadiusNeighborRegressor.predict, improving performance for all configurations.
See: #23721 (comment) and #22587.
```python
algorithm, dimension, n_jobs = params

estimator = RadiusNeighborsRegressor(
    algorithm=algorithm, n_jobs=n_jobs, radius=0.05
)
```
Due to the curse of dimensionality, the radius is notably small for dim_to_number >= 20. Is it possible to scale it by log(dim_to_number), or to define it as the 10th percentile of the observed distances?
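For concreteness, here is one way the percentile idea could be implemented. This is a sketch with assumed names (the helper radius_from_percentile is hypothetical, not code from this PR): estimate the radius from the pairwise distances of a subsample.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def radius_from_percentile(X, q=10, subsample=1000, seed=0):
    """Hypothetical helper: pick a radius from the empirical distances."""
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X), size=min(subsample, len(X)), replace=False)
    D = pairwise_distances(X[idx])
    # Keep only the strict upper triangle to ignore zero self-distances.
    return np.percentile(D[np.triu_indices_from(D, k=1)], q)
```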
```python
n_features = self.dim_to_number[dimension]

data = _synth_regression_dataset(n_samples=10000, n_features=n_features)
```
A similar improvement has been tried in #23721, but it suffered from memory overconsumption as n_samples got larger (see #23721 (review)). Is it possible to report ASV results for n_samples=int(1e6) or even n_samples=int(1e7)?
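For reference, an ASV-style benchmark covering the requested sizes could look like the sketch below. The class name, inline data generation, and test-set shape are assumptions rather than this PR's benchmark file; only the radius mirrors the diff above.

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor

class RadiusNeighborsPredictLarge:
    # ASV parametrises every benchmark method over these values.
    params = [int(1e6), int(1e7)]
    param_names = ["n_samples"]

    def setup(self, n_samples):
        rng = np.random.RandomState(0)
        X = rng.uniform(size=(n_samples, 2))
        y = rng.uniform(size=n_samples)
        self.X_test = rng.uniform(size=(1000, 2))
        self.estimator = RadiusNeighborsRegressor(radius=0.05).fit(X, y)

    def time_predict(self, n_samples):
        self.estimator.predict(self.X_test)

    def peakmem_predict(self, n_samples):
        # Peak memory is the quantity that regressed in #23721.
        self.estimator.predict(self.X_test)
```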
```python
]
)
neigh_dst = np.concatenate(neigh_ind, axis=0)
neigh_src = np.repeat(np.arange(n_samples), repeats=n_neigh)
```
This matrix might have a significant memory footprint.
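A rough back-of-envelope illustration of that concern, with assumed sizes: the index and value arrays grow with the total number of stored neighbors, not with n_samples alone, so dense neighborhoods inflate memory quickly.

```python
# Assumed sizes for illustration only.
n_samples = int(1e6)
avg_neighbors = 200                      # hypothetical mean neighborhood size
nnz = n_samples * avg_neighbors          # total stored entries
bytes_per_entry = 8 + 8 + 8              # int64 row + int64 col + float64 value
print(f"{nnz * bytes_per_entry / 1e9:.1f} GB")  # 4.8 GB at these sizes
```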
Hi @JoOkuma, do you still have time to pursue this work?
Hi, not right now. I might give it a shot in a few weeks.
What does this implement/fix? Explain your changes.
When working with low-dimensional data, the neighborhood queries are very fast and most of the time is spent iterating through the neighborhoods to compute the predictions. Since the neighborhoods might not all have the same length, this uses a different mechanism than KNeighborsRegressor. This PR converts the loop operations into sparse matrix multiplications, providing a 10x speedup on low-dimensional datasets without a performance decrease in higher dimensions (where most of the time is spent on the neighbor query).
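For readers following along, here is a minimal, self-contained sketch of the idea with assumed variable names (not the exact PR code): encode the variable-length neighborhoods as a row-normalised CSR matrix, so the uniform-weight prediction becomes a single sparse matrix-vector product.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_train = rng.uniform(size=(1000, 2))
y_train = rng.uniform(size=1000)
X_test = rng.uniform(size=(200, 2))

nn = NearestNeighbors(radius=0.05).fit(X_train)
neigh_ind = nn.radius_neighbors(X_test, return_distance=False)

n_queries = len(neigh_ind)
n_neigh = np.fromiter((len(ind) for ind in neigh_ind), dtype=np.intp,
                      count=n_queries)

# COO-style triplets: row i holds query i's neighbors with uniform weights.
rows = np.repeat(np.arange(n_queries), repeats=n_neigh)
cols = np.concatenate(list(neigh_ind))
vals = np.repeat(1.0 / np.maximum(n_neigh, 1), repeats=n_neigh)

A = csr_matrix((vals, (rows, cols)), shape=(n_queries, X_train.shape[0]))
y_pred = A @ y_train  # one sparse matmul replaces the per-query loop

# Loop-based reference for comparison (empty neighborhoods predict 0 here).
y_ref = np.array([y_train[ind].mean() if len(ind) else 0.0
                  for ind in neigh_ind])
assert np.allclose(y_pred, y_ref)
```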
Relevant results from ASV:
Any other comments?
After these changes, profiling indicates that more than half of the time (for low-dimensional datasets) is spent in the _get_weights function. I was able to get a speedup on it, but the behavior changed slightly when a query point is identical to a point in the training set. Hence, I'm not pushing this change, but additional suggestions are welcome.
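For context on why identical points are delicate, below is a sketch of the weights="distance" semantics, simplified to a single neighborhood (not scikit-learn's exact _get_weights code): a zero distance produces an infinite weight, which is resolved by giving exact matches all the weight, a branch that a vectorised rewrite can easily alter.

```python
import numpy as np

dist = np.array([0.0, 0.5, 1.0])        # first neighbor coincides with the query
with np.errstate(divide="ignore"):
    w = 1.0 / dist                      # -> [inf, 2.0, 1.0]
inf_mask = np.isinf(w)
if inf_mask.any():
    # Exact matches absorb all the weight; the other neighbors get zero.
    w = inf_mask.astype(float)
print(w)                                # [1.0, 0.0, 0.0]
```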