
EHN: RadiusNeighborRegressor speedup #24053

Open
wants to merge 3 commits into main

Conversation

@JoOkuma commented Jul 29, 2022

What does this implement/fix? Explain your changes.

When working with low-dimensional data, the neighborhood queries are very fast, and most of the time is spent iterating through the neighborhoods to compute the predictions. Since the neighborhoods might not all have the same length, RadiusNeighborsRegressor employs a different mechanism than KNeighborsRegressor.

This PR converts the loop operations into sparse matrix multiplications, providing a 10x speedup on low-dimensional datasets with no performance decrease in higher dimensions (where most of the time is spent on the neighbor query itself).
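The idea can be sketched as follows (a minimal reconstruction using `scipy.sparse`, not the PR's exact code; uniform weights are assumed): the variable-length neighborhoods are packed into one CSR matrix whose rows hold each query's averaging weights, so the per-query loop becomes a single matrix-vector product.

```python
import numpy as np
from scipy.sparse import csr_matrix

def radius_predict_sparse(neigh_ind, y_train):
    """Predict by averaging training targets over variable-length neighborhoods.

    neigh_ind: list of int arrays, the neighbor indices of each query point.
    y_train:   1-D array of training targets.
    """
    n_queries = len(neigh_ind)
    n_neigh = np.array([len(ind) for ind in neigh_ind])
    cols = np.concatenate(neigh_ind)
    rows = np.repeat(np.arange(n_queries), repeats=n_neigh)
    # Each row carries uniform weights 1 / n_neighbors_i; np.maximum guards
    # against division by zero for empty neighborhoods (which then predict 0
    # here -- scikit-learn itself warns and emits NaN in that case).
    weights = np.repeat(1.0 / np.maximum(n_neigh, 1), repeats=n_neigh)
    A = csr_matrix((weights, (rows, cols)), shape=(n_queries, len(y_train)))
    return A @ y_train
```

For example, with `y_train = [1., 2., 3., 4.]` and neighborhoods `[0, 1]` and `[2]`, the product yields `[1.5, 3.0]` without any Python-level loop over queries.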

Relevant results from ASV:

THIS BRANCH:

[16.67%] ··· neighbors.RadiusNeighborsRegressorBenchmark.peakmem_predict                                                                             ok
[16.67%] ··· =========== ============== ========= ==========
             --                   dimension / n_jobs        
             ----------- -----------------------------------
              algorithm   very-low / 4   low / 4   high / 4 
             =========== ============== ========= ==========
                brute         788M         777M      686M   
               kd_tree       87.9M        91.7M      128M   
              ball_tree      87.4M        91.4M      127M   
             =========== ============== ========= ==========

[33.33%] ··· neighbors.RadiusNeighborsRegressorBenchmark.time_predict                                                                                ok
[33.33%] ··· =========== ============== ========== ============
             --                    dimension / n_jobs          
             ----------- --------------------------------------
              algorithm   very-low / 4   low / 4     high / 4  
             =========== ============== ========== ============
                brute       825±30ms     847±30ms    920±30ms  
               kd_tree      15.1±3ms     36.3±6ms    211±50ms  
              ball_tree     63.7±8ms     785±60ms   4.31±0.03s 
             =========== ============== ========== ============

MAIN BRANCH:

[66.67%] ··· neighbors.RadiusNeighborsRegressorBenchmark.peakmem_predict                                                                             ok
[66.67%] ··· =========== ============== ========= ==========
             --                   dimension / n_jobs        
             ----------- -----------------------------------
              algorithm   very-low / 4   low / 4   high / 4 
             =========== ============== ========= ==========
                brute         817M         729M      641M   
               kd_tree       88.1M         92M       128M   
              ball_tree      87.9M        91.7M      127M   
             =========== ============== ========= ==========
             
[83.33%] ··· neighbors.RadiusNeighborsRegressorBenchmark.time_predict                                                                                ok
[83.33%] ··· =========== ============== ========== ============
             --                    dimension / n_jobs          
             ----------- --------------------------------------
              algorithm   very-low / 4   low / 4     high / 4  
             =========== ============== ========== ============
                brute       906±40ms     939±30ms   1.02±0.04s 
               kd_tree      116±20ms     130±20ms    320±60ms  
              ball_tree     166±20ms     890±40ms   4.31±0.04s 
             =========== ============== ========== ============

Any other comments?

After these changes, profiling indicates that more than half of the time (for low-dimensional datasets) is spent in the _get_weights function. I was able to get a speedup on it, but the behavior changed slightly when a query point is identical to one in the training set. Hence, I'm not pushing this change, but additional suggestions are welcome.
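For context, the tricky case is the `weights="distance"` rule: a query point identical to a training point yields a zero distance and thus an infinite inverse weight, which scikit-learn resolves by giving exact matches all the weight and zeroing out the rest. A sketch of that rule (an illustration of the behavior, not the actual `_get_weights` source):

```python
import numpy as np

def distance_weights(dist):
    """Inverse-distance weights, with exact matches taking all the weight.

    dist: 2-D array, one row of neighbor distances per query point.
    """
    with np.errstate(divide="ignore"):
        w = 1.0 / dist  # zero distances become inf here
    exact = np.isinf(w)
    # For rows containing an exact match, only the exact matches count
    # (weight 1), and every other neighbor gets weight 0.
    rows_with_exact = exact.any(axis=1)
    w[rows_with_exact] = exact[rows_with_exact].astype(float)
    return w
```

A vectorized rewrite that skips this special-casing is faster but changes predictions exactly when a query coincides with a training point, which matches the behavior change described above.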

@JoOkuma JoOkuma changed the title EHN: RadiusNeaighborRegressor speedup EHN: RadiusNeighborRegressor speedup Jul 29, 2022
@jjerphan (Member) left a comment:

Hi @JoOkuma.

Thank you for this contribution. I appreciate the reported performance improvements; unfortunately, relying on this trick using a CSR matrix might in fact be detrimental in some configurations.

Also, it is more appropriate for the long term to introduce a specialised back-end for RadiusNeighborRegressor.predict, improving performance for all configurations.

See: #23721 (comment) and #22587.

algorithm, dimension, n_jobs = params

estimator = RadiusNeighborsRegressor(
algorithm=algorithm, n_jobs=n_jobs, radius=0.05
Member:

Due to the curse of dimensionality, the radius is notably small for dim_to_number >= 20. Is it possible to scale it by log(dim_to_number) or to define it as the 10th percentile of the observed distances?
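The reviewer's second suggestion could look like the following hypothetical helper (the name and defaults are illustrative, not benchmark code from this PR): pick the radius as a low percentile of pairwise distances observed on a random subsample, so neighborhoods stay non-degenerate as the dimension grows.

```python
import numpy as np

def radius_from_percentile(X, percentile=10, sample_size=1000, seed=0):
    """Set the query radius to a low percentile of observed pairwise distances.

    Distances are measured on a random subsample of X to keep the
    O(sample_size**2) cost bounded. Names and defaults are illustrative.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    S = X[idx]
    # Full pairwise Euclidean distance matrix of the subsample.
    d = np.sqrt(((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1))
    d = d[np.triu_indices_from(d, k=1)]  # unique off-diagonal pairs only
    return float(np.percentile(d, percentile))
```

Unlike a fixed `radius=0.05`, this adapts to the distance scale of each dimensionality, so the benchmark exercises comparably sized neighborhoods across the `dim_to_number` settings.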


n_features = self.dim_to_number[dimension]

data = _synth_regression_dataset(n_samples=10000, n_features=n_features)
Member:

A similar improvement has been tried in #23721, but suffered from memory overconsumption as n_samples got larger (see #23721 (review)).

Is it possible to report ASV results for n_samples=int(1e6) or even n_samples=int(1e7)?

]
)
neigh_dst = np.concatenate(neigh_ind, axis=0)
neigh_src = np.repeat(np.arange(n_samples), repeats=n_neigh)
Member:

This matrix might have a significant memory footprint.
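The concern can be quantified with a back-of-envelope estimate, assuming scipy's defaults of float64 data and int32 indices: a CSR matrix stores one value and one column index per stored neighbor, plus one row offset per query.

```python
def csr_mem_bytes(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Rough CSR memory footprint: data + indices + indptr.

    Assumes float64 values (8 bytes) and int32 indices (4 bytes),
    scipy's defaults for matrices small enough for 32-bit indexing.
    """
    data = nnz * value_bytes            # one value per stored entry
    indices = nnz * index_bytes         # one column index per stored entry
    indptr = (n_rows + 1) * index_bytes  # one offset per row, plus one
    return data + indices + indptr
```

For instance, at n_samples=int(1e7) with an average of 100 neighbors per query, nnz is 1e9 and the weight matrix alone takes roughly 12 GB, which is the regime where #23721 ran into trouble.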

@jjerphan (Member) commented:

Hi @JoOkuma, do you still have time to pursue this work?

@JoOkuma (Author) commented Jan 12, 2023

Hi, not right now. I might give it a shot in a few weeks.
Please feel free to close it if you'd like.

3 participants