
EHN: RadiusNeighborRegressor speedup #24053

Open
wants to merge 3 commits into main

Conversation

@JoOkuma commented Jul 29, 2022

What does this implement/fix? Explain your changes.

When working with low-dimensional data, the neighborhood queries are very fast, and most of the time is spent iterating through the neighborhoods to compute the predictions. Since the neighborhoods might not all have the same length, RadiusNeighborsRegressor employs a different mechanism than KNeighborsRegressor.

This PR converts the loop operations into sparse matrix multiplications, providing a 10x speedup on low-dimensional datasets with no performance decrease in higher dimensions (where most of the time is spent on the neighbor query itself).
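The idea can be sketched as follows (a minimal reconstruction using `scipy.sparse`, not the PR's exact code; uniform weights are assumed): the variable-length neighborhoods are packed into one CSR matrix whose rows hold each query's averaging weights, so the per-query loop becomes a single matrix-vector product.

```python
import numpy as np
from scipy.sparse import csr_matrix

def radius_predict_sparse(neigh_ind, y_train):
    """Predict by averaging training targets over variable-length neighborhoods.

    neigh_ind: list of int arrays, the neighbor indices of each query point.
    y_train:   1-D array of training targets.
    """
    n_queries = len(neigh_ind)
    n_neigh = np.array([len(ind) for ind in neigh_ind])
    cols = np.concatenate(neigh_ind)
    rows = np.repeat(np.arange(n_queries), repeats=n_neigh)
    # Each row carries uniform weights 1 / n_neighbors_i; np.maximum guards
    # against division by zero for empty neighborhoods (which then predict 0
    # here -- scikit-learn itself warns and emits NaN in that case).
    weights = np.repeat(1.0 / np.maximum(n_neigh, 1), repeats=n_neigh)
    A = csr_matrix((weights, (rows, cols)), shape=(n_queries, len(y_train)))
    return A @ y_train
```

For example, with `y_train = [1., 2., 3., 4.]` and neighborhoods `[0, 1]` and `[2]`, the product yields `[1.5, 3.0]` without any Python-level loop over queries.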

Relevant results from ASV:

THIS BRANCH:

[16.67%] ··· neighbors.RadiusNeighborsRegressorBenchmark.peakmem_predict                                                                             ok
[16.67%] ··· =========== ============== ========= ==========
             --                   dimension / n_jobs        
             ----------- -----------------------------------
              algorithm   very-low / 4   low / 4   high / 4 
             =========== ============== ========= ==========
                brute         788M         777M      686M   
               kd_tree       87.9M        91.7M      128M   
              ball_tree      87.4M        91.4M      127M   
             =========== ============== ========= ==========

[33.33%] ··· neighbors.RadiusNeighborsRegressorBenchmark.time_predict                                                                                ok
[33.33%] ··· =========== ============== ========== ============
             --                    dimension / n_jobs          
             ----------- --------------------------------------
              algorithm   very-low / 4   low / 4     high / 4  
             =========== ============== ========== ============
                brute       825±30ms     847±30ms    920±30ms  
               kd_tree      15.1±3ms     36.3±6ms    211±50ms  
              ball_tree     63.7±8ms     785±60ms   4.31±0.03s 
             =========== ============== ========== ============

MAIN BRANCH:

[66.67%] ··· neighbors.RadiusNeighborsRegressorBenchmark.peakmem_predict                                                                             ok
[66.67%] ··· =========== ============== ========= ==========
             --                   dimension / n_jobs        
             ----------- -----------------------------------
              algorithm   very-low / 4   low / 4   high / 4 
             =========== ============== ========= ==========
                brute         817M         729M      641M   
               kd_tree       88.1M         92M       128M   
              ball_tree      87.9M        91.7M      127M   
             =========== ============== ========= ==========
             
[83.33%] ··· neighbors.RadiusNeighborsRegressorBenchmark.time_predict                                                                                ok
[83.33%] ··· =========== ============== ========== ============
             --                    dimension / n_jobs          
             ----------- --------------------------------------
              algorithm   very-low / 4   low / 4     high / 4  
             =========== ============== ========== ============
                brute       906±40ms     939±30ms   1.02±0.04s 
               kd_tree      116±20ms     130±20ms    320±60ms  
              ball_tree     166±20ms     890±40ms   4.31±0.04s 
             =========== ============== ========== ============

Any other comments?

After these changes, profiling indicates that more than half of the time (for low-dimensional datasets) is spent in the _get_weights function. I was able to get a speedup on it, but the behavior changed slightly when a query point is identical to one in the training set. Hence, I'm not pushing this change, but additional suggestions are welcome.
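For context, the tricky case is the `weights="distance"` rule: a query point identical to a training point yields a zero distance and thus an infinite inverse weight, which scikit-learn resolves by giving exact matches all the weight and zeroing out the rest. A sketch of that rule (an illustration of the behavior, not the actual `_get_weights` source):

```python
import numpy as np

def distance_weights(dist):
    """Inverse-distance weights, with exact matches taking all the weight.

    dist: 2-D array, one row of neighbor distances per query point.
    """
    with np.errstate(divide="ignore"):
        w = 1.0 / dist  # zero distances become inf here
    exact = np.isinf(w)
    # For rows containing an exact match, only the exact matches count
    # (weight 1), and every other neighbor gets weight 0.
    rows_with_exact = exact.any(axis=1)
    w[rows_with_exact] = exact[rows_with_exact].astype(float)
    return w
```

A vectorized rewrite that skips this special-casing is faster but changes predictions exactly when a query coincides with a training point, which matches the behavior change described above.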

@JoOkuma JoOkuma changed the title EHN: RadiusNeaighborRegressor speedup EHN: RadiusNeighborRegressor speedup Jul 29, 2022
@jjerphan (Member) left a comment:

Hi @JoOkuma.

Thank you for this contribution. I appreciate the reported performance improvements; unfortunately, relying on this trick using a CSR matrix might in fact be detrimental in some configurations.

Also, it is more appropriate for the long term to introduce a specialised back-end for RadiusNeighborRegressor.predict, improving performance for all configurations.

See: #23721 (comment) and #22587.

algorithm, dimension, n_jobs = params

estimator = RadiusNeighborsRegressor(
algorithm=algorithm, n_jobs=n_jobs, radius=0.05
Member:

Due to the curse of dimensionality, the radius is notably small for dim_to_number >= 20. Is it possible to scale it by log(dim_to_number) or to define it as the 10th percentile of the observed distances?
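The reviewer's second suggestion could look like the following hypothetical helper (the name and defaults are illustrative, not benchmark code from this PR): pick the radius as a low percentile of pairwise distances observed on a random subsample, so neighborhoods stay non-degenerate as the dimension grows.

```python
import numpy as np

def radius_from_percentile(X, percentile=10, sample_size=1000, seed=0):
    """Set the query radius to a low percentile of observed pairwise distances.

    Distances are measured on a random subsample of X to keep the
    O(sample_size**2) cost bounded. Names and defaults are illustrative.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    S = X[idx]
    # Full pairwise Euclidean distance matrix of the subsample.
    d = np.sqrt(((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1))
    d = d[np.triu_indices_from(d, k=1)]  # unique off-diagonal pairs only
    return float(np.percentile(d, percentile))
```

Unlike a fixed `radius=0.05`, this adapts to the distance scale of each dimensionality, so the benchmark exercises comparably sized neighborhoods across the `dim_to_number` settings.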


n_features = self.dim_to_number[dimension]

data = _synth_regression_dataset(n_samples=10000, n_features=n_features)
Member:

A similar improvement has been tried in #23721, but suffered from memory overconsumption as n_samples got larger (see #23721 (review)).

Is it possible to report ASV results for n_samples=int(1e6) or even n_samples=int(1e7)?

]
)
neigh_dst = np.concatenate(neigh_ind, axis=0)
neigh_src = np.repeat(np.arange(n_samples), repeats=n_neigh)
Member:

This matrix might have a significant memory footprint.
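The concern can be quantified with a back-of-envelope estimate, assuming scipy's defaults of float64 data and int32 indices: a CSR matrix stores one value and one column index per stored neighbor, plus one row offset per query.

```python
def csr_mem_bytes(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Rough CSR memory footprint: data + indices + indptr.

    Assumes float64 values (8 bytes) and int32 indices (4 bytes),
    scipy's defaults for matrices small enough for 32-bit indexing.
    """
    data = nnz * value_bytes            # one value per stored entry
    indices = nnz * index_bytes         # one column index per stored entry
    indptr = (n_rows + 1) * index_bytes  # one offset per row, plus one
    return data + indices + indptr
```

For instance, at n_samples=int(1e7) with an average of 100 neighbors per query, nnz is 1e9 and the weight matrix alone takes roughly 12 GB, which is the regime where #23721 ran into trouble.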

@jjerphan (Member) commented:

Hi @JoOkuma, do you still have time to pursue this work?

@JoOkuma (Author) commented Jan 12, 2023

Hi, not right now. I might give it a shot in a few weeks.
Please feel free to close it if you'd like.

3 participants