FEA Add support for float32 on PairwiseDistancesReduction using Tempita #23865

Merged
merged 22 commits into from
Aug 10, 2022

@jjerphan jjerphan commented Jul 8, 2022

Reference Issues/PRs

Follow-up to #22134

What does this implement/fix? Explain your changes.

This ports PairwiseDistancesReduction and other implementations to 32-bit (float32) using Tempita.
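As a rough illustration of template-based type specialization (not the actual templates of this PR), here is a minimal sketch that uses Python's standard `string.Template` as a stand-in for Tempita. The `Reduction` class name, the `SOURCE` template text, and the `DTYPES` mapping are all hypothetical and exist only for this example.

```python
from string import Template

# Hypothetical template: one source text, specialized once per floating-point width.
# string.Template stands in for Tempita here; the real templates are Cython sources.
SOURCE = Template(
    "cdef class Reduction$suffix:\n"
    "    cdef const $ctype[:, ::1] X\n"
)

# Assumed mapping from a class-name suffix to the C type used in the generated code.
DTYPES = {"64": "double", "32": "float"}


def render_all(template, dtypes):
    """Return {suffix: generated source}, with one specialization per dtype."""
    return {
        suffix: template.substitute(suffix=suffix, ctype=ctype)
        for suffix, ctype in dtypes.items()
    }


generated = render_all(SOURCE, DTYPES)
print(generated["32"])
```

The point of the approach is that the 32-bit and 64-bit variants share a single source text, so a fix made in the template applies to both.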

Benchmarks results

Hardware scalability plateaus at 64 threads because, asymptotically and according to Amdahl's law, about 2.5% of the code (part of which is due to interactions with CPython) is sequential.

Improving hardware scalability beyond that point means removing the last portions of sequential code, which account for the few remaining percent.
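To make the Amdahl's law argument concrete, here is a small sketch that evaluates the theoretical speed-up for the thread counts used in the benchmarks, taking the 2.5% sequential fraction quoted above as given. The function name `amdahl_speedup` is just for this example.

```python
def amdahl_speedup(n_threads, sequential_fraction):
    """Theoretical speed-up on n_threads when a fixed fraction of the work is sequential."""
    parallel_fraction = 1.0 - sequential_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / n_threads)


SEQUENTIAL_FRACTION = 0.025  # the 2.5% figure quoted above

for n in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"{n:>3} threads: x{amdahl_speedup(n, SEQUENTIAL_FRACTION):.1f}")

# Upper bound as the number of threads grows without limit: 1 / 0.025 = 40.
print(f"limit: x{1.0 / SEQUENTIAL_FRACTION:.0f}")
```

Under this model the speed-up can never exceed 1 / 0.025 = 40, whatever the number of threads, which is why only a reduction of the sequential part can push scalability further.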

1

Raw results
    n_threads  n_train  n_test  n_features  mean_runtime  stderr_runtime
0           1   100000  100000          50     54.270973               0
1           2   100000  100000          50     27.357690               0
2           4   100000  100000          50     13.772927               0
3           8   100000  100000          50      7.034176               0
4          16   100000  100000          50      3.851457               0
5          32   100000  100000          50      2.134666               0
6          64   100000  100000          50      1.482027               0
7         128   100000  100000          50      2.239688               0
8           1   100000  100000         100     77.925089               0
9           2   100000  100000         100     39.125349               0
10          4   100000  100000         100     19.810733               0
11          8   100000  100000         100     10.130284               0
12         16   100000  100000         100      5.506694               0
13         32   100000  100000         100      3.067685               0
14         64   100000  100000         100      2.061337               0
15        128   100000  100000         100      3.396916               0
16          1   100000  100000         500    274.099079               0
17          2   100000  100000         500    138.078319               0
18          4   100000  100000         500     70.136737               0
19          8   100000  100000         500     35.598209               0
20         16   100000  100000         500     19.321611               0
21         32   100000  100000         500     10.415704               0
22         64   100000  100000         500      7.194686               0
23        128   100000  100000         500     12.095341               0

2

Details
    n_threads  n_train  n_test  n_features  mean_runtime  stderr_runtime
0           1  1000000   10000          50     53.670283               0
1           2  1000000   10000          50     27.603052               0
2           4  1000000   10000          50     14.014151               0
3           8  1000000   10000          50      7.138670               0
4          16  1000000   10000          50      3.810226               0
5          32  1000000   10000          50      2.129321               0
6          64  1000000   10000          50      1.363076               0
7         128  1000000   10000          50      1.540974               0
8           1  1000000   10000         100     77.725753               0
9           2  1000000   10000         100     39.835435               0
10          4  1000000   10000         100     20.107033               0
11          8  1000000   10000         100     10.242633               0
12         16  1000000   10000         100      5.499012               0
13         32  1000000   10000         100      3.151450               0
14         64  1000000   10000         100      2.051802               0
15        128  1000000   10000         100      2.319589               0
16          1  1000000   10000         500    274.992947               0
17          2  1000000   10000         500    140.689740               0
18          4  1000000   10000         500     70.843511               0
19          8  1000000   10000         500     36.023845               0
20         16  1000000   10000         500     19.761463               0
21         32  1000000   10000         500     10.633548               0
22         64  1000000   10000         500      7.017808               0
23        128  1000000   10000         500      8.313477               0
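The raw numbers can be turned into speed-ups and parallel efficiencies with a few lines. The sketch below uses the `mean_runtime` values of the first table (n_train = n_test = 100000, n_features = 50); the helper name `scaling_report` is made up for this example.

```python
# Mean runtimes in seconds, copied from the first raw-results table
# (n_train = n_test = 100000, n_features = 50), keyed by n_threads.
RUNTIMES = {
    1: 54.270973,
    2: 27.357690,
    4: 13.772927,
    8: 7.034176,
    16: 3.851457,
    32: 2.134666,
    64: 1.482027,
    128: 2.239688,
}


def scaling_report(runtimes):
    """Return {n_threads: (speed-up vs. 1 thread, parallel efficiency)}."""
    baseline = runtimes[1]
    return {
        n: (baseline / t, baseline / t / n)
        for n, t in sorted(runtimes.items())
    }


report = scaling_report(RUNTIMES)
for n, (speedup, efficiency) in report.items():
    print(f"{n:>3} threads: speed-up x{speedup:5.1f}, efficiency {efficiency:4.0%}")
```

The same helper can be applied to the other row groups by swapping in their `mean_runtime` values.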

Benchmarks results between main (a5d50cf) and this PR @ 31b8b28 (via 2c842bd)

Speed-ups range from ×1.2 to, well, ×250+: it looks like it just scales linearly.
Regressions are due to using too many cores when the size of the problem (i.e. n_train and n_test) is small.
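For readers who want to check these factors against the asv output in this PR, here is a small sketch that converts asv timing strings to seconds and computes a speed-up. The helper names `parse_timing` and `speedup` are hypothetical, and the parser only handles the `ms`, `s` and `m` units that appear in these logs.

```python
import re

# Multipliers to seconds for the units that appear in the asv timings of this PR.
UNIT_SECONDS = {"ms": 1e-3, "s": 1.0, "m": 60.0}


def parse_timing(text):
    """Convert a string such as '20.1±0.1ms' or '1.41±0m' to seconds (mean only)."""
    match = re.fullmatch(r"([\d.]+)±[\d.]+(ms|s|m)", text)
    if match is None:
        raise ValueError(f"unrecognized timing: {text!r}")
    value, unit = match.groups()
    return float(value) * UNIT_SECONDS[unit]


def speedup(before, after):
    """Speed-up factor of `after` relative to `before`."""
    return parse_timing(before) / parse_timing(after)


# Smallest gain: n_train=1000, n_test=1000 on 32 threads, main vs. this PR.
smallest = speedup("21.4±0.5ms", "17.4±0.08ms")
print(round(smallest, 2))

# main timed out at 500 s for n_train=10000000, n_test=1000, so 500 s is only
# a lower bound on its runtime; this PR ran in 1.66 s on 128 threads.
largest_lower_bound = 500.0 / parse_timing("1.66±0.02s")
print(round(largest_lower_bound))
```

Because the runs on main hit the 500 s timeout, the largest factor is a lower bound rather than a measured value.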

1 thread
· Discovering benchmarks
·· Uninstalling from conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
·· Installing 31b8b28b <feat/pdr-32bit> into conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For scikit-learn commit 31b8b28b <feat/pdr-32bit> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         2/9 failed
[ 50.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     11.3±0.2ms     105±2ms       1.06±0s    
                10000     89.6±0.9ms     878±4ms      8.76±0.01s  
               10000000    1.41±0m        failed        failed    
              ========== ============ ============= ==============

[ 50.00%] ···· For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

[ 50.00%] · For scikit-learn commit a5d50cf3 <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[100.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         3/9 failed
[100.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     20.1±0.1ms     193±1ms       2.02±0s    
                10000      203±1ms       2.12±0s       21.0±0s    
               10000000     failed        failed        failed    
              ========== ============ ============= ==============

[100.00%] ···· For parameters: 10000000, 1000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

       before           after         ratio
     [a5d50cf3]       [31b8b28b]
     <main>           <feat/pdr-32bit>
-      20.1±0.1ms       11.3±0.2ms     0.56  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-         193±1ms          105±2ms     0.54  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-         2.02±0s          1.06±0s     0.52  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)
-         203±1ms       89.6±0.9ms     0.44  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         21.0±0s       8.76±0.01s     0.42  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 100000, 100)
-         2.12±0s          878±4ms     0.41  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

2 threads
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
·· Installing 31b8b28b <feat/pdr-32bit> into conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For scikit-learn commit 31b8b28b <feat/pdr-32bit> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         2/9 failed
[ 50.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     6.89±0.2ms    53.9±0.7ms     531±3ms    
                10000     46.7±0.2ms    443±0.5ms      4.33±0s    
               10000000   42.7±0.06s      failed        failed    
              ========== ============ ============= ==============

[ 50.00%] ···· For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

[ 50.00%] · For scikit-learn commit a5d50cf3 <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[100.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         3/9 failed
[100.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     18.3±0.2ms     171±1ms       1.78±0s    
                10000     179±0.9ms      1.86±0s       18.6±0s    
               10000000     failed        failed        failed    
              ========== ============ ============= ==============

[100.00%] ···· For parameters: 10000000, 1000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

       before           after         ratio
     [a5d50cf3]       [31b8b28b]
     <main>           <feat/pdr-32bit>
-      18.3±0.2ms       6.89±0.2ms     0.38  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-         171±1ms       53.9±0.7ms     0.32  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-         1.78±0s          531±3ms     0.30  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)
-       179±0.9ms       46.7±0.2ms     0.26  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         1.86±0s        443±0.5ms     0.24  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
-         18.6±0s          4.33±0s     0.23  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

4 threads
· Discovering benchmarks
·· Uninstalling from conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
·· Installing 31b8b28b <feat/pdr-32bit> into conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For scikit-learn commit 31b8b28b <feat/pdr-32bit> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         2/9 failed
[ 50.00%] ··· ========== ============= ============= ==============
              --                    n_test / n_features            
              ---------- ------------------------------------------
               n_train     1000 / 100   10000 / 100   100000 / 100 
              ========== ============= ============= ==============
                 1000     5.60±0.07ms    29.6±0.4ms     276±2ms    
                10000      27.3±0.3ms     230±1ms      2.23±0.01s  
               10000000    21.7±0.01s      failed        failed    
              ========== ============= ============= ==============

[ 50.00%] ···· For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

[ 50.00%] · For scikit-learn commit a5d50cf3 <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[100.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         3/9 failed
[100.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     17.4±0.2ms    163±0.8ms      1.68±0s    
                10000      172±1ms       1.77±0s      17.5±0.02s  
               10000000     failed        failed        failed    
              ========== ============ ============= ==============

[100.00%] ···· For parameters: 10000000, 1000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

       before           after         ratio
     [a5d50cf3]       [31b8b28b]
     <main>           <feat/pdr-32bit>
-      17.4±0.2ms      5.60±0.07ms     0.32  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-       163±0.8ms       29.6±0.4ms     0.18  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-         1.68±0s          276±2ms     0.16  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)
-         172±1ms       27.3±0.3ms     0.16  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         1.77±0s          230±1ms     0.13  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

8 threads
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
·· Installing 31b8b28b <feat/pdr-32bit> into conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For scikit-learn commit 31b8b28b <feat/pdr-32bit> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         1/9 failed
[ 50.00%] ··· ========== ============= ============= ==============
              --                    n_test / n_features            
              ---------- ------------------------------------------
               n_train     1000 / 100   10000 / 100   100000 / 100 
              ========== ============= ============= ==============
                 1000     6.83±0.03ms    17.1±0.3ms     150±2ms    
                10000      17.3±0.3ms     122±1ms      1.16±0.01s  
               10000000    11.5±0.01s     1.89±0m        failed    
              ========== ============= ============= ==============

[ 50.00%] ···· For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

[ 50.00%] · For scikit-learn commit a5d50cf3 <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[100.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         3/9 failed
[100.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     18.8±0.2ms    167±0.7ms      1.71±0s    
                10000     176±0.6ms      1.80±0s      17.7±0.01s  
               10000000     failed        failed        failed    
              ========== ============ ============= ==============

[100.00%] ···· For parameters: 10000000, 1000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

       before           after         ratio
     [a5d50cf3]       [31b8b28b]
     <main>           <feat/pdr-32bit>
-      18.8±0.2ms      6.83±0.03ms     0.36  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-       167±0.7ms       17.1±0.3ms     0.10  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-       176±0.6ms       17.3±0.3ms     0.10  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         1.71±0s          150±2ms     0.09  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)
-         1.80±0s          122±1ms     0.07  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

16 threads
· Discovering benchmarks
·· Uninstalling from conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
·· Installing 31b8b28b <feat/pdr-32bit> into conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For scikit-learn commit 31b8b28b <feat/pdr-32bit> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         1/9 failed
[ 50.00%] ··· ========== ============= ============= ==============
              --                    n_test / n_features            
              ---------- ------------------------------------------
               n_train     1000 / 100   10000 / 100   100000 / 100 
              ========== ============= ============= ==============
                 1000     9.97±0.08ms    60.3±0.3ms    87.2±0.6ms  
                10000      15.7±0.2ms    106±0.6ms      631±3ms    
               10000000    6.22±0.02s    59.4±0.2s       failed    
              ========== ============= ============= ==============

[ 50.00%] ···· For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

[ 50.00%] · For scikit-learn commit a5d50cf3 <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[100.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         3/9 failed
[100.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     20.5±0.3ms     168±1ms       1.67±0s    
                10000      175±1ms       1.82±0s      18.0±0.02s  
               10000000     failed        failed        failed    
              ========== ============ ============= ==============

[100.00%] ···· For parameters: 10000000, 1000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

       before           after         ratio
     [a5d50cf3]       [31b8b28b]
     <main>           <feat/pdr-32bit>
-      20.5±0.3ms      9.97±0.08ms     0.49  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-         168±1ms       60.3±0.3ms     0.36  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-         175±1ms       15.7±0.2ms     0.09  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         1.82±0s        106±0.6ms     0.06  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
-         1.67±0s       87.2±0.6ms     0.05  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

32 threads
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
·· Installing 31b8b28b <feat/pdr-32bit> into conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For scikit-learn commit 31b8b28b <feat/pdr-32bit> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         1/9 failed
[ 50.00%] ··· ========== ============= ============= ==============
              --                    n_test / n_features            
              ---------- ------------------------------------------
               n_train     1000 / 100   10000 / 100   100000 / 100 
              ========== ============= ============= ==============
                 1000     17.4±0.08ms    95.7±0.3ms    59.3±0.3ms  
                10000      21.0±0.3ms    92.8±0.3ms     366±20ms   
               10000000    3.49±0.02s    32.3±0.1s       failed    
              ========== ============= ============= ==============

[ 50.00%] ···· For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

[ 50.00%] · For scikit-learn commit a5d50cf3 <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[100.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         3/9 failed
[100.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     21.4±0.5ms     173±2ms       1.64±0s    
                10000      179±3ms       1.82±0s      18.1±0.01s  
               10000000     failed        failed        failed    
              ========== ============ ============= ==============

[100.00%] ···· For parameters: 10000000, 1000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

       before           after         ratio
     [a5d50cf3]       [31b8b28b]
     <main>           <feat/pdr-32bit>
-      21.4±0.5ms      17.4±0.08ms     0.81  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-         173±2ms       95.7±0.3ms     0.55  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-         179±3ms       21.0±0.3ms     0.12  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         1.82±0s       92.8±0.3ms     0.05  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
-         1.64±0s       59.3±0.3ms     0.04  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

64 threads
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
·· Installing 31b8b28b <feat/pdr-32bit> into conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For scikit-learn commit 31b8b28b <feat/pdr-32bit> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         1/9 failed
[ 50.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     33.2±10ms      185±20ms     49.0±0.2ms  
                10000     31.2±0.3ms     169±10ms      246±20ms   
               10000000   2.50±0.04s    20.0±0.02s      failed    
              ========== ============ ============= ==============

[ 50.00%] ···· For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

[ 50.00%] · For scikit-learn commit a5d50cf3 <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[100.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         3/9 failed
[100.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000      37.9±5ms      200±2ms       1.76±0s    
                10000     208±0.8ms     2.07±0.01s    20.1±0.04s  
               10000000     failed        failed        failed    
              ========== ============ ============= ==============

[100.00%] ···· For parameters: 10000000, 1000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

       before           after         ratio
     [a5d50cf3]       [31b8b28b]
     <main>           <feat/pdr-32bit>
-       208±0.8ms       31.2±0.3ms     0.15  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-      2.07±0.01s         169±10ms     0.08  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
-         1.76±0s       49.0±0.2ms     0.03  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

128 threads
· Creating environments
· Discovering benchmarks
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For scikit-learn commit 31b8b28b <feat/pdr-32bit> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         1/9 failed
[ 50.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000      250±30ms     1.45±0.1s     13.9±0.1s   
                10000      239±10ms     1.42±0.05s    12.8±0.1s   
               10000000   1.66±0.02s    13.1±0.08s      failed    
              ========== ============ ============= ==============

[ 50.00%] ···· For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

[ 50.00%] · For scikit-learn commit a5d50cf3 <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[100.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         3/9 failed
[100.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     51.2±10ms      225±2ms      1.85±0.04s  
                10000      226±2ms      2.22±0.03s    21.3±0.03s  
               10000000     failed        failed        failed    
              ========== ============ ============= ==============

[100.00%] ···· For parameters: 10000000, 1000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

       before           after         ratio
     [a5d50cf3]       [31b8b28b]
     <main>           <feat/pdr-32bit>
+         225±2ms        1.45±0.1s     6.45  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
+       51.2±10ms         250±30ms     4.89  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-      2.22±0.03s       1.42±0.05s     0.64  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
-      21.3±0.03s        12.8±0.1s     0.60  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
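
As a reading aid for the comparison tables above: the `ratio` column appears to be the "after" time divided by the "before" time, so values below 1 mean the PR branch is faster. The sketch below recomputes two ratios from the rounded timings printed in the 128-thread table; the helper name `asv_ratio` is made up for illustration and is not part of asv.

```python
def asv_ratio(before_s: float, after_s: float) -> float:
    """Ratio of the 'after' runtime to the 'before' runtime (both in seconds)."""
    return after_s / before_s


# (parameters, time on main, time on the PR branch), copied from the table above.
rows = [
    ("(10000, 10000, 100)", 2.22, 1.42),
    ("(10000, 100000, 100)", 21.3, 12.8),
]

for params, before_s, after_s in rows:
    ratio = asv_ratio(before_s, after_s)
    verdict = "faster" if ratio < 1 else "slower"
    print(f"{params}: ratio={ratio:.2f} ({verdict} on the PR branch)")
```

Because the table shows rounded timings, ratios recomputed this way can differ from the logged ones in the last digit.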
Benchmarks information
Machine specification
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7742 64-Core Processor
Stepping:            0
CPU MHz:             3388.360
BogoMIPS:            4491.59
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-63,128-191
NUMA node1 CPU(s):   64-127,192-255
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

@ogrisel
Member

ogrisel commented Jul 11, 2022

Thanks for the updated PR. I assume that merging with main is needed before starting to review this.

Could you run the benchmarks with more imbalanced train / tests, e.g. n_samples_train = int(1e7) and n_samples_test = 1000?

I wonder if the performance slowdown for a very large number of threads is caused by the fact that we have too few chunks to execute per thread. Using more imbalanced benchmark cases might validate (or invalidate) this hypothesis.
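
A rough way to picture this hypothesis, assuming the work is split into fixed-size chunks along one axis and that chunks are spread evenly over threads. The chunk size of 256 and the helper name are assumptions made for illustration, not values taken from this PR.

```python
import math


def chunks_per_thread(n_samples: int, chunk_size: int, n_threads: int) -> float:
    """Average number of chunks each thread receives when one axis is chunked."""
    n_chunks = math.ceil(n_samples / chunk_size)
    return n_chunks / n_threads


# Hypothetical chunk size; the real value is chosen inside the implementation.
CHUNK_SIZE = 256

for n_samples in (1_000, 100_000, 10_000_000):
    for n_threads in (64, 128):
        ratio = chunks_per_thread(n_samples, CHUNK_SIZE, n_threads)
        print(
            f"n_samples={n_samples:>10} n_threads={n_threads:>3} "
            f"chunks/thread={ratio:.2f}"
        )
```

When the ratio falls below 1, some threads have nothing to do, which is the situation the hypothesis describes.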

@ogrisel
Member

ogrisel commented Jul 11, 2022

still, a 50x speed-up w.r.t. main is nice :)

@jjerphan
Member Author

Thanks for the updated PR. I assume that merging with main is needed before starting to review this.

You're welcome! I was not sure whether this PR had been submitted correctly while I was travelling (🍀). Another one is coming for a new pairwise_distances back-end.

still, a 50x speed-up w.r.t. main is nice :)

Yes, I am quite glad we can reach this level of performance. I don't think we need to adapt the chunk size for the float32 case because the extra data structures take little additional memory (they are just the original X_c and Y_c), but this can be tried in another PR. :)
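
To give a sense of scale for the memory remark, here is a back-of-the-envelope sketch. It assumes that each thread holds two float64 buffers, one per chunk of X and Y, and that the chunk size is 256; both are assumptions for illustration and not figures stated in this PR.

```python
def extra_buffer_bytes(
    chunk_size: int, n_features: int, n_threads: int, itemsize: int = 8
) -> int:
    """Bytes used by two (chunk_size, n_features) buffers per thread."""
    per_buffer = chunk_size * n_features * itemsize
    return 2 * per_buffer * n_threads


# Hypothetical setting: 256-row chunks, 100 features, 128 threads, float64 items.
extra = extra_buffer_bytes(chunk_size=256, n_features=100, n_threads=128)

# Size of a float32 training set with 1e7 rows and 100 features, for comparison.
dataset = 10_000_000 * 100 * 4

print(f"extra buffers: {extra / 2**20:.0f} MiB")
print(f"float32 dataset: {dataset / 2**20:.0f} MiB")
```

Under these assumptions the buffers stay small next to the dataset itself, which is consistent with leaving the chunk size unchanged.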

Could you run the benchmarks with more imbalanced train / tests, e.g. n_samples_train = int(1e7) and n_samples_test = 1000?

Yes. Let's try that.

I wonder if the performance slowdown for a very large number of threads is caused by the fact that we have too few chunks to execute per thread. Using more imbalanced benchmark cases might validate (or invalidate) this hypothesis.

I share the same hypothesis. I think we can explore strategies to guarantee a minimal number of batches per thread in another PR (a task I have added to the TODO list in the description).

@jjerphan
Member Author

jjerphan commented Jul 17, 2022

I've adapted the description with updated benchmarks script and results.

It looks like the implementation scales well on the (n_samples_train, n_samples_test) = (int(1e7), 1000) case. On main, the execution times out after 500s in all cases, even when using 128 threads, whereas this branch completes in less than 2s, reaching "×100+ speed-ups" in this case (see the raw logs in this PR description).

The drop is mainly present when using too many threads. I think the PairwiseDistancesArgKmin used under the hood scales even better, and that it is the sequential part at the beginning of kneighbors which might be costly.
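
The effect of a sequential part on scaling can be sketched with Amdahl's law. The 2.5% sequential fraction is the figure quoted in the PR description and is treated here as an assumption; the function name is made up for illustration.

```python
def amdahl_speedup(serial_fraction: float, n_threads: int) -> float:
    """Theoretical speed-up when `serial_fraction` of the runtime cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)


# Sequential fraction quoted in the PR description; an assumption in this sketch.
SERIAL_FRACTION = 0.025

for n_threads in (1, 2, 4, 8, 16, 32, 64, 128):
    speedup = amdahl_speedup(SERIAL_FRACTION, n_threads)
    print(f"n_threads={n_threads:>3} predicted speed-up={speedup:.1f}")

# Upper bound on the speed-up as the number of threads grows without limit.
print(f"asymptotic bound={1.0 / SERIAL_FRACTION:.0f}")
```

Note that Amdahl's law only predicts a plateau. An actual slowdown at high thread counts would need another cause, such as scheduling overhead or too few chunks per thread.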

Perhaps we could have advertised the practical support for really large datasets more in the changelog for 1.1? :)

@adrinjalali
Member

WOW

@jjerphan jjerphan marked this pull request as ready for review July 17, 2022 11:23
@jjerphan jjerphan added Performance float32 Issues related to support for 32bit data labels Jul 17, 2022
Member

@thomasjpfan thomasjpfan left a comment

We are on the path to make everything use Tempita :)

sklearn/manifold/_t_sne.py (outdated, resolved)
sklearn/manifold/_t_sne.py (outdated, resolved)
sklearn/manifold/_t_sne.py (outdated, resolved)
jjerphan and others added 2 commits July 18, 2022 20:51
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@jjerphan
Member Author

We are on the path to make everything use Tempita :)

It looks that way. IMO, even if it's suboptimal, restrictive and hard to maintain in the long run, it's a rather pragmatic solution given where we are today.

From IRL discussions this week 🍀, it looks like @adrinjalali is interested in experimenting with alternatives, like Rust. I think it's worth exploring, but it might add complexity, especially to the build setup and to interfacing with other libraries like BLAS and friends.
Similar concerns also apply to C++.

Perhaps the work on #22438 might help?
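
For readers unfamiliar with the templating approach discussed above, the general idea is to write one source template and render it once per dtype. The sketch below uses Python's standard-library `string.Template` as a stand-in for Tempita, so the syntax differs from the real `.tp` files; the class name, method and type list are illustrative and not copied from scikit-learn.

```python
from string import Template

# Illustrative template text; the names are not taken from scikit-learn's sources.
TEMPLATE = Template(
    "cdef class Reduction${suffix}:\n"
    "    cdef ${ctype} compute(self, const ${ctype}[::1] x) nogil\n"
)

# One entry per supported dtype: (class-name suffix, C-level element type).
SPECIALIZATIONS = [("64", "double"), ("32", "float")]


def render_all() -> str:
    """Render the template once per dtype and join the results."""
    return "\n".join(
        TEMPLATE.substitute(suffix=suffix, ctype=ctype)
        for suffix, ctype in SPECIALIZATIONS
    )


print(render_all())
```

The trade-off mentioned in the comment shows up even in this toy version: the generated code is duplicated per dtype, and the template itself is plain text that ordinary tooling cannot check.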

sklearn/metrics/_dist_metrics.pxd.tp (outdated, resolved)
doc/whats_new/v1.1.rst (outdated, resolved)
doc/whats_new/v1.2.rst (outdated, resolved)
sklearn/manifold/_t_sne.py (outdated, resolved)
sklearn/manifold/_t_sne.py (outdated, resolved)
sklearn/manifold/_t_sne.py (outdated, resolved)
sklearn/metrics/_pairwise_distances_reduction/_base.pxd.tp (outdated, resolved)
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Member

@thomasjpfan thomasjpfan left a comment

I left a minor nit; otherwise LGTM.

Done with:

grep -rl need_upcast . | xargs sed -i 's/need_upcast/upcast_to_float64/g'

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@jjerphan jjerphan changed the title FEA Port PairwiseDistancesReduction to 32bit using Tempita FEA Add support for float32 on PairwiseDistancesReduction using Tempita Jul 27, 2022
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Member

@ogrisel ogrisel left a comment

I did another quick pass. LGTM. Let's merge and handle type renaming in dedicated PRs (e.g. #24153).

@ogrisel ogrisel merged commit b7d0171 into scikit-learn:main Aug 10, 2022
@ogrisel
Member

ogrisel commented Aug 10, 2022

Thanks @jjerphan 🎉

@jjerphan
Member Author

Thanks @ogrisel and @thomasjpfan for the reviews!

@jjerphan jjerphan deleted the feat/pdr-32bit branch August 10, 2022 15:36
jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Aug 11, 2022
This update the branch after the merge of scikit-learn#23865.
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Sep 12, 2022
…pita (scikit-learn#23865)


Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Labels
cython float32 Issues related to support for 32bit data module:metrics Performance
4 participants