-
-
Notifications
You must be signed in to change notification settings - Fork 25.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEA Add support for float32 on PairwiseDistancesReduction
using Tempita
#23865
Conversation
Thanks for the updated PR. I assume that merging with main is needed before starting to review this. Could you run the benchmarks with more imbalanced train / tests, e.g. I wonder if the performance slowdown for a very large number of threads is caused by the fact that we have two few chunks to execute per-thread and using more imbalanced benchmark cases might validate (or invalidate) this hypothesis. |
still, a 50x speed-up w.r.t. |
You're welcome! I did not know if this PR was submitted correctly while travelling (🍀), another one is to come for a new
Yes, I am quite glad we can reach those performance. I don't think we need to adapt the chunk size for the float32 case because there's little additional memory due to data-structures (the extra datastructures memory-wise are just the original
Yes. Let's try that.
I share the same hypothesis. I think we can explore a strategies to have a minimal number of batch per thread in another PR (a task I have added in the TODO list in the description). |
I've adapted the description with updated benchmarks script and results. It looks like the implementations scales well on the The drop is mainly present when using too much threads. I think the Probably we could have had advertised the practical support for really large datasets more in the changelog for 1.1? :) |
WOW |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are on the path to make everything use Tempita :)
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
It looks like yes. IMO, even if it's suboptimal, restrictive and hard to maintain on the long run, it's rather a pragmatic solution given where we are at today. From IRL discussions this week 🍀, it looks like @adrinjalali is interested to experiment with alternatives, like Rust. I think it's worth exploring, but might add complexity especially on the build setup and on interfacing with other libraries like BLAS and colleagues. Probably the work on #22438 might help? |
sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.pxd.tp
Outdated
Show resolved
Hide resolved
sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.pyx.tp
Outdated
Show resolved
Hide resolved
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a minor nit: otherwise LGTM
sklearn/metrics/_pairwise_distances_reduction/_gemm_term_computer.pxd.tp
Outdated
Show resolved
Hide resolved
Done with: grep -rl need_upcast . | xargs sed -i's/need_upcast/upcast_to_float64/g' Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
PairwiseDistancesReduction
to 32bit using TempitaPairwiseDistancesReduction
using Tempita
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did another quick pass. LGTM. Let's merge and handle type renaming in dedicated PRs (e.g. #24153).
Thanks @jjerphan 🎉 |
Thanks @ogrisel and @thomasjpfan for the reviews! |
This update the branch after the merge of scikit-learn#23865.
…pita (scikit-learn#23865) Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Reference Issues/PRs
Follows-up #22134
What does this implement/fix? Explain your changes.
This ports
PairwiseDistancesReduction
and other implementations to 32bit using Tempita.Benchmarks results
The hardware scalability plateaus at 64 threads because, asymptotically and using Adham's law, 2.5% of the code (which parts of it are due to interaction with CPython) is sequential.
Improved hardware scalability beyond that point, mean removing the last portions of sequential code accounting for the few points of percentage left.
Raw results
Details
Benchmarks results between
main
(a5d50cf) and this PR @ 31b8b28 (via 2c842bd)Between ×1.2 and, well, ×250+ speed-ups: it looks like it just scales linearly.
Regressions are due to using too many cores when the size of the problem (i.e.
n_train
andn_test
) is small.1 thread
2 threads
4 threads
8 threads
16 threads
32 threads
64 threads
128 threads
Benchmarks information
Machine specification