ggml: Support OpenMP for multi-thread processing #7606
Conversation
Here is the same test with NKVO that I have been running in the other thread pool PRs:
I have also tested with CPU only, on native Windows with MSVC, and on an Apple MBP with an M3 Max, and I haven't found any performance regressions. So I have to say congratulations, this is the first thread pool attempt that achieves a very large reduction in the thread launch overhead without any significant downsides. I think we could use OMP by default. Some concerns:
On an old 4 core/8 thread Xeon v2 there's a small regression with four threads but it's the same as master with eight. Typically with these threadpool implementations the speedup only happens when you use a lot of threads and thus have to deal with the overhead.
I tried with a very small stories260k model, since that should exacerbate the overhead of starting the threads, but it is still consistently faster with this PR (except with 1 thread). Is it really a given that waking up a thread needs to be more expensive than starting a thread?
OMP_PROC_BIND=TRUE
@slaren Thank you for the comments and performance test! I have updated the code based on your feedback.
As you say, deadlocks can occur when the number of threads specified in the plan exceeds the number of threads the OpenMP runtime actually provides. Instead, how about limiting the number of threads with omp_get_max_threads() in ggml_graph_plan, like this?
Without this modification, deadlock occurred in the above cases.
Thanks, I checked here:
If n_threads is 1, the main thread acts as the OMP master, so the cost of waking up threads may not be a problem. However, the modification is simple, so it would also be possible to branch only when n_threads is 1.
I changed it to update
Thanks for the correction!
@ggerganov PTAL, and let's merge this if you don't have any objections. It will significantly improve performance in some cases, and it will allow me to continue work on #6210.
Sorry for the delay - I have missed the request for review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
We need to install
https://github.com/ggerganov/llama.cpp/blob/master/.devops/server.Dockerfile#L19
Fixed in #7780
FYI, this change throws an "omp.h not found" error on Ubuntu 22.04.
Currently, ggml multi-thread processing is implemented with pthread_create and pthread_join. This is simple and straightforward, but it has the following issue: the thread-creation overhead shows up in profiles (e.g. with perf) because the threads are re-created every time a token is generated. Therefore, I suggest an option to use OpenMP, which is a widely available multi-threading library. It is optional, and by default the previous implementation is used.
Below are the results measured on an AWS m7i.24xlarge (Intel Xeon 8488).
[Throughput tables in token/sec: pthread implementation, OpenMP implementation, and OpenMP with OMP_PROC_BIND=TRUE]
The OpenMP version has slightly better performance than the pthread implementation, and setting OMP_PROC_BIND=TRUE improves performance at high thread counts (Threads >~ 36).
It would also be good for users to have additional OpenMP options to tune for their system.
Any comments are welcome!
related : #7342 #7526