
ggml : automatic selection of best CPU backend #10606

Merged
merged 4 commits into master from sl/dl-backend-4 on Dec 1, 2024

Conversation

@slaren slaren (Collaborator) commented Nov 30, 2024

This is how it works:

  • Backends can export a function called ggml_backend_score
  • When loading a backend, all the available variants are checked and the one with the highest score is loaded
  • A score of 0 means that the backend cannot be used on the current system
  • The available variants are discovered automatically based on the file name; for example, when loading the CPU backend, all files that match libggml-cpu-*.so (or ggml-cpu-*.dll on Windows) are checked (see the sketch after this list)
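
For illustration, here is a minimal POSIX-only sketch of the discovery-and-scoring loop described above. The helper name load_best_cpu_backend and the directory argument are hypothetical; only the exported symbol name ggml_backend_score comes from this PR, and its int(void) signature is an assumption.

// Hedged sketch of the variant discovery described above (POSIX only).
// Hypothetical helper; not ggml's actual internal API.
#include <dirent.h>
#include <dlfcn.h>
#include <fnmatch.h>
#include <limits.h>
#include <stdio.h>

typedef int (*score_fn_t)(void);          // assumed signature of ggml_backend_score

// Returns a handle to the highest-scoring CPU backend variant in `dir`,
// or NULL if none of them can run on this system.
static void * load_best_cpu_backend(const char * dir) {
    void * best_handle = NULL;
    int    best_score  = 0;               // a score of 0 means "unusable"

    DIR * d = opendir(dir);
    if (!d) return NULL;

    struct dirent * ent;
    while ((ent = readdir(d)) != NULL) {
        if (fnmatch("libggml-cpu-*.so", ent->d_name, 0) != 0) {
            continue;                      // not a CPU backend variant
        }
        char path[PATH_MAX];
        snprintf(path, sizeof(path), "%s/%s", dir, ent->d_name);

        void * handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
        if (!handle) continue;

        score_fn_t score_fn = (score_fn_t) dlsym(handle, "ggml_backend_score");
        int score = score_fn ? score_fn() : 0;

        if (score > best_score) {          // keep the best variant seen so far
            if (best_handle) dlclose(best_handle);
            best_handle = handle;
            best_score  = score;
        } else {
            dlclose(handle);               // unusable, or worse than current best
        }
    }
    closedir(d);
    return best_handle;
}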

The CPU backend implements this functionality for x86-64 and returns a score based on how many of the features enabled in the build are supported by the running system.
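
As a rough illustration of what such a score function could look like for an x86-64 build, here is a hedged sketch that uses GCC/Clang's __builtin_cpu_supports for runtime detection and the compiler's predefined macros to see which features the variant was built with. It is not ggml's actual implementation, which checks more features (e.g. F16C, the AVX-512 subsets and AMX) and may weight them differently.

// Illustrative sketch only; the signature and scoring scheme are assumptions.
int ggml_backend_score(void) {
    int score = 1;                         // baseline: plain x86-64 build

#if defined(__AVX__)
    if (!__builtin_cpu_supports("avx"))  return 0;   // build needs AVX, CPU lacks it
    score += 1;
#endif
#if defined(__FMA__)
    if (!__builtin_cpu_supports("fma"))  return 0;
    score += 1;
#endif
#if defined(__AVX2__)
    if (!__builtin_cpu_supports("avx2")) return 0;
    score += 1;
#endif
#if defined(__AVX512F__)
    if (!__builtin_cpu_supports("avx512f")) return 0;
    score += 1;
#endif

    // 0 means this variant cannot run here; otherwise, more features => higher score.
    return score;
}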

The llama-server Docker image has been updated to include variants for AVX, AVX2, AVX512 and AMX.

Caveat: the AVX and AVX2 variants still require FMA and F16C, which will limit the number of processors supported. More variants may be needed to fully support some microarchitectures.

@github-actions github-actions bot added the script (Script related), devops (improvements to build systems and github actions) and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 30, 2024
@slaren slaren force-pushed the sl/dl-backend-4 branch 2 times, most recently from ea35fd8 to dadab7c on November 30, 2024 20:04
@github-actions github-actions bot added the build (Compilation issues) label on Dec 1, 2024
@slaren slaren merged commit 3420909 into master Dec 1, 2024
50 checks passed
@slaren slaren deleted the sl/dl-backend-4 branch December 1, 2024 15:12
@giladgd giladgd (Contributor) commented Dec 1, 2024

I think it may be worth having a cmake flag to build all the common CPU backend variants in a single build, rather than having to build multiple times and combine the backend libraries manually.
Having this would make it easier to maintain a centralized list of the common configurations that projects using llama.cpp would support.

@slaren slaren (Collaborator, Author) commented Dec 2, 2024

Yes, I agree that would be better. TBH, selecting the variants is a headache because there are so many options and each microarchitecture supports a different subset of them, so I didn't want to think too much about it. It might make more sense to build a variant for each microarchitecture, but there are going to be a lot of them.

@giladgd giladgd (Contributor) commented Dec 2, 2024

Is there an advantage to building a separate binary for each microarchitecture over determining what features can be used at runtime?
I've seen both runtime and compile-time feature detection in the codebase, but I'm not sure I understand why some features are detected only at compile time.
Having a single binary that can adapt to the system it runs on would be much easier to use, so I'm wondering what the limitations of this approach are.

@slaren slaren (Collaborator, Author) commented Dec 2, 2024

That's not really an option for many reasons. It would probably require rewriting all the code in ASM and using a JIT compiler.
Why would that be easier to use? The best backend is loaded automatically; it doesn't require you to do anything.
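
To illustrate why the instruction set is fixed at compile time rather than chosen at runtime, here is a minimal example (not ggml's actual code): the vector width, the intrinsics and the loop body are all selected by the preprocessor, so a single translation unit can only target one feature set. Adapting at runtime means compiling and shipping each such kernel separately, which is what the per-variant shared libraries do at the level of the whole backend.

// Hedged illustration of compile-time SIMD selection; the kernel is made up.
#include <stddef.h>

#if defined(__AVX2__)
#include <immintrin.h>
// AVX2 build: processes 8 floats per iteration with 256-bit registers.
void vec_add(float * z, const float * x, const float * y, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 a = _mm256_loadu_ps(x + i);
        __m256 b = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(z + i, _mm256_add_ps(a, b));
    }
    for (; i < n; i++) z[i] = x[i] + y[i];   // scalar tail
}
#else
// Scalar fallback build: no AVX2 instructions are emitted at all.
void vec_add(float * z, const float * x, const float * y, size_t n) {
    for (size_t i = 0; i < n; i++) z[i] = x[i] + y[i];
}
#endif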

@giladgd giladgd (Contributor) commented Dec 2, 2024

More variations of the CPU backend compiled with different flags means more duplication of the same code, which would increase the total build size.
Also, having many different variations with different combinations of flags would produce many more files and could take much longer to compile as more variations are added over time.
I was thinking more towards extracting the relevant functions into very small libraries that can be loaded dynamically based on runtime feature detection, rather than loading the entire backend compiled with different combinations of flags.

@slaren slaren (Collaborator, Author) commented Dec 2, 2024

I was thinking more towards extracting the relevant functions to very small libraries that can be loaded dynamically based on runtime feature detection

That's pretty much what this is already. The build time and build size are not really significant; the CPU backend builds very quickly and most variants are below 500 kB in size. E.g. these are the variants built in #10626:

-rwxr-xr-x 1 diego diego 412K Dec  2 23:58 libggml-cpu-alderlake.so*
-rwxr-xr-x 1 diego diego 412K Dec  2 23:58 libggml-cpu-haswell.so*
-rwxr-xr-x 1 diego diego 488K Dec  2 23:58 libggml-cpu-icelake.so*
-rwxr-xr-x 1 diego diego 412K Dec  2 23:58 libggml-cpu-sandybridge.so*
-rwxr-xr-x 1 diego diego 709K Dec  2 23:58 libggml-cpu-sapphirerapids.so*
-rwxr-xr-x 1 diego diego 488K Dec  2 23:58 libggml-cpu-skylakex.so*

@giladgd giladgd (Contributor) commented Dec 2, 2024

Then all is good. Thanks for explaining :)

tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Dec 7, 2024
* ggml : automatic selection of best CPU backend

* amx : minor opt

* add GGML_AVX_VNNI to enable avx-vnni, fix checks
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
* ggml : automatic selection of best CPU backend

* amx : minor opt

* add GGML_AVX_VNNI to enable avx-vnni, fix checks