
ggml : automatic selection of best CPU backend #10606

Merged
merged 4 commits into master from sl/dl-backend-4 on Dec 1, 2024

Conversation

@slaren slaren (Collaborator) commented Nov 30, 2024

This is how it works:

  • Backends can export a function called ggml_backend_score
  • When loading a backend, all the available variants are checked and the one with the highest score is loaded
  • A score of 0 means that the backend cannot be used on the current system
  • The available variants are discovered automatically based on the file name; for example, when loading the CPU backend, all files that match libggml-cpu-*.so (or ggml-cpu-*.dll on Windows) are checked (see the sketch after this list)
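
For illustration, here is a minimal POSIX-only sketch of the discovery-and-scoring loop described above. The helper name load_best_cpu_backend and the directory argument are hypothetical; only the exported symbol name ggml_backend_score comes from this PR, and its int(void) signature is an assumption.

// Hedged sketch of the variant discovery described above (POSIX only).
// Hypothetical helper; not ggml's actual internal API.
#include <dirent.h>
#include <dlfcn.h>
#include <fnmatch.h>
#include <limits.h>
#include <stdio.h>

typedef int (*score_fn_t)(void);          // assumed signature of ggml_backend_score

// Returns a handle to the highest-scoring CPU backend variant in `dir`,
// or NULL if none of them can run on this system.
static void * load_best_cpu_backend(const char * dir) {
    void * best_handle = NULL;
    int    best_score  = 0;               // a score of 0 means "unusable"

    DIR * d = opendir(dir);
    if (!d) return NULL;

    struct dirent * ent;
    while ((ent = readdir(d)) != NULL) {
        if (fnmatch("libggml-cpu-*.so", ent->d_name, 0) != 0) {
            continue;                      // not a CPU backend variant
        }
        char path[PATH_MAX];
        snprintf(path, sizeof(path), "%s/%s", dir, ent->d_name);

        void * handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
        if (!handle) continue;

        score_fn_t score_fn = (score_fn_t) dlsym(handle, "ggml_backend_score");
        int score = score_fn ? score_fn() : 0;

        if (score > best_score) {          // keep the best variant seen so far
            if (best_handle) dlclose(best_handle);
            best_handle = handle;
            best_score  = score;
        } else {
            dlclose(handle);               // unusable, or worse than current best
        }
    }
    closedir(d);
    return best_handle;
}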

The CPU backend implements this functionality for x86-64 and returns a score based on how many of the features enabled in the build are supported by the running system.
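
As a rough illustration of what such a score function could look like for an x86-64 build, here is a hedged sketch that uses GCC/Clang's __builtin_cpu_supports for runtime detection and the compiler's predefined macros to see which features the variant was built with. It is not ggml's actual implementation, which checks more features (e.g. F16C, the AVX-512 subsets and AMX) and may weight them differently.

// Illustrative sketch only; the signature and scoring scheme are assumptions.
int ggml_backend_score(void) {
    int score = 1;                         // baseline: plain x86-64 build

#if defined(__AVX__)
    if (!__builtin_cpu_supports("avx"))  return 0;   // build needs AVX, CPU lacks it
    score += 1;
#endif
#if defined(__FMA__)
    if (!__builtin_cpu_supports("fma"))  return 0;
    score += 1;
#endif
#if defined(__AVX2__)
    if (!__builtin_cpu_supports("avx2")) return 0;
    score += 1;
#endif
#if defined(__AVX512F__)
    if (!__builtin_cpu_supports("avx512f")) return 0;
    score += 1;
#endif

    // 0 means this variant cannot run here; otherwise, more features => higher score.
    return score;
}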

The llama-server Docker image has been updated to include variants for AVX, AVX2, AVX512 and AMX.

Caveat: the AVX and AVX2 variants still require FMA and F16C, which will limit the number of processors supported. More variants may be needed to fully support some microarchitectures.

@github-actions github-actions bot added the script (Script related), devops (improvements to build systems and github actions) and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 30, 2024
@slaren slaren force-pushed the sl/dl-backend-4 branch 2 times, most recently from ea35fd8 to dadab7c on November 30, 2024 20:04
@github-actions github-actions bot added the build (Compilation issues) label on Dec 1, 2024
@slaren slaren merged commit 3420909 into master Dec 1, 2024
50 checks passed
@slaren slaren deleted the sl/dl-backend-4 branch December 1, 2024 15:12
@giladgd giladgd (Contributor) commented Dec 1, 2024

I think it may be worth having a cmake flag to build all the common CPU backend variants in a single build, rather than having to build multiple times and combine the backend libraries manually.
Having this would make it easier to maintain a centralized list of the common configurations that projects using llama.cpp would support.

@slaren slaren (Collaborator, Author) commented Dec 2, 2024

Yes, I agree that would be better. TBH, selecting the variants is a headache because there are so many options and each microarchitecture supports a different subset of them, so I didn't want to think too much about it. It might make more sense to build a variant for each microarchitecture, but there are going to be a lot of them.

@giladgd giladgd (Contributor) commented Dec 2, 2024

Is there an advantage to building a separate binary for each microarchitecture over determining what features can be used at runtime?
I've seen both runtime and compile-time feature detection in the codebase, but I'm not sure I understand why some features are detected only at compile time.
Having a single binary that can adapt to the system it runs on would be much easier to use, so I'm wondering what the limitations of this approach are.

@slaren slaren (Collaborator, Author) commented Dec 2, 2024

That's not really an option for many reasons. It would probably require rewriting all the code in ASM and using a JIT compiler.
Why would that be easier to use? The best backend is loaded automatically; it doesn't require you to do anything.
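
To illustrate why the instruction set is fixed at compile time rather than chosen at runtime, here is a minimal example (not ggml's actual code): the vector width, the intrinsics and the loop body are all selected by the preprocessor, so a single translation unit can only target one feature set. Adapting at runtime means compiling and shipping each such kernel separately, which is what the per-variant shared libraries do at the level of the whole backend.

// Hedged illustration of compile-time SIMD selection; the kernel is made up.
#include <stddef.h>

#if defined(__AVX2__)
#include <immintrin.h>
// AVX2 build: processes 8 floats per iteration with 256-bit registers.
void vec_add(float * z, const float * x, const float * y, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 a = _mm256_loadu_ps(x + i);
        __m256 b = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(z + i, _mm256_add_ps(a, b));
    }
    for (; i < n; i++) z[i] = x[i] + y[i];   // scalar tail
}
#else
// Scalar fallback build: no AVX2 instructions are emitted at all.
void vec_add(float * z, const float * x, const float * y, size_t n) {
    for (size_t i = 0; i < n; i++) z[i] = x[i] + y[i];
}
#endif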

@giladgd giladgd (Contributor) commented Dec 2, 2024

More variations of the CPU backend compiled with different flags means more duplication of the same code, which would increase the total build size.
Also, having many different variations with different combinations of flags would produce many more files and could take much longer to compile as more variations are added over time.
I was thinking more towards extracting the relevant functions into very small libraries that can be loaded dynamically based on runtime feature detection, rather than loading the entire backend compiled with different combinations of flags.

@slaren slaren (Collaborator, Author) commented Dec 2, 2024

I was thinking more towards extracting the relevant functions to very small libraries that can be loaded dynamically based on runtime feature detection

That's pretty much what this is already. The build time and build size are not really significant; the CPU backend builds very quickly and most variants are below 500 kB in size. E.g. these are the variants built in #10626:

-rwxr-xr-x 1 diego diego 412K Dec  2 23:58 libggml-cpu-alderlake.so*
-rwxr-xr-x 1 diego diego 412K Dec  2 23:58 libggml-cpu-haswell.so*
-rwxr-xr-x 1 diego diego 488K Dec  2 23:58 libggml-cpu-icelake.so*
-rwxr-xr-x 1 diego diego 412K Dec  2 23:58 libggml-cpu-sandybridge.so*
-rwxr-xr-x 1 diego diego 709K Dec  2 23:58 libggml-cpu-sapphirerapids.so*
-rwxr-xr-x 1 diego diego 488K Dec  2 23:58 libggml-cpu-skylakex.so*

@giladgd giladgd (Contributor) commented Dec 2, 2024

Then all is good. Thanks for explaining :)

tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Dec 7, 2024
* ggml : automatic selection of best CPU backend

* amx : minor opt

* add GGML_AVX_VNNI to enable avx-vnni, fix checks
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
* ggml : automatic selection of best CPU backend

* amx : minor opt

* add GGML_AVX_VNNI to enable avx-vnni, fix checks