Allow building with BLAS/BLIS now that Ollama's runners are not pure native builds of llama.cpp anymore #8402
Description
Eval performance for llama3.2 regressed from ~18 tokens/s to ~8 tokens/s on a Ryzen Threadripper 1820X.
Up to version v0.5.1, I was able to build the official `llama-server` from llama.cpp and use it as part of an Ollama build that skips generation. I'm using AMD's AOCC compiler and AOCL (a BLIS-flavored implementation tuned for AMD cores) on Linux with `-march=znver1`.
I was building `llama-server` (with AOCC and AOCL configured) with:
```sh
cmake -G Ninja -B build \
  -DGGML_BLAS:BOOL=1 \
  -DGGML_BLAS_VENDOR=AOCL_mt \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DGGML_NATIVE:BOOL=0 \
  -DLLAMA_BUILD_TESTS:BOOL=0 \
  -DCMAKE_BUILD_TYPE:STRING=Release \
  -DGGML_AVX:BOOL=1 \
  -DGGML_AVX2:BOOL=1 \
  -DGGML_BUILD_EXAMPLES:BOOL=0 \
  -DBUILD_SHARED_LIBS:BOOL=0 \
  -DGGML_FMA:BOOL=1 \
  -DGGML_F16C:BOOL=1 \
  -DGGML_LTO:BOOL=1 \
  -DCMAKE_C_FLAGS:STRING="-march=znver1" \
  -DCMAKE_CXX_FLAGS:STRING="-march=znver1" \
  -DCMAKE_INSTALL_PREFIX:PATH=/root/llama.cpp/install \
  -DBLAS_INCLUDE_DIRS:PATH=/root/aocl/5.0.0/aocc/include
```
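After configuring, the build and install steps were the standard CMake ones; a minimal sketch, assuming the `llama-server` target name and the install prefix configured above:

```sh
# Build only the server binary, using all available cores
cmake --build build --target llama-server -j"$(nproc)"

# Install into the prefix given via CMAKE_INSTALL_PREFIX
cmake --install build
```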
But now Ollama no longer uses a pure build of llama.cpp. And to make things worse, it passes a `runner` argument that `llama-server` doesn't accept.
From my understanding, the runners are now Go applications that link against llama.cpp at build time.
How can I do a custom build of these Go runners that uses BLIS and lets me pass `-march=znver1` at build time?
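For reference, my assumption is that, since cgo honors the standard `CGO_CFLAGS`/`CGO_LDFLAGS` environment variables, something along these lines would be the mechanism. The library name and paths are guesses based on my AOCL install, not Ollama's documented build procedure:

```sh
# Point cgo at the AOCL headers/libraries and force the znver1 ISA
# (paths match my AOCL 5.0.0 install; adjust as needed)
export CGO_CFLAGS="-march=znver1 -I/root/aocl/5.0.0/aocc/include"
export CGO_LDFLAGS="-L/root/aocl/5.0.0/aocc/lib -lblis-mt"

# Build everything in the Ollama tree, runners included
go build ./...
```

But I don't know whether Ollama's build actually picks these up, or whether the BLAS backend even gets compiled in this way.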
p.s.: I'm not a Go developer :(