Description
What is the issue?
I have 4 × A30 GPUs (24 GB each) and a piece of content that amounts to roughly 111k tokens of context. I tested 3 models that support a 128k context window:
llama3.2:latest
llama3.1:8b
glm4:9b
Each model was run with the parameter num_ctx set to 121k. None of the three models could successfully summarize the content (if the context is sufficiently small, all three succeed).
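For reference, this is roughly how the requests were issued; a minimal sketch via the Ollama REST API, where the file name, prompt wording, and exact num_ctx value are illustrative (the real prompt is the ~111k-token document):

```python
import requests

# Minimal sketch (assumed details): pass num_ctx per request through the
# Ollama REST API. The same call was repeated for llama3.2:latest and glm4:9b.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": "Summarize the following document:\n" + open("content.txt").read(),
        "stream": False,
        "options": {"num_ctx": 121000},  # ~121k context window
    },
    timeout=3600,  # the long context takes a long time to process
)
print(resp.json().get("response", resp.text))
```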
Moreover, monitoring GPU usage with gpustat -i showed that only one of the models spread its load across multiple GPUs; the other two models used only a single GPU.
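In case it helps reproduce that observation, here is a small polling script roughly equivalent to what gpustat -i reports (per-GPU memory and utilization, once per second; the polling interval and output format are my own choices):

```python
import time
import pynvml  # pip install nvidia-ml-py

# Print per-GPU memory use and utilization once per second (Ctrl-C to stop),
# so it is visible whether a model spreads across all 4 A30s or stays on one.
pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
while True:
    for i in range(count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"GPU{i}: {mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB, util {util.gpu}%")
    print("---")
    time.sleep(1)
```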
The Ollama logs show that almost every model repeatedly reloads and reprocesses the context, which takes a long time and ultimately fails, resulting in a poor user experience.
Could you please help analyze the logs to figure out why it always fails?
The logs for the three models are attached below:
glm4.log
Llama 3.1 8B Instruct.log
Llama 3.2 3B Instruct.log
OS
Linux
GPU
Nvidia
CPU
Intel
Ollama version
ollama version is 0.3.11