Description
What is the issue?
I have 4 × A30 GPUs (24 GB each) and a piece of content that amounts to roughly 111k tokens of context. I tested 3 models that support a 128k context window:
llama3.2:latest
llama3.1:8b
glm4:9b
Each model was run with the parameter num_ctx set to 121k. None of the three models could successfully summarize the content (if the context is sufficiently small, all three succeed).
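For reference, this is roughly how the requests were issued; a minimal sketch via the Ollama REST API, where the file name, prompt wording, and exact num_ctx value are illustrative (the real prompt is the ~111k-token document):

```python
import requests

# Minimal sketch (assumed details): pass num_ctx per request through the
# Ollama REST API. The same call was repeated for llama3.2:latest and glm4:9b.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": "Summarize the following document:\n" + open("content.txt").read(),
        "stream": False,
        "options": {"num_ctx": 121000},  # ~121k context window
    },
    timeout=3600,  # the long context takes a long time to process
)
print(resp.json().get("response", resp.text))
```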
Moreover, monitoring GPU usage with gpustat -i showed that only one of the models spread its load across multiple GPUs; the other two models used only a single GPU.
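In case it helps reproduce that observation, here is a small polling script roughly equivalent to what gpustat -i reports (per-GPU memory and utilization, once per second; the polling interval and output format are my own choices):

```python
import time
import pynvml  # pip install nvidia-ml-py

# Print per-GPU memory use and utilization once per second (Ctrl-C to stop),
# so it is visible whether a model spreads across all 4 A30s or stays on one.
pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
while True:
    for i in range(count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"GPU{i}: {mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB, util {util.gpu}%")
    print("---")
    time.sleep(1)
```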
The Ollama logs show that almost every model repeatedly reloads and reprocesses the context, which takes a long time and ultimately fails, resulting in a poor user experience.
Could you please help analyze the logs to figure out why it always fails?
The logs for the three models are attached below:
glm4.log
Llama 3.1 8B Instruct.log
Llama 3.2 3B Instruct.log
OS
Linux
GPU
Nvidia
CPU
Intel
Ollama version
ollama version is 0.3.11