
Failed to summarize the long context #8401

Closed
@SDAIer

Description

What is the issue?

I have four A30 GPUs (24 GB each, 96 GB in total) and a piece of content with a 111k-token context. I tested three models that each support a 128k context window:

llama3.2:latest
llama3.1:8b
glm4:9b
Each model was run with num_ctx set to 121k. In testing, none of the three could successfully summarize the content (with a sufficiently small context, all three succeed).
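For reproduction, here is a minimal sketch of how num_ctx was applied, using Ollama's REST generate endpoint; the file name, prompt wording, and the exact num_ctx value are illustrative placeholders:

```python
import json
import urllib.request

# Ask one model to summarize a long document with an enlarged context
# window. 121000 stands in for the "121k" num_ctx described above;
# adjust the model name and value to match your setup.
long_text = open("document.txt", encoding="utf-8").read()  # ~111k tokens

payload = {
    "model": "llama3.1:8b",
    "prompt": "Summarize the following document:\n\n" + long_text,
    "stream": False,
    "options": {"num_ctx": 121000},  # context window size in tokens
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```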

Moreover, monitoring GPU usage with gpustat -i showed that only one of the three models spread its load across multiple GPUs; the other two ran on a single GPU each.
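A small sketch of the kind of per-GPU polling used here, based on pynvml and roughly equivalent to watching gpustat -i (the one-second poll interval is arbitrary):

```python
import time
from pynvml import (
    nvmlInit, nvmlShutdown, nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo,
    nvmlDeviceGetUtilizationRates,
)

# Poll each GPU once per second and print memory and utilization,
# to see whether a model's weights and KV cache are split across
# all four A30s or pinned to a single card.
nvmlInit()
try:
    n = nvmlDeviceGetCount()
    while True:
        for i in range(n):
            h = nvmlDeviceGetHandleByIndex(i)
            mem = nvmlDeviceGetMemoryInfo(h)
            util = nvmlDeviceGetUtilizationRates(h)
            print(f"GPU{i}: {mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB, "
                  f"{util.gpu}% util")
        print("-" * 40)
        time.sleep(1)
finally:
    nvmlShutdown()
```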

The Ollama logs show that every model repeatedly reloads the context content; this takes a long time and still ultimately fails, making for a poor user experience.

Could you please help analyze the logs to figure out why it always fails?

The logs for the three models are attached:
glm4.log
Llama 3.1 8B Instruct.log
Llama 3.2 3B Instruct.log

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

ollama version is 0.3.11
