Support loading concurrent model(s) on CPU when GPU is full #6950

Open
@Han-Huaqiao

Description

What is the issue?

I deployed the qwen2.5:72b-instruct-q6_K model, which occupies 4×3090 GPUs and about 75 GB of GPU memory in total. When I then use llama3:latest, it does not fall back to RAM and the CPU (755 GB / 128 cores); instead it unloads qwen2.5:72b-instruct-q6_K and loads llama3:latest onto the GPUs, even though qwen2.5:72b-instruct-q6_K is still in use at that time.
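
For reference, a minimal reproduction sketch against the Ollama REST API (assuming the server is on the default port 11434 and both models from the report are already pulled; the prompts are only illustrative):

```python
# Reproduction sketch: keep the large model busy, then request a second model.
import threading
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(model, prompt):
    # Non-streaming request; blocks until the model finishes responding.
    r = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    r.raise_for_status()
    print(model, "->", r.json()["response"][:80])

# Keep the 72B model busy with a long generation...
t = threading.Thread(
    target=generate,
    args=("qwen2.5:72b-instruct-q6_K", "Write a detailed essay about GPUs."),
)
t.start()

# ...then request a second model while the first is still running.
# Observed behavior: the server evicts qwen2.5:72b-instruct-q6_K from the GPUs
# to make room for llama3:latest instead of loading llama3 into RAM on the CPU.
generate("llama3:latest", "Hello")
t.join()
```

While the second request is in flight, `ollama ps` shows which models are currently loaded and whether they sit on GPU or CPU.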

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.10
