Support loading concurrent model(s) on CPU when GPU is full #6950

Open
@Han-Huaqiao

Description

What is the issue?

I deployed the qwen2.5:72b-instruct-q6_K model, which occupies 4×3090 GPUs and about 75 GB of GPU memory in total. When I then use llama3:latest, it does not fall back to RAM and the CPU (755 GB / 128 cores); instead it unloads qwen2.5:72b-instruct-q6_K and loads llama3:latest onto the GPUs, even though qwen2.5:72b-instruct-q6_K is still in use at that time.
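
For reference, a minimal reproduction sketch against the Ollama REST API (assuming the server is on the default port 11434 and both models from the report are already pulled; the prompts are only illustrative):

```python
# Reproduction sketch: keep the large model busy, then request a second model.
import threading
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(model, prompt):
    # Non-streaming request; blocks until the model finishes responding.
    r = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    r.raise_for_status()
    print(model, "->", r.json()["response"][:80])

# Keep the 72B model busy with a long generation...
t = threading.Thread(
    target=generate,
    args=("qwen2.5:72b-instruct-q6_K", "Write a detailed essay about GPUs."),
)
t.start()

# ...then request a second model while the first is still running.
# Observed behavior: the server evicts qwen2.5:72b-instruct-q6_K from the GPUs
# to make room for llama3:latest instead of loading llama3 into RAM on the CPU.
generate("llama3:latest", "Hello")
t.join()
```

While the second request is in flight, `ollama ps` shows which models are currently loaded and whether they sit on GPU or CPU.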

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.10
