Support loading concurrent model(s) on CPU when GPU is full #6950
Open
Description
What is the issue?
I deployed the qwen2.5:72b-instruct-q6_K model, which occupies 4x RTX 3090s and about 75 GB of GPU memory in total. When I then run llama3:latest, Ollama does not load it into system RAM and run it on the CPU (755 GB RAM / 128 cores available); instead it unloads qwen2.5:72b-instruct-q6_K and loads llama3:latest onto the GPU, even though qwen2.5:72b-instruct-q6_K is still in use at that moment.
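As a possible interim workaround (not the automatic scheduling this issue asks for), the second model can be pinned to the CPU per request by setting `num_gpu` to 0 in the request options of the Ollama REST API. A minimal sketch, assuming the standard `POST /api/generate` endpoint and that the `num_gpu` option behaves the same on the version reported here:

```python
# Sketch of a possible per-request workaround, assuming the public Ollama
# REST API (POST /api/generate) and the num_gpu option; exact behavior on
# the reported version is an assumption.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate_on_cpu(model: str, prompt: str) -> str:
    """Run `model` with zero GPU-offloaded layers so it stays on the CPU,
    leaving the GPUs to the already-loaded large model."""
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"num_gpu": 0},  # 0 offloaded layers -> CPU-only inference
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    # qwen2.5:72b-instruct-q6_K keeps the 4x3090s; llama3 runs from system RAM.
    print(generate_on_cpu("llama3:latest", "Hello from the CPU"))
```

This still has to be requested explicitly on every call; the feature requested here is for the scheduler to fall back to CPU/RAM automatically when the GPUs are fully occupied by another model that is in use.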
OS
Linux
GPU
Nvidia
CPU
Intel
Ollama version
0.1.10