llama3.2-vision doesn't utilize my GPU. #8310
Maybe not enough free VRAM on your system, depending on what else you are running. The output of `nvidia-smi` will show what is using it.
this is my smi:
Server logs will aid in identifying the cause.
this is my log:
Having this same issue as well. Latest version of Ollama and an NVIDIA GTX 1650 SUPER graphics card. (Yes, I know it isn't powerful, this is just for a POC.) Here are my logs when I run `ollama serve` and then, in another window, `ollama run llama3.2-vision`. If I can provide any other information that would help, please let me know.

```
2025/01/06 22:07:11 routes.go:1259: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\Users\chand\.ollama\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
```
When ollama started, there was 10.8G of free VRAM. When it came time to load a model, something else was running and only 3.5G was free. The llama3.2-vision model won't fit, so it loads it into RAM.
The model is unloaded after 5 minutes and then a bit later another request comes in for the model. This time there is 10.5G available and ollama does a partial load (33 of 41 layers) into the GPU.

```
NAME                     ID              SIZE     PROCESSOR    UNTIL
llama3.2-vision:latest   085a1fdae525    12 GB    100% GPU     Forever
```

Your GPU is too small to host the entire model, and other GPU users are occasionally taking VRAM to the point where ollama can't even do a partial load.
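If it's unclear whether a model ended up in VRAM or in system RAM, the server can also be asked directly. Below is a minimal sketch (mine, not from the thread) that queries Ollama's `/api/ps` endpoint and compares the reported total size with the portion resident in VRAM; the host and port assume a default local install.

```python
# Minimal sketch: report how much of each loaded model sits in VRAM.
# Assumes a default local Ollama install at 127.0.0.1:11434.
import requests

resp = requests.get("http://127.0.0.1:11434/api/ps", timeout=10)
resp.raise_for_status()

for model in resp.json().get("models", []):
    size = model.get("size", 0)            # total bytes used by the model
    size_vram = model.get("size_vram", 0)  # bytes resident in GPU memory
    pct = 100 * size_vram / size if size else 0
    print(f"{model['name']}: {size / 2**30:.1f} GiB total, {pct:.0f}% in VRAM")
```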
So I don't have enough VRAM?
Correct.
Is there a way to get around this?
Stop other applications from using the GPU. https://www.google.com/search?q=windows+switch+default+gpu+to+integrated
Is there any other way to use less VRAM, like a low-VRAM mode?
There are two components that take up VRAM: context and weights. The usual ways of reducing context size only shrink the first, smaller component. One last alternative would be to force llama.cpp to load all layers into VRAM and then have the GPU overflow to RAM, rather than having ollama decide on the RAM allocation. This will maximize VRAM usage at the cost of a performance penalty for the layers residing in RAM. However, because you can almost fit the model in VRAM, only a few layers will spill into RAM, and the penalty might not be noticeable. You can force this by setting the `num_gpu` parameter to the full layer count.
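As a concrete illustration of that last suggestion, here is a minimal sketch (my own, not from the thread) that passes `num_gpu` as a request option through Ollama's `/api/generate` endpoint; the layer count of 41 comes from the log discussed above, and the model name and host are assumptions for a default local setup.

```python
# Minimal sketch: force all transformer layers onto the GPU by setting the
# num_gpu option, letting the driver spill any overflow into system RAM.
# The layer count (41), model name, and host are assumptions for this setup.
import requests

payload = {
    "model": "llama3.2-vision",
    "prompt": "Describe the colour of a clear daytime sky in one sentence.",
    "stream": False,
    "options": {"num_gpu": 41},  # 41 layers total, per the partial-load log above
}

resp = requests.post("http://127.0.0.1:11434/api/generate", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["response"])
```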
Your GPU has 3.2G free. Just the projector (2.8G) and context space (656M) add up to more than this, so there is no way to run llama3.2-vision on your GPU, even with the `num_gpu` workaround above.
Why is my GPU so full? I just built this PC a week ago. Will the full log of nvidia-smi help identify what takes all of the VRAM?
I'm not a Windows user, so the fine details of process usage escape me. Try this for help: https://saturncloud.io/blog/how-to-find-and-limit-gpu-usage-by-process-in-windows/#finding-gpu-usage-by-process
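If the linked article is hard to follow, the same per-process information can be pulled with a short script. A minimal sketch, assuming the `nvidia-ml-py` (pynvml) package is installed; note that on Windows process names can still come back truncated and per-process memory may be unavailable under WDDM.

```python
# Minimal sketch: list free/used VRAM and the processes using GPU 0 via NVML.
# Install the bindings first: pip install nvidia-ml-py
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"total {mem.total / 2**20:.0f} MiB, used {mem.used / 2**20:.0f} MiB, "
      f"free {mem.free / 2**20:.0f} MiB")

# Graphics processes cover most desktop applications; compute processes cover CUDA apps.
procs = (pynvml.nvmlDeviceGetGraphicsRunningProcesses(handle)
         + pynvml.nvmlDeviceGetComputeRunningProcesses(handle))
for p in procs:
    try:
        name = pynvml.nvmlSystemGetProcessName(p.pid)
        if isinstance(name, bytes):
            name = name.decode()
    except pynvml.NVMLError:
        name = "<unknown>"
    used_mib = (p.usedGpuMemory or 0) / 2**20
    print(f"pid {p.pid}: {name} uses {used_mib:.0f} MiB")

pynvml.nvmlShutdown()
```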
I really didn't understand this page. If someone can understand this and help me, it would be great.

```
Tue Jan 7 16:34:55 2025
+-----------------------------------------------------------------------------------------+
```
Unfortunately this is not a really useful output, as it doesn't contain the VRAM usage and the process names are incomplete, so it's not possible to identify the large users of VRAM. But there may be low-hanging fruit. Does your machine have an integrated graphics processor? If so, it may be possible to set that as the default GPU for the system in the BIOS, so that when Windows starts it doesn't allocate VRAM from the 4070. The alternative is to set the preferred GPU on a program-by-program basis as discussed here.
I completely disabled the integrated graphics. Could that be the cause? I disabled it because I thought Ollama might use my integrated graphics instead of my GPU.
Ollama will not use integrated graphics; there is very little support for those types of GPUs. Enable it and make it the default.
Your RAM should be sufficient. This is really strange. I found conflicting statements about your available VRAM in your logfile #8310 (comment) just a second apart:
I observed similar behavior to your 4070 with two GTX 1060 6GB cards. Can you try other models (like phi4) that should fit into your VRAM, to narrow down this behavior? You have more than 10 GB free.
I checked my logfile and got a statement:
Similar behavior was noticed with 6 GB VRAM graphics cards in November 2024: #7509. It works with my 8 GB card, and the problem described here occurs with more than 10 GB of available VRAM.
It's sufficient if there are no other processes using the GPU. Switching to integrated graphics will help.
These are 67 seconds apart.
Vision models have extra requirements that make it harder to fit them in limited VRAM, as discussed in #7509 (comment).
Thanks @rick-github for the feedback and for double-checking my comment. Sorry for the mistake, I should learn how to read the time! I tested this scenario again, and I'm not sure if llama3.2-vision will fit entirely into 12 GB of VRAM. Using the integrated graphics might be the only way, as pointed out by others above.

First I tried to run llama3.2-vision just on the CPU, by setting `num_gpu` to 0. On another system with an 8 GB card and a 6 GB card I got llama3.2-vision almost entirely into VRAM, with just 4% still processed by the CPU. It resulted in 15 token/s. Following the advice given in this thread I switched to the iGPU of my processor and gained a few megabytes on the larger card, and finally got 100% GPU utilization. The responsiveness increased by 55% to 23.3 token/s! That's the reward for having all layers in the fast GPU memory. Here I checked the combined utilization of the GPUs.

One thing I still don't understand is how the memory requirements, combined with the projector, add up to something very close to 8 GB, so any system with graphics cards smaller than 8 GB might not even split the model to use the combined VRAM. It was already stated that the vision model is unique in this regard and needs one continuous chunk of memory to operate. The logfile states:
I can't see how 1.8 + 2.8 results in something like 6.837 GiB, even if I add the 656 MiB for kv. Can someone explain the math to me? When using the system with only one 8 GB card, the logfile (see above) states that only 7 of the 41 layers were offloaded to the GPU:
This seems to be the "minimum pieces of the model that have to be loaded in VRAM in their entirety for anything to run on the GPU" that @jessegross mentioned in issue #7509 on November 6, 2024. #7509 (comment)
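For anyone who wants to reproduce throughput comparisons like the 15 vs 23.3 token/s figures above without eyeballing the output, the generate response already carries the needed counters. A minimal sketch (an assumption of mine, not necessarily how the commenter measured), using the `eval_count` and `eval_duration` fields returned by `/api/generate` on a default local install:

```python
# Minimal sketch: compute generation throughput from the eval metrics that
# Ollama returns with a non-streaming /api/generate call.
import requests

payload = {
    "model": "llama3.2-vision",  # placeholder; use whichever model you are benchmarking
    "prompt": "Explain in two sentences why a vision model needs a projector.",
    "stream": False,
}

resp = requests.post("http://127.0.0.1:11434/api/generate", json=payload, timeout=600)
resp.raise_for_status()
data = resp.json()

tokens = data["eval_count"]            # generated tokens
seconds = data["eval_duration"] / 1e9  # eval_duration is reported in nanoseconds
print(f"{tokens} tokens in {seconds:.1f} s -> {tokens / seconds:.1f} token/s")
```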
What is the issue?
I bought a new PC with a 4070 Super to do some AI tasks using Ollama, but when I tried to run llama3.2-vision it didn't utilize my GPU, only my CPU. llama3.2 does utilize my GPU, so why is that? Thank you.
OS
Windows
GPU
Nvidia
CPU
AMD
Ollama version
0.5.4