Vulkan Shader Refactor, Memory Debugging Option #7947
Conversation
…into vulkan-shaders directory
Haven't done tests, only skimmed through the changes.
Might have been better to use dashes in the .comp filenames for consistency with the rest of the codebase (e.g. ggml-cuda), but we can keep it as it is. The dash convention is in any case incompatible with the Python scripts, as they cannot work with dashes.
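For context, a minimal sketch of why dashed filenames are awkward here: the generation script embeds each compiled shader under an identifier derived from its filename, and '-' is not allowed in a C/C++ (or Python) identifier. The names below are illustrative assumptions, not the actual generated symbols.

```cpp
// Illustrative only (hypothetical names, not the real generated header):
// the shader generator embeds each SPIR-V blob under an identifier derived
// from the .comp filename, e.g. "mul_mat_vec_f16.comp" -> mul_mat_vec_f16_data.
#include <cstdio>

const unsigned char mul_mat_vec_f16_data[] = {0x03, 0x02, 0x23, 0x07}; // SPIR-V magic bytes, truncated
const unsigned int  mul_mat_vec_f16_len    = sizeof(mul_mat_vec_f16_data);

// A dashed filename such as "mul-mat-vec-f16.comp" has no direct mapping,
// because '-' is not valid in an identifier:
//   const unsigned char mul-mat-vec-f16_data[] = { ... };   // does not compile

int main() {
    std::printf("embedded %u bytes for mul_mat_vec_f16\n", mul_mat_vec_f16_len);
}
```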
I found and fixed a bug that caused more VRAM than necessary to be used; it was very noticeable with Llama 3 models. The memory debug option makes it much easier to debug memory issues with the backend.
@0cc4m Thank you for the update! Unfortunately, it didn't fix the RAM problem on my side. I tested with Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q6_K.gguf, but it still uses way too much RAM. Relevant part of logs with
It's blazing fast now compared to CLBlast (especially pp, very noticeable), but the model barely fits on my 16GB system.
Additionally, here's a report where the model simply stopped working (no GPU activity) with 8 GPU layers: vk_report_halted.txt
@MaggotHATE Thank you, that report shows that your high RAM use is not Vulkan host memory, since it only shows 80MB used. According to the log it should only be around 4.5GB of RAM in use. When I record the total RAM used by CPU, CUDA and Vulkan with your model and ngl 9 on my system, I get 6.8GB for CPU, 7GB for CUDA and 5.4GB for Vulkan. ngl 1000 leads to 0.8GB for CUDA and 1.2GB for Vulkan. I'm not sure how to reproduce your issue. What I did was run the new llama-cli with
Now that I remember,
Thank you, that's a good hint. I'll look into that. I'm reworking the backend anyway to solve #7575.
It seems that this change broke Vulkan support for some GPUs (#8092).
I can confirm that this PR breaks Vulkan for me, as mentioned in #8092 (using an Nvidia RTX 2060 (laptop) and running the model https://huggingface.co/TheBloke/airoboros-mistral2.2-7B-GGUF/blob/main/airoboros-mistral2.2-7b.Q4_K_S.gguf).
Here is the long-awaited extraction of the Vulkan shader code into separate files. This improves readability and makes them much easier to work with. Let me know if you have further suggestions on the structure.
For now it still uses the ggml_vk_generate_shaders.py script, but the plan is to move this into CMake/Make eventually (#5356). I also improved the debug output code and added a new memory debugging option (LLAMA_VULKAN_MEMORY_DEBUG) to get to the bottom of some inconsistencies in the RAM/VRAM use of the backend.
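To illustrate the kind of instrumentation such an option enables, here is a rough, self-contained sketch of an allocation logger behind a compile-time switch. All names here (VK_MEMORY_DEBUG, log_device_alloc, and so on) are made up for the example and are not the backend's actual symbols or build flags.

```cpp
// Rough sketch of a compile-time memory-debug switch: every device allocation
// goes through a helper that logs its size and a running total. Names are
// hypothetical, not the actual ggml-vulkan implementation.
#include <atomic>
#include <cstddef>
#include <cstdio>

#define VK_MEMORY_DEBUG 1  // imagine this define being set by the build system

static std::atomic<std::size_t> g_total_device_bytes{0};

static void log_device_alloc(const char * tag, std::size_t size) {
#if VK_MEMORY_DEBUG
    const std::size_t total = g_total_device_bytes.fetch_add(size) + size;
    std::fprintf(stderr, "vk alloc: %-12s %10zu bytes (running total: %zu)\n", tag, size, total);
#endif
}

int main() {
    // Two hypothetical buffer allocations being tracked.
    log_device_alloc("weights",  512u * 1024u * 1024u);
    log_device_alloc("kv_cache", 128u * 1024u * 1024u);
}
```

With logging like this on every allocation (and the matching frees), it becomes much easier to spot where reported RAM/VRAM diverges from what the backend actually requested.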