Vulkan Shader Refactor, Memory Debugging Option #7947

Merged: 5 commits into master on Jun 16, 2024

Conversation

@0cc4m (Collaborator) commented Jun 15, 2024

  • Self Reported Review Complexity:
    • Review Complexity : Low
    • Review Complexity : Medium
    • Review Complexity : High
  • I have read the contributing guidelines

Here is the long-awaited extraction of the Vulkan shader code into separate files. This improves readability and makes them much easier to work with. Let me know if you have further suggestions on the structure.

For now it still uses the ggml_vk_generate_shaders.py script, but the plan is to move this into CMake/Make eventually (#5356).

I also improved the debug output code and added a new memory debugging option (LLAMA_VULKAN_MEMORY_DEBUG) to get to the bottom of some inconsistencies in the RAM/VRAM usage of the backend.
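For context on how such an option is typically wired up, here is a minimal sketch, not the PR's actual implementation: the build option defines a preprocessor symbol, and the backend's allocation paths log each allocation only when that symbol is set. The GGML_VULKAN_MEMORY_DEBUG define name appears in the reports later in this thread; vk_log_alloc() and the call sites below are hypothetical.

```cpp
// Sketch of a compile-time memory-debug switch. GGML_VULKAN_MEMORY_DEBUG is
// the define referenced in this thread; vk_log_alloc() and the call sites
// are illustrative stand-ins, not the backend's real code.
#include <cstdio>
#include <cstddef>

static void vk_log_alloc(const char * what, size_t size) {
#ifdef GGML_VULKAN_MEMORY_DEBUG
    fprintf(stderr, "ggml_vulkan memory: allocated %s, %zu bytes\n", what, size);
#else
    (void) what; (void) size;  // logging compiled out in normal builds
#endif
}

int main() {
    // In the backend, calls like these would sit next to every device and
    // pinned-host allocation so that RAM/VRAM use can be accounted for.
    vk_log_alloc("device buffer", 64u * 1024 * 1024);
    vk_log_alloc("host (pinned) buffer", 8u * 1024 * 1024);
    return 0;
}
```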

The github-actions bot added the build (Compilation issues), Vulkan (Issues specific to the Vulkan backend) and python (python script changes) labels on Jun 15, 2024.
@ggerganov (Owner) left a comment

Haven't done tests, only skimmed through the changes.

Might have been better to use dashes in the .comp filenames for consistency with the rest of the codebase (e.g. ggml-cuda), but now we'll keep it as it is. This convention is already incompatible with the Python scripts anyway, since they cannot work with dashes.

@0cc4m (Collaborator, Author) commented Jun 16, 2024

I found and fixed a bug that caused more VRAM than necessary to be used; it was very noticeable in Llama 3 models. The memory debug option makes it much easier to debug memory issues in the backend.

0cc4m merged commit 7c7836d into master on Jun 16, 2024 (63 of 68 checks passed).
0cc4m deleted the 0cc4m/vulkan-shader-refactor branch on Jun 16, 2024 at 05:17.
@MaggotHATE (Contributor) commented

@0cc4m Thank you for the update! Unfortunately, it didn't fix the RAM problem on my side.

I tested with Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q6_K.gguf, but it still uses way too much RAM.

Relevant part of logs with GGML_VULKAN_MEMORY_DEBUG enabled: vk_report2.txt

It's blazing fast now compared to CLBlast (especially prompt processing, very noticeable), but the model barely fits on my 16GB system.

@MaggotHATE (Contributor) commented

Additionally, here's a report where the model simply stopped working (no GPU activity) with 8 GPU layers: vk_report_halted.txt

@0cc4m (Collaborator, Author) commented Jun 16, 2024

@MaggotHATE Thank you, that report shows that your high RAM use is not Vulkan host memory, since the log only shows 80MB of it in use. According to the log, only around 4.5GB of RAM should be used.

When I record the total RAM use of the CPU, CUDA and Vulkan backends with your model and -ngl 9 on my system, I get 6.8GB for CPU, 7GB for CUDA and 5.4GB for Vulkan. With -ngl 1000 I get 0.8GB for CUDA and 1.2GB for Vulkan. I'm not sure how to reproduce your issue.

What I did was run the new llama-cli with build/bin/llama-cli -f ~/llama.cpp/input.txt -c 2048 --ignore-eos -m models/Hermes-2-Pro-Llama-3-8B-Q6_K.gguf -ngl 9, wait until it started generating, and then check the process memory usage with smem -c "pid user command rss". Can you check what that reports for you?

@slaren (Collaborator) commented Jun 16, 2024

The CPU buffer appearing two times in llm_load_tensors is unexpected. I investigated why this happens, and it seems that ggml_backend_vk_host_buffer_type returns different values at different times during llm_load_tensors. I think this is because vk_instance.contexts[0].initialized is only set after the get_device_memory function is called, but there are calls to ggml_backend_vk_host_buffer_type before that happens, so it returns the cpu buffer type.
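For illustration, here is a simplified, self-contained sketch of the initialization-order issue described above; the struct and function names are stand-ins for vk_instance, ggml_backend_vk_host_buffer_type and get_device_memory, not the actual backend code:

```cpp
// Simplified stand-in for the reported behaviour: the host-buffer-type getter
// falls back to the CPU type until the context is marked initialized, and the
// flag is only set by a later call, so earlier callers see a different type.
#include <cstdio>

struct vk_context  { bool initialized = false; };
struct buffer_type { const char * name; };

static vk_context  ctx;                                   // stands in for vk_instance.contexts[0]
static buffer_type cpu_buft     = { "CPU" };
static buffer_type vk_host_buft = { "Vulkan_Host" };

// Stand-in for ggml_backend_vk_host_buffer_type(): returns the CPU type
// while the context has not been initialized yet.
static buffer_type * host_buffer_type() {
    return ctx.initialized ? &vk_host_buft : &cpu_buft;
}

// Stand-in for get_device_memory(), which is what ends up setting the flag.
static void get_device_memory() { ctx.initialized = true; }

int main() {
    printf("before init: %s\n", host_buffer_type()->name);  // prints "CPU"
    get_device_memory();
    printf("after init:  %s\n", host_buffer_type()->name);  // prints "Vulkan_Host"
    // If model loading queries the buffer type both before and after the
    // initializing call, it sees two different answers, which is consistent
    // with the CPU buffer showing up twice in the llm_load_tensors log.
    return 0;
}
```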

@MaggotHATE (Contributor) commented

> smem -c "pid user command rss"

Sorry, I didn't include more info: I use Windows 10, so I'm not sure that command is relevant for me. I have just checked llama-cli with your parameters, and the RAM issue is the same. I don't know how to report process memory usage in a detailed way on Windows.

Since I have the Vulkan SDK installed, maybe its Configurator would be of help?
[screenshot attached]

@MaggotHATE (Contributor) commented

> The CPU buffer appearing two times in llm_load_tensors is unexpected.

Now that I remember, InfinityKuno-2x7B.Q3_K_S.gguf doesn't give me this RAM problem. So I checked, and its report shows only one instance of llm_load_tensors: CPU buffer size: vk_report_MOE.txt

@0cc4m (Collaborator, Author) commented Jun 16, 2024

Thank you, that's a good hint. I'll look into it. I'm reworking the backend anyway to solve #7575.

@giladgd (Contributor) commented Jun 29, 2024

It seems that this change broke Vulkan support for some GPUs (#8092)

@LostRuins (Collaborator) commented

I can confirm that this PR breaks Vulkan for me, as mentioned in #8092 (using an Nvidia RTX 2060 laptop GPU and running the model https://huggingface.co/TheBloke/airoboros-mistral2.2-7B-GGUF/blob/main/airoboros-mistral2.2-7b.Q4_K_S.gguf).

0cc4m mentioned this pull request on Jul 14, 2024.