Vulkan Shader Refactor, Memory Debugging Option #7947

Merged: 5 commits into master on Jun 16, 2024

Conversation

@0cc4m (Collaborator) commented Jun 15, 2024

  • Self Reported Review Complexity:
    • Review Complexity : Low
    • Review Complexity : Medium
    • Review Complexity : High
  • I have read the contributing guidelines

Here is the long-awaited extraction of the Vulkan shader code into separate files. This improves readability and makes them much easier to work with. Let me know if you have further suggestions on the structure.

For now it still uses the ggml_vk_generate_shaders.py script, but the plan is to move this into CMake/Make eventually (#5356).

I also improved the debug output code and added a new memory debugging option (LLAMA_VULKAN_MEMORY_DEBUG) to get to the bottom of some inconsistencies in the RAM/VRAM usage of the backend.
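For context on how such an option is typically wired up, here is a minimal sketch, not the PR's actual implementation: the build option defines a preprocessor symbol, and the backend's allocation paths log each allocation only when that symbol is set. The GGML_VULKAN_MEMORY_DEBUG define name appears in the reports later in this thread; vk_log_alloc() and the call sites below are hypothetical.

```cpp
// Sketch of a compile-time memory-debug switch. GGML_VULKAN_MEMORY_DEBUG is
// the define referenced in this thread; vk_log_alloc() and the call sites
// are illustrative stand-ins, not the backend's real code.
#include <cstdio>
#include <cstddef>

static void vk_log_alloc(const char * what, size_t size) {
#ifdef GGML_VULKAN_MEMORY_DEBUG
    fprintf(stderr, "ggml_vulkan memory: allocated %s, %zu bytes\n", what, size);
#else
    (void) what; (void) size;  // logging compiled out in normal builds
#endif
}

int main() {
    // In the backend, calls like these would sit next to every device and
    // pinned-host allocation so that RAM/VRAM use can be accounted for.
    vk_log_alloc("device buffer", 64u * 1024 * 1024);
    vk_log_alloc("host (pinned) buffer", 8u * 1024 * 1024);
    return 0;
}
```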

The github-actions bot added the build (Compilation issues), Vulkan (Issues specific to the Vulkan backend) and python (python script changes) labels on Jun 15, 2024.
@ggerganov (Owner) left a comment

Haven't done tests, only skimmed through the changes.

Might have been better to use dashes in the .comp filenames for consistency with the rest of the codebase (e.g. ggml-cuda), but now we'll keep it as it is. This convention is already incompatible with the Python scripts anyway, since they cannot work with dashes.

@0cc4m (Collaborator, Author) commented Jun 16, 2024

I found and fixed a bug that caused more VRAM than necessary to be used; it was very noticeable in Llama 3 models. The memory debug option makes it much easier to debug memory issues in the backend.

0cc4m merged commit 7c7836d into master on Jun 16, 2024 (63 of 68 checks passed).
0cc4m deleted the 0cc4m/vulkan-shader-refactor branch on Jun 16, 2024 at 05:17.
@MaggotHATE (Contributor) commented

@0cc4m Thank you for the update! Unfortunately, it didn't fix the RAM problem on my side.

I tested with Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q6_K.gguf, but it still uses way too much RAM.

Relevant part of logs with GGML_VULKAN_MEMORY_DEBUG enabled: vk_report2.txt

It's blazing fast now compared to CLBlast (especially prompt processing, very noticeable), but the model barely fits on my 16GB system.

@MaggotHATE (Contributor) commented

Additionally, here's a report where the model simply stopped working (no GPU activity) with 8 GPU layers: vk_report_halted.txt

@0cc4m (Collaborator, Author) commented Jun 16, 2024

@MaggotHATE Thank you, that report shows that your high RAM use is not Vulkan host memory, since the log only shows 80MB of it in use. According to the log, only around 4.5GB of RAM should be used.

When I record the total RAM use of the CPU, CUDA and Vulkan backends with your model and -ngl 9 on my system, I get 6.8GB for CPU, 7GB for CUDA and 5.4GB for Vulkan. With -ngl 1000 I get 0.8GB for CUDA and 1.2GB for Vulkan. I'm not sure how to reproduce your issue.

What I did was run the new llama-cli with build/bin/llama-cli -f ~/llama.cpp/input.txt -c 2048 --ignore-eos -m models/Hermes-2-Pro-Llama-3-8B-Q6_K.gguf -ngl 9, wait until it started generating, and then check the process memory usage with smem -c "pid user command rss". Can you check what that reports for you?

@slaren (Collaborator) commented Jun 16, 2024

The CPU buffer appearing two times in llm_load_tensors is unexpected. I investigated why this happens, and it seems that ggml_backend_vk_host_buffer_type returns different values at different times during llm_load_tensors. I think this is because vk_instance.contexts[0].initialized is only set after the get_device_memory function is called, but there are calls to ggml_backend_vk_host_buffer_type before that happens, so it returns the cpu buffer type.
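For illustration, here is a simplified, self-contained sketch of the initialization-order issue described above; the struct and function names are stand-ins for vk_instance, ggml_backend_vk_host_buffer_type and get_device_memory, not the actual backend code:

```cpp
// Simplified stand-in for the reported behaviour: the host-buffer-type getter
// falls back to the CPU type until the context is marked initialized, and the
// flag is only set by a later call, so earlier callers see a different type.
#include <cstdio>

struct vk_context  { bool initialized = false; };
struct buffer_type { const char * name; };

static vk_context  ctx;                                   // stands in for vk_instance.contexts[0]
static buffer_type cpu_buft     = { "CPU" };
static buffer_type vk_host_buft = { "Vulkan_Host" };

// Stand-in for ggml_backend_vk_host_buffer_type(): returns the CPU type
// while the context has not been initialized yet.
static buffer_type * host_buffer_type() {
    return ctx.initialized ? &vk_host_buft : &cpu_buft;
}

// Stand-in for get_device_memory(), which is what ends up setting the flag.
static void get_device_memory() { ctx.initialized = true; }

int main() {
    printf("before init: %s\n", host_buffer_type()->name);  // prints "CPU"
    get_device_memory();
    printf("after init:  %s\n", host_buffer_type()->name);  // prints "Vulkan_Host"
    // If model loading queries the buffer type both before and after the
    // initializing call, it sees two different answers, which is consistent
    // with the CPU buffer showing up twice in the llm_load_tensors log.
    return 0;
}
```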

@MaggotHATE (Contributor) commented

> smem -c "pid user command rss"

Sorry, I didn't include more info: I use Windows 10, so I'm not sure that command is relevant for me. I have just checked llama-cli with your parameters, and the RAM issue is the same. I don't know how to report process memory usage in a detailed way on Windows.

Since I have the Vulkan SDK installed, maybe its Configurator would be of help?
[screenshot attached]

@MaggotHATE (Contributor) commented

> The CPU buffer appearing two times in llm_load_tensors is unexpected.

Now that I remember, InfinityKuno-2x7B.Q3_K_S.gguf doesn't give me this RAM problem. So I checked, and its report shows only one instance of llm_load_tensors: CPU buffer size: vk_report_MOE.txt

@0cc4m (Collaborator, Author) commented Jun 16, 2024

Thank you, that's a good hint. I'll look into it. I'm reworking the backend anyway to solve #7575.

@giladgd (Contributor) commented Jun 29, 2024

It seems that this change broke Vulkan support for some GPUs (#8092)

@LostRuins (Collaborator) commented

I can confirm that this PR breaks Vulkan for me, as mentioned in #8092 (using an Nvidia RTX 2060 laptop GPU and running the model https://huggingface.co/TheBloke/airoboros-mistral2.2-7B-GGUF/blob/main/airoboros-mistral2.2-7b.Q4_K_S.gguf).

0cc4m mentioned this pull request on Jul 14, 2024.