Add Support for mixed quantized BitNet Architecture Inference #2683
Introduction
Hello again! My name is José Carlos, and I am fully focused on advancing the capabilities of BitNet within the Candle project. BitNet remains a personal passion of mine due to its unique ability to balance performance and efficiency in language models.
This PR builds on my previous work by introducing advanced quantization support tailored specifically for BitNet models. My primary goal is to make BitNet as efficient and accessible as possible for research and real-world applications.
Changes Made
- Added Support for a New Quantization Method (q2_b0): the q2_b0 method works by splitting the weight matrix into two smaller matrices containing only binary values (0 and 1), which significantly reduces storage and compute cost (see the sketch after this list).
- Extended Quantization CLI: the quantize command now accepts a separate --bitnet-quantization flag alongside --quantization, so, for example, standard layers can be quantized with q4_0 while the BitNet weights use q2b0:

```bash
cargo run quantize ~/Downloads/Falcon3-1B-Instruct-1.58bit/model*.safetensors ~/Downloads/Falcon3-1B-Instruct-1.58bit/config.json --out-file ggml-model.gguf --quantization q4_0 -b --bitnet-quantization q2b0
```

- Support for Quantizing Models Directly in Candle.
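To make the q2_b0 idea concrete, here is a minimal, self-contained sketch of the split described above, assuming ternary BitNet weights in {-1, 0, 1} as suggested by the 1.58-bit model referenced in the commands. The names `split_ternary` and `matvec_split` are illustrative only and are not the PR's actual code; the real format presumably packs the bits compactly rather than storing whole bytes.

```rust
/// Illustrative only: split a ternary weight matrix W (values in {-1, 0, 1})
/// into two binary matrices P and N (values in {0, 1}) such that W = P - N.
fn split_ternary(weights: &[i8]) -> (Vec<u8>, Vec<u8>) {
    let pos: Vec<u8> = weights.iter().map(|&w| (w > 0) as u8).collect();
    let neg: Vec<u8> = weights.iter().map(|&w| (w < 0) as u8).collect();
    (pos, neg)
}

/// y = (P - N) x: the matrix-vector product reduces to additions and
/// subtractions of the input activations, with no multiplications.
fn matvec_split(pos: &[u8], neg: &[u8], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    (0..rows)
        .map(|r| {
            (0..cols)
                .map(|c| match (pos[r * cols + c], neg[r * cols + c]) {
                    (1, _) => x[c],
                    (_, 1) => -x[c],
                    _ => 0.0,
                })
                .sum()
        })
        .collect()
}

fn main() {
    // A 2x3 ternary weight matrix in row-major order.
    let w: [i8; 6] = [1, 0, -1, -1, 1, 0];
    let (pos, neg) = split_ternary(&w);
    let x = [0.5f32, -2.0, 1.5];
    let y = matvec_split(&pos, &neg, &x, 2, 3);
    println!("{y:?}"); // [-1.0, -2.5]
}
```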
Known Limitations
- The q2_b0 quantization method is currently under development. I welcome collaboration from anyone interested in accelerating this feature.

Roadmap
- Finalize GPU Implementation for BitNet Quantization: finish the GPU implementation for the q2_b0 quantization method.
- Optimize BitNet-Specific Quantization.
- Support Additional BitNet Models.
Testing
To test the quantized BitNet model, you can use the following command:
```bash
cargo run --example quantized-bitnet --release -- --model tensor-tools/ggml-model.gguf --verbose-prompt --prompt "chat" --temperature 0.01
```
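Separately from running generation, it can be useful to confirm which tensors ended up in which format. A small sketch along these lines reads the generated GGUF back with candle-core's gguf_file module and prints each tensor's dtype and shape. This assumes the Content/tensor_infos layout of recent candle-core releases, and the exact name under which the new q2_b0 dtype prints is an assumption until the PR settles.

```rust
use candle_core::quantized::gguf_file;

fn main() -> anyhow::Result<()> {
    // Path to the file produced by the quantize command above.
    let path = "tensor-tools/ggml-model.gguf";
    let mut file = std::fs::File::open(path)?;
    let content = gguf_file::Content::read(&mut file)?;
    // Print every tensor with its quantization dtype and shape, so the
    // q4_0 / q2b0 split across layers can be checked by eye.
    for (name, info) in content.tensor_infos.iter() {
        println!("{name}: {:?} {:?}", info.ggml_dtype, info.shape);
    }
    Ok(())
}
```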
Feedback and Collaboration
This PR is currently a draft, as I continue to develop and refine GPU support for BitNet. If you are interested in contributing or have feedback on the current implementation, I would love to hear from you.
Thank you for your time and support as I continue to focus solely on advancing BitNet in Candle! 😊