
Add Support for mixed quantized BitNet Architecture Inference #2683

Draft · wants to merge 5 commits into main
Conversation

JoseCarlosGarcia95

Introduction

Hello again! My name is José Carlos, and I am fully focused on advancing the capabilities of BitNet within the Candle project. BitNet remains a personal passion of mine due to its unique ability to balance performance and efficiency in language models.

This PR builds on my previous work by introducing advanced quantization support tailored specifically for BitNet models. My primary goal is to make BitNet as efficient and accessible as possible for research and real-world applications.


Changes Made

  1. Added Support for New Quantization Method q2_b0:

    • Implemented a double quantization strategy specifically for BitNet models:
      • BitLinear Layers: Quantized separately using the q2_b0 method.
      • Non-BitLinear Layers: Quantized independently to maximize overall model efficiency.
    • The q2_b0 method works by splitting the weight matrix into two smaller matrices containing only binary values (0 and 1), which significantly reduces storage and compute cost (a self-contained sketch of one possible encoding follows this list).
  2. Extended Quantization CLI:

    • Enhanced the CLI to support BitNet-specific quantization workflows:

    cargo run quantize ~/Downloads/Falcon3-1B-Instruct-1.58bit/model*.safetensors ~/Downloads/Falcon3-1B-Instruct-1.58bit/config.json --out-file ggml-model.gguf --quantization q4_0 -b --bitnet-quantization q2b0

  3. Support for Quantizing Models Directly in Candle:

    • Models can now be quantized directly within Candle. The quantizer now adds the required metadata during the quantization process, which was not possible before. This makes the workflow more seamless and eliminates the need for external tools to manage metadata.
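Below is a minimal, self-contained Rust sketch of how a ternary BitNet weight matrix can be split into two binary matrices, following the description in point 1 above. The names (BinaryPlanes, encode_ternary, decode_ternary) and the positive/negative-plane layout are illustrative assumptions, not the actual q2_b0 implementation in this PR:

    // Illustrative sketch only: one possible way to split a ternary BitNet
    // weight matrix {-1, 0, +1} into two binary (0/1) planes. The names and
    // layout here are hypothetical and not taken from the q2_b0 code in this PR.

    /// Two binary planes: `plus[i] == 1` where the weight is +1,
    /// `minus[i] == 1` where the weight is -1; a zero weight leaves both at 0.
    struct BinaryPlanes {
        plus: Vec<u8>,
        minus: Vec<u8>,
    }

    fn encode_ternary(weights: &[i8]) -> BinaryPlanes {
        let plus = weights.iter().map(|&w| (w > 0) as u8).collect();
        let minus = weights.iter().map(|&w| (w < 0) as u8).collect();
        BinaryPlanes { plus, minus }
    }

    /// Reconstruct the ternary weights: w = plus - minus.
    fn decode_ternary(p: &BinaryPlanes) -> Vec<i8> {
        p.plus
            .iter()
            .zip(p.minus.iter())
            .map(|(&a, &b)| a as i8 - b as i8)
            .collect()
    }

    fn main() {
        let w: Vec<i8> = vec![1, 0, -1, -1, 0, 1];
        let planes = encode_ternary(&w);
        assert_eq!(decode_ternary(&planes), w);
        println!("plus:  {:?}", planes.plus);
        println!("minus: {:?}", planes.minus);
    }

In a real kernel each plane would presumably be bit-packed (8 weights per byte, i.e. 2 bits per weight in total) and the matmul expressed as two masked accumulations; the PR's actual storage layout may differ.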

Known Limitations

  • GPU Support: The GPU implementation for the q2_b0 quantization method is currently under development. I welcome collaboration from anyone interested in accelerating this feature.
  • Focus Exclusively on BitNet: At this time, my contributions are exclusively focused on improving BitNet models. Other architectures are not within the scope of this PR.

Roadmap

  1. Finalize GPU Implementation for BitNet Quantization:

    • Work is ongoing to enable GPU acceleration for the q2_b0 quantization method.
  2. Optimize BitNet-Specific Quantization:

    • Investigate performance improvements to make BitNet quantization more efficient and scalable.
  3. Support Additional BitNet Models:

    • Expand compatibility to include more variations of BitNet architectures.

Testing

To test the quantized BitNet model, you can use the following command:

cargo run --example quantized-bitnet --release -- --model tensor-tools/ggml-model.gguf --verbose-prompt --prompt "chat" --temperature 0.01


Feedback and Collaboration

This PR is currently a draft, as I continue to develop and refine GPU support for BitNet. If you are interested in contributing or have feedback on the current implementation, I would love to hear from you.

Thank you for your time and support as I continue to focus solely on advancing BitNet in Candle! 😊
