
Add Support for mixed quantized BitNet Architecture Inference #2683

Draft · wants to merge 5 commits into main
Conversation

JoseCarlosGarcia95

Introduction

Hello again! My name is José Carlos, and I am fully focused on advancing the capabilities of BitNet within the Candle project. BitNet remains a personal passion of mine due to its unique ability to balance performance and efficiency in language models.

This PR builds on my previous work by introducing advanced quantization support tailored specifically for BitNet models. My primary goal is to make BitNet as efficient and accessible as possible for research and real-world applications.


Changes Made

  1. Added Support for New Quantization Method q2_b0:

    • Implemented a double quantization strategy specifically for BitNet models:
      • BitLinear Layers: Quantized separately using the q2_b0 method.
      • Non-BitLinear Layers: Quantized independently to maximize overall model efficiency.
    • The q2_b0 method works by splitting the weight matrix into two smaller matrices containing only binary values (0 and 1), which significantly reduces storage and compute cost (a self-contained sketch of one possible encoding follows this list).
  2. Extended Quantization CLI:

    • Enhanced the CLI to support BitNet-specific quantization workflows:

    cargo run quantize ~/Downloads/Falcon3-1B-Instruct-1.58bit/model*.safetensors ~/Downloads/Falcon3-1B-Instruct-1.58bit/config.json --out-file ggml-model.gguf --quantization q4_0 -b --bitnet-quantization q2b0

  3. Support for Quantizing Models Directly in Candle:

    • Models can now be quantized directly within Candle. The quantizer now adds the required metadata during the quantization process, which was not possible before. This makes the workflow more seamless and eliminates the need for external tools to manage metadata.
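Below is a minimal, self-contained Rust sketch of how a ternary BitNet weight matrix can be split into two binary matrices, following the description in point 1 above. The names (BinaryPlanes, encode_ternary, decode_ternary) and the positive/negative-plane layout are illustrative assumptions, not the actual q2_b0 implementation in this PR:

    // Illustrative sketch only: one possible way to split a ternary BitNet
    // weight matrix {-1, 0, +1} into two binary (0/1) planes. The names and
    // layout here are hypothetical and not taken from the q2_b0 code in this PR.

    /// Two binary planes: `plus[i] == 1` where the weight is +1,
    /// `minus[i] == 1` where the weight is -1; a zero weight leaves both at 0.
    struct BinaryPlanes {
        plus: Vec<u8>,
        minus: Vec<u8>,
    }

    fn encode_ternary(weights: &[i8]) -> BinaryPlanes {
        let plus = weights.iter().map(|&w| (w > 0) as u8).collect();
        let minus = weights.iter().map(|&w| (w < 0) as u8).collect();
        BinaryPlanes { plus, minus }
    }

    /// Reconstruct the ternary weights: w = plus - minus.
    fn decode_ternary(p: &BinaryPlanes) -> Vec<i8> {
        p.plus
            .iter()
            .zip(p.minus.iter())
            .map(|(&a, &b)| a as i8 - b as i8)
            .collect()
    }

    fn main() {
        let w: Vec<i8> = vec![1, 0, -1, -1, 0, 1];
        let planes = encode_ternary(&w);
        assert_eq!(decode_ternary(&planes), w);
        println!("plus:  {:?}", planes.plus);
        println!("minus: {:?}", planes.minus);
    }

In a real kernel each plane would presumably be bit-packed (8 weights per byte, i.e. 2 bits per weight in total) and the matmul expressed as two masked accumulations; the PR's actual storage layout may differ.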

Known Limitations

  • GPU Support: The GPU implementation for the q2_b0 quantization method is currently under development. I welcome collaboration from anyone interested in accelerating this feature.
  • Focus Exclusively on BitNet: At this time, my contributions are exclusively focused on improving BitNet models. Other architectures are not within the scope of this PR.

Roadmap

  1. Finalize GPU Implementation for BitNet Quantization:

    • Work is ongoing to enable GPU acceleration for the q2_b0 quantization method.
  2. Optimize BitNet-Specific Quantization:

    • Investigate performance improvements to make BitNet quantization more efficient and scalable.
  3. Support Additional BitNet Models:

    • Expand compatibility to include more variations of BitNet architectures.

Testing

To test the quantized BitNet model, you can use the following command:

cargo run --example quantized-bitnet --release -- --model tensor-tools/ggml-model.gguf --verbose-prompt --prompt "chat" --temperature 0.01


Feedback and Collaboration

This PR is currently a draft, as I continue to develop and refine GPU support for BitNet. If you are interested in contributing or have feedback on the current implementation, I would love to hear from you.

Thank you for your time and support as I continue to focus solely on advancing BitNet in Candle! 😊
