Module weight quantization #2000
Conversation
Codecov Report

@@            Coverage Diff             @@
##             main    #2000      +/-   ##
==========================================
- Coverage   84.45%   84.25%   -0.20%
==========================================
  Files         840      845       +5
  Lines      104346   105439    +1093
==========================================
+ Hits        88125    88838     +713
- Misses      16221    16601     +380

View full report in Codecov by Sentry.
Force-pushed from e07739d to b07f980.
Turns out the weights were still being automatically dequantized. In my tests with TinyLlama, running inference in f16 is slower when we have to dequantize before every op instead of already having the dequantized weights loaded. So until we introduce layers with ops supported in quantized types (e.g., int8), we should dequantize the weights. Once I figure out how this should be handled (as cleanly and explicitly as possible), I'll re-open for review.

/edit: I think the best way to go about this is documentation. I feel like adding a
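To make the tradeoff concrete, here is a small self-contained sketch in plain Rust (not Burn's API; the QuantizedWeights type and the dot helper are hypothetical) contrasting dequantizing once at load time with dequantizing before every op.

```rust
/// Hypothetical 8-bit affine-quantized weight buffer (illustration only).
struct QuantizedWeights {
    values: Vec<i8>,
    scale: f32,
    zero_point: i8,
}

impl QuantizedWeights {
    /// Affine dequantization: x = (q - zero_point) * scale.
    fn dequantize(&self) -> Vec<f32> {
        self.values
            .iter()
            .map(|&q| (q as i32 - self.zero_point as i32) as f32 * self.scale)
            .collect()
    }
}

/// Stand-in for a float-only op (e.g., the matmul inside a linear layer).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let weights = QuantizedWeights {
        values: vec![-128, 0, 64, 127],
        scale: 0.05,
        zero_point: 0,
    };

    // Dequantize once at load time, then reuse the float weights for every op.
    let loaded = weights.dequantize();
    for _ in 0..3 {
        let _ = dot(&loaded, &loaded);
    }

    // Dequantize before every op: pays the conversion cost each time, which is
    // what made f16 inference slower in the TinyLlama test described above.
    for _ in 0..3 {
        let per_op = weights.dequantize();
        let _ = dot(&per_op, &per_op);
    }
}
```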
LGTM, very clean.
Checklist
- The run-checks all script has been executed.

Changes
Static per-tensor module quantization support added with the quantize_weights method (see the usage sketch after this list):
- Quantizer, MinMaxCalibration and QuantizationScheme to define the quantization
- DType::QFloat tensors are handled as is (no conversion)
- QTensorOps to support quantized float tensors

Testing
Added unit tests for MinMaxCalibration.
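For orientation, here is a rough sketch of how these pieces might fit together. The type and method names (Quantizer, MinMaxCalibration, QuantizationScheme, quantize_weights) come from this PR, but the import paths, the Quantizer field layout, and the scheme/type variants shown below are assumptions and may not match the final API.

```rust
// Hedged sketch, not the confirmed API: statically quantize a module's weights
// with min-max calibration and a per-tensor int8 scheme.
use burn::module::{Module, Quantizer};
use burn::tensor::backend::Backend;
use burn::tensor::quantization::{MinMaxCalibration, QuantizationScheme, QuantizationType};

/// Replace a module's float weights with statically quantized (QFloat) tensors.
fn quantize_module<B: Backend, M: Module<B>>(module: M) -> M {
    let mut quantizer = Quantizer {
        // Min-max calibration: the quantization range is taken from each
        // tensor's observed min/max values (constructor shape is an assumption).
        calibration: MinMaxCalibration {},
        // Per-tensor affine int8 quantization (variant names are assumptions).
        scheme: QuantizationScheme::PerTensorAffine(QuantizationType::QInt8),
    };
    // Method added in this PR: walks the module and quantizes its weights,
    // leaving them as DType::QFloat tensors.
    module.quantize_weights(&mut quantizer)
}
```

Since ops on QFloat tensors currently dequantize on the fly (see the comment above), this mainly reduces the size of stored and loaded weights rather than speeding up inference.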