Module weight quantization #2000
Conversation
Codecov Report

@@            Coverage Diff             @@
##             main    #2000      +/-   ##
==========================================
- Coverage   84.45%   84.25%   -0.20%
==========================================
  Files         840      845       +5
  Lines      104346   105439    +1093
==========================================
+ Hits        88125    88838     +713
- Misses      16221    16601     +380

View full report in Codecov by Sentry.
Force-pushed from e07739d to b07f980.
Turns out the weights were still being automatically dequantized. In my tests with TinyLlama, running inference in f16 is slower when we have to dequantize before every op instead of already having the dequantized weights loaded. So until we introduce layers with ops supported in quantized types (e.g., int8), we should dequantize the weights. Once I figure out how this should be handled (as cleanly and explicitly as possible), I'll re-open for review.

/edit: I think the best way to go about this is documentation. I feel like adding a
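To make the tradeoff concrete, here is a small self-contained sketch in plain Rust (not Burn's API; the QuantizedWeights type and the dot helper are hypothetical) contrasting dequantizing once at load time with dequantizing before every op.

```rust
/// Hypothetical 8-bit affine-quantized weight buffer (illustration only).
struct QuantizedWeights {
    values: Vec<i8>,
    scale: f32,
    zero_point: i8,
}

impl QuantizedWeights {
    /// Affine dequantization: x = (q - zero_point) * scale.
    fn dequantize(&self) -> Vec<f32> {
        self.values
            .iter()
            .map(|&q| (q as i32 - self.zero_point as i32) as f32 * self.scale)
            .collect()
    }
}

/// Stand-in for a float-only op (e.g., the matmul inside a linear layer).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let weights = QuantizedWeights {
        values: vec![-128, 0, 64, 127],
        scale: 0.05,
        zero_point: 0,
    };

    // Dequantize once at load time, then reuse the float weights for every op.
    let loaded = weights.dequantize();
    for _ in 0..3 {
        let _ = dot(&loaded, &loaded);
    }

    // Dequantize before every op: pays the conversion cost each time, which is
    // what made f16 inference slower in the TinyLlama test described above.
    for _ in 0..3 {
        let per_op = weights.dequantize();
        let _ = dot(&per_op, &per_op);
    }
}
```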
LGTM, very clean.
Checklist
- The run-checks all script has been executed.

Changes
Static per-tensor module quantization support added with the quantize_weights method (see the usage sketch after this list):
- Quantizer, MinMaxCalibration and QuantizationScheme to define the quantization
- DType::QFloat tensors are handled as is (no conversion)
- QTensorOps to support quantized float tensors

Testing
Added unit tests for MinMaxCalibration.
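For orientation, here is a rough sketch of how these pieces might fit together. The type and method names (Quantizer, MinMaxCalibration, QuantizationScheme, quantize_weights) come from this PR, but the import paths, the Quantizer field layout, and the scheme/type variants shown below are assumptions and may not match the final API.

```rust
// Hedged sketch, not the confirmed API: statically quantize a module's weights
// with min-max calibration and a per-tensor int8 scheme.
use burn::module::{Module, Quantizer};
use burn::tensor::backend::Backend;
use burn::tensor::quantization::{MinMaxCalibration, QuantizationScheme, QuantizationType};

/// Replace a module's float weights with statically quantized (QFloat) tensors.
fn quantize_module<B: Backend, M: Module<B>>(module: M) -> M {
    let mut quantizer = Quantizer {
        // Min-max calibration: the quantization range is taken from each
        // tensor's observed min/max values (constructor shape is an assumption).
        calibration: MinMaxCalibration {},
        // Per-tensor affine int8 quantization (variant names are assumptions).
        scheme: QuantizationScheme::PerTensorAffine(QuantizationType::QInt8),
    };
    // Method added in this PR: walks the module and quantizes its weights,
    // leaving them as DType::QFloat tensors.
    module.quantize_weights(&mut quantizer)
}
```

Since ops on QFloat tensors currently dequantize on the fly (see the comment above), this mainly reduces the size of stored and loaded weights rather than speeding up inference.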