
[Quantization] Switch to optimum-quanto #31732

Merged: 6 commits into main from update-quanto-library on Oct 2, 2024
Conversation

@SunMarc (Member) commented on Jul 1, 2024:

What does this PR do?

This PR updates the quanto library package, since we moved it under optimum. The new package name is optimum-quanto, and we now need to import from optimum.quanto instead of quanto. quanto will be deprecated in v4.47, as suggested by Amy!

Tests are passing! We only check for optimum-quanto now, but I checked locally that there were no issues with quanto.
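
For reference, a minimal sketch of the import change (assumes optimum-quanto is installed; quantize and qint8 are just illustrative symbols from the quanto API):

# Old package:  pip install quanto          ->  from quanto import quantize, qint8
# New package:  pip install optimum-quanto  ->  from optimum.quanto import quantize, qint8
from optimum.quanto import quantize, qint8  # the new import path this PR switches to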

@SunMarc SunMarc requested review from amyeroberts and dacorvo July 1, 2024 15:59
@amyeroberts (Collaborator) left a comment:

Thanks for handling this!

Do we have numbers on the use of quanto? This will determine how cautious we should be.

I'd suggest instead having a fallback for one version cycle, e.g. for imports:

if is_optimum_quanto_available():
    from optimum.quanto import y
elif is_quanto_available():
    warnings.warn("Importing from quanto will be deprecated in v4.44. Please install optimum-quanto instead: `pip install optimum-quanto`")
    from quanto import y

and update the quanto check:

def is_quanto_available():
    warnings.warn("....")
    ...

In the tests etc., we can still just use is_optimum_quanto_available directly.
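
For context, transformers implements these checks with an internal _is_package_available helper; the sketch below is a simplified, self-contained approximation of what such availability checks look like, not the library's exact code:

import importlib.metadata
import importlib.util
import warnings


def is_optimum_quanto_available():
    # The package installs as "optimum-quanto" but imports as optimum.quanto.
    if importlib.util.find_spec("optimum") is None:
        return False
    try:
        importlib.metadata.version("optimum-quanto")
        return True
    except importlib.metadata.PackageNotFoundError:
        return False


def is_quanto_available():
    # Deprecated path: warn and fall back to the old package, as suggested above.
    warnings.warn(
        "Importing from quanto will be deprecated. Please install optimum-quanto: "
        "`pip install optimum-quanto`"
    )
    return importlib.util.find_spec("quanto") is not None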

(Review thread on src/transformers/generation/utils.py: outdated, resolved.)
github-actions (bot) commented on Aug 1, 2024:

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Aug 10, 2024
@SunMarc SunMarc reopened this Aug 12, 2024
@github-actions github-actions bot closed this Aug 21, 2024
@BenjaminBossan (Member) commented on Sep 6, 2024:

It would be great if this PR could be revived :)

Btw, if you do, I think you can set is_trainable to True. Even though the PEFT support for quanto is still in PR state, it already works as is. The reason is that quanto QLinear modules are subclasses of nn.Linear, so PEFT just applies a normal lora.Linear layer. Some features like merging won't work (hence the need for the PR), but training and inference work. Here is some code for testing, which passed with PEFT v0.12.0:

import os

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments, Trainer
from transformers.utils.quantization_config import QuantizationMethod
from peft import LoraConfig, get_peft_model
from optimum.quanto import QuantizedModelForCausalLM, qint2, qint4, qint8


os.environ["CUDA_VISIBLE_DEVICES"] = "0"

device = "cpu"
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
data = load_dataset("ybelkada/english_quotes_copy")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)


def main(weights):
    config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", device_map=device)
    QuantizedModelForCausalLM.quantize(model, weights=weights)

    # Minimal stand-ins so that PEFT treats the model as quanto-quantized.
    class QuantizationConfig:
        quant_method = QuantizationMethod.QUANTO

    class HfQuantizer:
        is_trainable = True  #                                  <======= is False right now
        quantization_config = QuantizationConfig()

    model.hf_quantizer = HfQuantizer()
    model.is_quantized = True

    model = get_peft_model(model, config)
    assert "QLinear" in repr(model)      # quantized base layers are present
    assert "lora.Linear" in repr(model)  # PEFT wrapped them with regular LoRA layers

    trainer = Trainer(
        model=model,
        train_dataset=data["train"],
        args=TrainingArguments(
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            warmup_steps=2,
            max_steps=5,
            learning_rate=2e-4,
            logging_steps=1,
            output_dir="/tmp/peft/quanto",
        ),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    model.config.use_cache = False
    trainer.train()


if __name__ == "__main__":
    main(qint2)
    main(qint4)
    main(qint8)

(I do get a RuntimeError: CUDA error: an illegal memory access was encountered from quanto when running with CUDA, but that's a separate issue)
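
The subclass relationship this relies on can be checked directly; a quick sketch, assuming optimum-quanto is installed (in some versions QLinear may need to be imported from optimum.quanto.nn instead):

import torch.nn as nn
from optimum.quanto import QLinear

# quanto's QLinear inherits from nn.Linear, which is why PEFT's LoRA
# dispatch wraps it with a regular lora.Linear layer.
assert issubclass(QLinear, nn.Linear)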

@SunMarc SunMarc reopened this Sep 6, 2024
@dacorvo (Contributor) commented on Sep 30, 2024:

Hi there. I am considering pushing one last quanto version that:

  • has optimum-quanto as its single dependency,
  • raises an exception whenever a symbol is imported from quanto (instead of optimum.quanto).

This is because the obsolete quanto version still sees more than 50k downloads every week, and I would like to redirect users to the correct library.

I am, however, worried about the implications in transformers, so it would be great to either merge this pull request or make sure it is harmless.
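
A hypothetical sketch of what such a final redirect release could look like (this is illustrative, not dacorvo's actual implementation):

# quanto/__init__.py in a hypothetical final release of the old package
raise ImportError(
    "quanto has moved to optimum-quanto. Install it with "
    "`pip install optimum-quanto` and import from optimum.quanto instead."
)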

@SunMarc (Member, Author) commented on Sep 30, 2024:

(Quoting @dacorvo's comment above in full.)

Thanks for the heads-up! I will spend a bit of time today to finish this PR!

@HuggingFaceDocBuilderDev

Hey! 🤗 Thanks for your contribution to the transformers library!

Before merging this pull request, slow tests CI should be triggered. To enable this:

  • Add the run-slow label to the PR
  • When your PR is ready for merge and all reviewers' comments have been addressed, push an empty commit with the command [run-slow] followed by a comma-separated list of all the models to be tested, e.g. [run-slow] model_to_test_1, model_to_test_2
    • If the pull request affects a lot of models, put at most 10 models in the commit message
  • A transformers maintainer will then approve the workflow to start the tests

(For maintainers) The documentation for slow tests CI on PRs is here.
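
For example, such an empty commit can be created with `git commit --allow-empty -m "[run-slow] quanto"` (the model name here is only an illustration).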

Comment on lines 789 to 803

# We have two different APIs, since in optimum-quanto we don't use AffineQuantizer anymore
if is_optimum_quanto_available():
    from optimum.quanto import QBitsTensor

    qtensor = QBitsTensor.quantize(tensor, self.qtype, axis, self.q_group_size, scale, zeropoint)
    return qtensor
elif is_quanto_available():
    logger.warning_once(
        "Importing from quanto will be deprecated in v4.47. Please install optimum-quanto instead: `pip install optimum-quanto`"
    )
    from quanto import AffineQuantizer

    qtensor = AffineQuantizer.apply(tensor, self.qtype, axis, self.q_group_size, scale, zeropoint)

@SunMarc (Member, Author) commented on Sep 30, 2024:

Does this sound good, @dacorvo? I checked quantize_weights, but there are breaking changes between main and the latest version, with args being removed. Also, you recently introduced WeightQBitsTensor in quantize_weights.
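
One common way to cope with this kind of API churn is to gate on the installed package version; a hedged sketch (the cutoff "0.2.5" and the branch contents are placeholders, not what this PR actually settled on):

from importlib.metadata import version

from packaging.version import parse

# Illustrative version gate: choose a code path based on the installed
# optimum-quanto release. "0.2.5" is a placeholder cutoff.
if parse(version("optimum-quanto")) >= parse("0.2.5"):
    use_new_api = True   # e.g. the reworked quantize_weights signature
else:
    use_new_api = False  # e.g. the older QBitsTensor.quantize path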

@dacorvo (Contributor) replied:

This will break on the next release: I am doing it right now so that at least you can use the latest API.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan (Member) left a comment:

Generally LGTM, thanks Marc (and David for the recent release).

Just a question: should the is_trainable property also be updated (see my comment above)?

def is_trainable(self, model: Optional["PreTrainedModel"] = None):
    return False

(Resolved review threads on src/transformers/cache_utils.py and src/transformers/quantizers/quantizer_quanto.py.)
@SunMarc (Member, Author) commented on Oct 1, 2024:

(Quoting @BenjaminBossan's question above about the is_trainable property, #31732 (comment).)

Oh indeed! I'll change it!

@SunMarc SunMarc requested a review from dacorvo October 1, 2024 14:39
@dacorvo (Contributor) left a comment:

LGTM, thanks!

@amyeroberts (Collaborator) left a comment:

LGTM - nice handling of the versions. Thanks for adding!

@SunMarc SunMarc force-pushed the update-quanto-library branch from 08da020 to f4ab762 Compare October 2, 2024 11:42
@@ -193,7 +213,7 @@ def _process_model_after_weight_loading(self, model):

     @property
     def is_trainable(self, model: Optional["PreTrainedModel"] = None):
-        return False
+        return True

A collaborator commented:

Switching to optimum-quanto makes them trainable? 👁️ 👁️
If yes, shouldn't this be version-specific? Or am I missing something?!

@SunMarc (Member, Author) commented on Oct 2, 2024:

Trainable with PEFT! No changes in quanto, just in peft. cc @BenjaminBossan and this comment: #31732 (comment)

@SunMarc SunMarc merged commit cac4a48 into main Oct 2, 2024
25 checks passed
@SunMarc SunMarc deleted the update-quanto-library branch October 2, 2024 13:14
NielsRogge pushed a commit to NielsRogge/transformers that referenced this pull request Oct 21, 2024
* switch to optimum-quanto rebase squach

* fix import check

* again

* test try-except

* style
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024, and to innovationcore/transformers Dec 6, 2024, with the same commit messages as above.