Restore fp16 support on xla gpu device #22300

ymwangg · 2023-03-21T19:08:10Z

#20684 accidentally disabled fp16 support on XLA GPU devices, which causes a significant performance regression. This PR restores it.

cc @jeffhataws @sgugger @Lokiiiiii

Tested with

GPU_NUM_DEVICES=1 python run_mlm.py \
    --model_name_or_path bert-base-uncased \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --overwrite_output_dir true \
    --output_dir /tmp/test-mlm \
    --per_gpu_train_batch_size 24 \
    --do_eval \
    --fp16 true \
    --do_train \
    --num_train_epochs 3 \
    --optim adamw_torch_xla

***** train metrics *****
  epoch                    =        3.0
  train_loss               =     1.7725
  train_runtime            = 0:04:58.00
  train_samples            =       4627
  train_samples_per_second =      46.58
  train_steps_per_second   =      1.943
INFO:__main__:*** Evaluate ***
[INFO|trainer.py:739] 2023-03-21 19:05:53,483 >> The following columns in the evaluation set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
[INFO|trainer.py:3072] 2023-03-21 19:05:53,487 >> ***** Running Evaluation *****
[INFO|trainer.py:3074] 2023-03-21 19:05:53,487 >>   Num examples = 479
[INFO|trainer.py:3077] 2023-03-21 19:05:53,487 >>   Batch size = 8
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:07<00:00,  8.38it/s]
***** eval metrics *****
  epoch                   =        3.0
  eval_loss               =     1.5811
  eval_runtime            = 0:00:29.83
  eval_samples            =        479
  eval_samples_per_second =     16.055
  eval_steps_per_second   =      2.011
  perplexity              =     4.8601
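As a rough sanity check, the throughput and perplexity figures above can be re-derived from the other reported values. This sketch assumes a single device (so a per-step batch of 24 for training and 8 for evaluation, as in the command and log) and no gradient accumulation. The variable names are illustrative only. Small differences from the reported numbers are expected because the logged runtimes and loss are rounded.

```python
import math

# Values copied from the reported train and eval metrics.
train_samples, epochs, train_runtime_s = 4627, 3, 298.00
eval_samples, eval_runtime_s, eval_loss = 479, 29.83, 1.5811

# Assumptions: one device, so the per-step batch is 24 for training and
# 8 for evaluation, with no gradient accumulation.
train_batch, eval_batch = 24, 8

# The last batch of an epoch may be partial, hence the ceiling.
train_steps = math.ceil(train_samples / train_batch) * epochs
eval_steps = math.ceil(eval_samples / eval_batch)

derived = {
    "train_samples_per_second": train_samples * epochs / train_runtime_s,
    "train_steps_per_second": train_steps / train_runtime_s,
    "eval_samples_per_second": eval_samples / eval_runtime_s,
    "eval_steps_per_second": eval_steps / eval_runtime_s,
    # Perplexity is the exponential of the mean eval loss.
    "perplexity": math.exp(eval_loss),
}

for name, value in derived.items():
    print(f"{name:26s} = {value:.3f}")
```

The derived evaluation step count (60) also agrees with the `60/60` shown in the progress bar.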

sgugger

Should work, as #21428 changed the logic so that nothing needs to be done for bfloat16 anymore.

Thanks for the fix!

jeffhataws · 2023-03-21T19:14:48Z

src/transformers/trainer.py

@@ -598,7 +598,7 @@ def __init__(
            logger.info(f"Using {args.half_precision_backend} half precision backend")

        self.do_grad_scaling = False
-        if (args.fp16 or args.bf16) and not (args.deepspeed or is_sagemaker_mp_enabled() or is_torch_tpu_available()):
+        if (args.fp16 or args.bf16) and not (args.deepspeed or is_sagemaker_mp_enabled()):


Will you separate bf16 out? We don't need to scale for bf16.

sgugger · 2023-03-21

See the comment and PR mentioned above.

jeffhataws · 2023-03-21

Would this work for bf16 on XLA?

if args.fp16 and not (args.deepspeed or is_sagemaker_mp_enabled()):

jeffhataws · 2023-03-21

Thanks! Let's merge this for XLA GPU then. I will check on my side and fix if needed.

ymwangg · 2023-03-21

I think this check ensures bf16 won't run grad scaling.

self.do_grad_scaling = self.amp_dtype == torch.float16
if self.do_grad_scaling:

Please revise if you see anything wrong for trainium.

jeffhataws · 2023-03-22

#22307 is the fix for Neuron.
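The condition discussed in this thread can be sketched in plain Python. This is a simplified model under stated assumptions, not the Trainer code. `PrecisionArgs`, `enters_amp_setup`, `needs_grad_scaling`, and the `on_xla` / `exclude_xla` parameters are hypothetical names. Dtypes are represented as strings so the sketch runs without torch, and `on_xla` stands in for `is_torch_tpu_available()` returning true.

```python
from dataclasses import dataclass


@dataclass
class PrecisionArgs:
    """Hypothetical stand-in for the few TrainingArguments fields involved."""
    fp16: bool = False
    bf16: bool = False
    deepspeed: bool = False


def enters_amp_setup(args, sagemaker_mp=False, on_xla=False, exclude_xla=False):
    """Model the condition changed in the diff.

    exclude_xla=True models the condition before this PR, which skipped the
    block on any XLA device. exclude_xla=False models the restored condition.
    """
    skipped = args.deepspeed or sagemaker_mp or (exclude_xla and on_xla)
    return (args.fp16 or args.bf16) and not skipped


def needs_grad_scaling(args, **kwargs):
    """Only fp16 needs loss scaling; bf16 keeps the fp32 exponent range."""
    if not enters_amp_setup(args, **kwargs):
        return False
    amp_dtype = "float16" if args.fp16 else "bfloat16"
    # Same shape as the check quoted above: scale only for float16.
    return amp_dtype == "float16"


# fp16 on an XLA GPU, before and after the change to the condition.
print(needs_grad_scaling(PrecisionArgs(fp16=True), on_xla=True, exclude_xla=True))
print(needs_grad_scaling(PrecisionArgs(fp16=True), on_xla=True))
# bf16 enters the block but does not request scaling.
print(needs_grad_scaling(PrecisionArgs(bf16=True), on_xla=True))
```

Under these assumptions, fp16 on an XLA device requests scaling only with the restored condition, and bf16 never does. The sketch does not model which scaler is constructed or whether that scaler works on a given backend, which is the concern raised for Neuron.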

HuggingFaceDocBuilderDev · 2023-03-21T19:23:15Z

The documentation is not available anymore as the PR was closed or merged.

sgugger · 2023-03-21T20:32:39Z

The failure is unrelated and due to the branch being old. It's fixed on main, so merging.

Restore fp16 support on xla gpu device

984eb9a

sgugger approved these changes Mar 21, 2023

jeffhataws reviewed Mar 21, 2023

jeffhataws approved these changes Mar 21, 2023

sgugger merged commit d35f729 into huggingface:main Mar 21, 2023

jeffhataws mentioned this pull request Mar 22, 2023

Fix --bf16 option support for Neuron after PR #22300 #22307

Merged

sgugger pushed a commit that referenced this pull request Mar 23, 2023

Fix --bf16 option support for Neuron after PR #22300 (#22307)

ec9b18f

This PR fixes the "RuntimeError: No CUDA GPUs are available" when running with --bf16 option on Neuron. Related PRs: #20684 #22300

jeffhataws added a commit to jeffhataws/transformers that referenced this pull request Mar 29, 2023

Revert "Fix --bf16 option support for Neuron after PR huggingface#22300"

bc8f00e

This reverts commit fd81746.

sgugger pushed a commit that referenced this pull request Mar 29, 2023

Revert "Fix --bf16 option support for Neuron after PR #22300" (#22451)

5e89a43

This reverts commit fd81746.

raghavanone pushed a commit to raghavanone/transformers that referenced this pull request Apr 5, 2023

Restore fp16 support on xla gpu device (huggingface#22300)

5d15c1b

raghavanone pushed a commit to raghavanone/transformers that referenced this pull request Apr 5, 2023

Revert "Fix --bf16 option support for Neuron after PR huggingface#22300" (huggingface#22451)

e832063

This reverts commit fd81746.

novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023

Restore fp16 support on xla gpu device (huggingface#22300)

f41ff75

novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023

Revert "Fix --bf16 option support for Neuron after PR huggingface#22300" (huggingface#22451)

5cf5c92

This reverts commit fd81746.
