Restore fp16 support on xla gpu device #22300
Conversation
This should work, as #21428 changed the logic so that nothing needs to be done for bfloat16 anymore.
Thanks for the fix!
@@ -598,7 +598,7 @@ def __init__(
            logger.info(f"Using {args.half_precision_backend} half precision backend")

        self.do_grad_scaling = False
-       if (args.fp16 or args.bf16) and not (args.deepspeed or is_sagemaker_mp_enabled() or is_torch_tpu_available()):
+       if (args.fp16 or args.bf16) and not (args.deepspeed or is_sagemaker_mp_enabled()):
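To illustrate the effect of the change: with the old guard, is_torch_tpu_available() short-circuited this branch on XLA devices, so fp16 never got an AMP dtype or a grad scaler there. A minimal sketch of the gating logic, with simplified names (setup_mixed_precision, deepspeed_enabled, sagemaker_mp_enabled are illustrative, not the Trainer's actual signature):

import torch

def setup_mixed_precision(args, deepspeed_enabled=False, sagemaker_mp_enabled=False):
    """Sketch: decide whether AMP (and gradient scaling) should be configured."""
    amp_dtype = None
    do_grad_scaling = False

    # The restored condition: no is_torch_tpu_available() check, so XLA GPU
    # runs with --fp16 take the AMP path again.
    if (args.fp16 or args.bf16) and not (deepspeed_enabled or sagemaker_mp_enabled):
        # Pick the autocast dtype from the training arguments.
        amp_dtype = torch.float16 if args.fp16 else torch.bfloat16
        # Loss scaling is only needed for fp16; bf16 keeps fp32's exponent range.
        do_grad_scaling = amp_dtype == torch.float16

    return amp_dtype, do_grad_scaling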
Will you separate bf16 out? We don't need to scale for bf16.
See comment and PR mentioned above.
Would this work for bf16 on XLA?
if args.fp16 and not (args.deepspeed or is_sagemaker_mp_enabled()):
Thanks! Let's merge this for XLA GPU then. I will check on my side and fix if needed.
I think this check ensures bf16 won't run grad scaling.
self.do_grad_scaling = self.amp_dtype == torch.float16
if self.do_grad_scaling:
Please revise if you see anything wrong for trainium.
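A minimal sketch of why that check is sufficient (the helper below and its arguments are illustrative, not the Trainer's actual code; device_type is shown as "cuda" for brevity): a scaler is only created when the autocast dtype is float16, so a bf16 run never scales gradients.

import torch

def training_step(model, batch, optimizer, amp_dtype, scaler=None):
    # Forward pass under autocast with the chosen mixed-precision dtype.
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        loss = model(**batch).loss  # assumes an output object with a .loss field

    if scaler is not None:
        # fp16 path: scale the loss to avoid gradient underflow, then step via the scaler.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    else:
        # bf16 path: the wide exponent range makes loss scaling unnecessary.
        loss.backward()
        optimizer.step()
    optimizer.zero_grad()
    return loss.detach()

# Mirrors do_grad_scaling = self.amp_dtype == torch.float16:
# scaler = torch.cuda.amp.GradScaler() if amp_dtype == torch.float16 else None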
#22307 is the fix for Neuron.
The documentation is not available anymore as the PR was closed or merged.
The failure is unrelated and due to the branch being old; it is fixed on main, so merging.
…ingface#22307) This PR fixes the "RuntimeError: No CUDA GPUs are available" when running with --bf16 option on Neuron. Related PRs: huggingface#20684 huggingface#22300
…" (huggingface#22451) This reverts commit fd81746.
#20684 accidentally disabled fp16 support on the XLA GPU device, which led to a significant performance regression. This PR restores that feature.
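For illustration only, a minimal usage sketch of the configuration this restores (the model, dataset, and argument values are placeholders, not the actual test setup, and the XLA launch/runtime setup is omitted):

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder model and data; any small classification setup works for the sketch.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("glue", "sst2", split="train[:1%]")
dataset = dataset.map(lambda ex: tokenizer(ex["sentence"], truncation=True), batched=True)

args = TrainingArguments(
    output_dir="out",
    fp16=True,  # with this PR, takes effect on XLA GPU devices again
    per_device_train_batch_size=8,
    max_steps=10,
)
Trainer(model=model, args=args, train_dataset=dataset, tokenizer=tokenizer).train()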
cc @jeffhataws @sgugger @Lokiiiiii
Tested with