
Experimental support for accelerate #826

Merged — 10 commits, Mar 4, 2022

Conversation

@BenjaminBossan (Collaborator) commented Dec 23, 2021

Description

Proposal to implement huggingface accelerate (#760). If this is adopted, we would probably no longer need to implement AMP support, since that is a subset of the functionality offered by accelerate.

For lack of appropriate hardware, I could not test whether this actually works.

Implementation

The first idea was to implement this as a callback, but that would require adding many new on_ methods that would, for now, serve only this callback.

Therefore, this feature is implemented as a mixin class, serving as a proof of concept of how this could look if implemented in the NeuralNet class itself.
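For illustration, usage of the mixin would look roughly like this (a sketch only; MyModule stands in for any torch.nn.Module, and the exact API may still change since this is experimental):

from accelerate import Accelerator
from skorch import NeuralNetClassifier
from skorch.helper import AccelerateMixin

class AcceleratedNet(AccelerateMixin, NeuralNetClassifier):
    """NeuralNetClassifier with support for accelerate."""

accelerator = Accelerator()  # AMP etc. are configured on the accelerator
net = AcceleratedNet(
    MyModule,                # placeholder for your torch.nn.Module
    accelerator=accelerator,
    device=None,             # let the accelerator handle device placement
)
net.fit(X, y)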

I'm not sure if we would need to use accelerator.utils.save.

The failing tests seem to be unrelated to this change.

@thomasjpfan (Member) left a comment:

Thank you for working on this! I tested on hardware and can confirm this works.

Using a mixin adds another design pattern, which can signal a limitation of the callbacks. The most user-friendly way would be to add a kwarg to NeuralNet, but that adds more complexity to NeuralNet. It's all trade-offs now.

I am happy to try mixins out.

skorch/helper.py (outdated):
class AccelerateMixin:
    """Mixin class to add support for huggingface's accelerate

    *Experimental*
@thomasjpfan (Member):

We can include a link to https://huggingface.co/docs/accelerate/index.html and installation instructions in the docstring.

@BenjaminBossan (Collaborator, Author):

If we agree to proceed with this design, I would add links and more information.


def test_fit_fp16(self, net, data):
    X, y = data
    net.fit(X, y)  # does not raise
@thomasjpfan (Member):

I do not see a public way to check that accelerate did something without mocking. All module and optimizer state in Accelerator is stored as a private attribute.

@BenjaminBossan (Collaborator, Author):

I will probably add tests with a fake accelerator that sets an attribute on the object (module.accelerated = True) or something like this for testing.
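A hypothetical sketch of such a test (names invented for illustration; the fake mimics Accelerator.prepare returning the prepared objects):

class FakeAccelerator:
    def prepare(self, *args):
        # mark everything that was passed through prepare
        for obj in args:
            obj.accelerated = True
        return args[0] if len(args) == 1 else args

    def backward(self, loss):
        loss.backward()

def test_module_is_prepared(net_cls, module_cls, data):
    X, y = data
    net = net_cls(module_cls, accelerator=FakeAccelerator())
    net.fit(X, y)
    assert net.module_.accelerated  # prepare was called on the module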


@pytest.fixture
def net(self, net_cls, module_cls, accelerator):
    return net_cls(
@thomasjpfan (Member):

As a sanity check, it would be good to verify that the NeuralNet can be pickled and unpickled with an accelerator.

@BenjaminBossan (Collaborator, Author):

Good point.
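A minimal sketch of such a check, assuming the net and data fixtures from this PR:

import pickle

def test_pickle_roundtrip(self, net, data):
    X, y = data
    net.fit(X, y)
    loaded = pickle.loads(pickle.dumps(net))
    loaded.predict(X)  # does not raise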

@BenjaminBossan (Collaborator, Author):

> I tested on hardware and can confirm this works.

Nice, thanks for testing this out. I would assume that certain combinations of settings might cause problems, but it would be impossible to check all of them (especially with the existing CI). I would be happy if we could cover the most common use cases; covering everything beyond that would be almost unfeasible, given the multitude of options available.

> The most user-friendly way would be to add a kwarg to NeuralNet, but that adds more complexity to NeuralNet. It's all trade-offs now.

This is what I meant when I wrote:

> this feature is implemented as a mixin class, serving as a proof of concept of how this could look if implemented in the NeuralNet class itself.

but I could have written that more clearly. I agree; the reason I started like this was just to get something out quickly and collect feedback (especially since it might not have worked at all). If we agree on the design, I could move the functionality to NeuralNet itself.

If we decide to go this way, what do you think would be better: having a couple of if self.accelerator is not None checks (verbose but more transparent), or having a fake accelerator object that just performs a no-op when self.accelerator.prepare is called (less verbose but less transparent)?

@thomasjpfan (Member):

> Having a couple of if self.accelerator is not None checks (verbose but more transparent) or having a fake accelerator object that just performs a no-op when self.accelerator.prepare is called (less verbose but less transparent).

If we place the logic into NeuralNet, I would go with the None check.
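Roughly, the None-check variant inside NeuralNet could look like this (just a sketch of the idea, not actual code from this PR):

def initialize_module(self):
    # existing skorch logic creates self.module_ first
    super().initialize_module()
    if self.accelerator is not None:
        # only hand the module over when an accelerator was passed
        self.module_ = self.accelerator.prepare(self.module_)
    return self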

@BenjaminBossan assigned ottonemo and unassigned ottonemo Jan 5, 2022
@BenjaminBossan (Collaborator, Author):

@thomasjpfan I discussed this with @ottonemo and we decided to leave the feature as a mixin class for now and not to integrate it into NeuralNet. The reason is that accelerate is still quite young and its API might change in the future. By not integrating it into NeuralNet yet, it becomes clear that this is an experimental feature, with possibly backwards-incompatible changes coming.

@thomasjpfan (Member):

Okay, let's go with the mixin.

@BenjaminBossan marked this pull request as ready for review February 13, 2022 17:29
- Add installation instructions to the docstring
- Add section to helper.rst
- Minor fixes
@BenjaminBossan (Collaborator, Author):

@thomasjpfan I added more documentation for this new feature. This should hopefully help users discover the feature (and provide feedback if there are any issues).

    MyModule,
    accelerator=accelerator,
    device=None,
    callbacks__print_log__sink=accelerator.print)
@thomasjpfan (Member) commented Feb 16, 2022:

Should we set callbacks__print_log__sink automatically? It feels like more boilerplate for the user to write.

@BenjaminBossan (Collaborator, Author):

Yes, true. On the other hand, since the accelerator does not exist when we define AccelerateMixin, we cannot set it as a default argument. We would need to do something else, like:

def initialize(self):
    super().initialize()
    self.set_params(callbacks__print_log__sink=self.accelerator.print)
    return self

Not sure if the user would have any reason not to set this, but if they had, this hard-coding would prove annoying.

@thomasjpfan (Member):

I think having good defaults is generally better. If a user does not want to set it like this, I'm guessing they are more advanced and would need to override initialize to undo it.

It's a toss-up for me.

@BenjaminBossan (Collaborator, Author):

I tried a different solution which hopefully lets us have our cake and eat it too. Please take a look: 9d3e5e0

This is achieved by setting the default callbacks__print_log__sink to 'auto'. When this is detected later, it is replaced with self.accelerator.print if available.

This way, we can use the sane default even though we cannot set it directly as a default argument (because the accelerator instance does not exist yet), while still giving the user the option to pass a different argument.
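In outline, the mechanism described in this commit message could be implemented like this (a simplified sketch, not the actual diff in 9d3e5e0):

def initialize_callbacks(self, *args, **kwargs):
    super().initialize_callbacks(*args, **kwargs)
    for _, cb in self.callbacks_:
        # replace the 'auto' sentinel with the accelerator's print;
        # an explicitly set sink is left untouched
        if getattr(cb, 'sink', None) == 'auto':
            cb.sink = self.accelerator.print
    return self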
skorch/helper.py (outdated):
>>> net = AcceleratedNet(
...     MyModule,
...     accelerator=accelerator,
...     device=None,
@thomasjpfan (Member):

We can likely make device="auto" too.

@BenjaminBossan (Collaborator, Author):

I'm not so sure about this one, as it's not as clear-cut as the other default. Also, changing device is not as obscure as the print log sink.

@thomasjpfan (Member):

As with print, I cannot really think of a case where device is not None when using Accelerator.

@BenjaminBossan (Collaborator, Author):

Okay, you convinced me; I changed the default to None.

It's necessary to assign the prepared components; pure in-place mutation seems to create bugs.
@thomasjpfan (Member) left a comment:

Thanks for the update! I think we can install accelerate on the CI and run the CPU tests.

When using AMP, accelerate applies grad scaling under the hood using GradScaler. That does not support passing the train step as a closure to optimizer.step. Therefore, we need to step explicitly.

We could use a more sophisticated approach of trying to figure out whether the grad scaler is actually being used and only stepping explicitly if needed. However, the need for the closure is quite rare, and we want to treat accelerate as a black box instead of relying on implementation details (which we would have to in order to figure out when grad scaling is applied).
This way, it is tested during CI (as far as possible with only a CPU).
@BenjaminBossan (Collaborator, Author):

@thomasjpfan I ran into one more bug when I got my hands on an NVIDIA T4 GPU. Namely, when using fp16, accelerate uses GradScaler under the hood, which does not support passing the train step as a 'closure'. However, this is what we do by default, so I had to sidestep that.

I think in the future, we might consider making the 'closure' pattern an explicit option that the user can opt into via a parameter, instead of doing it implicitly. Then we could also give a proper error when a user tries to use a closure with accelerate.
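In simplified form, the workaround amounts to stepping the optimizer explicitly instead of passing a closure (a sketch, not the exact merged code):

def train_step(self, batch, **fit_params):
    self.module_.train()
    self.optimizer_.zero_grad()
    # train_step_single computes the loss and calls backward
    step = self.train_step_single(batch, **fit_params)
    # GradScaler rejects closures, so step explicitly instead of
    # passing a step function to optimizer_.step
    self.optimizer_.step()
    return step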

@thomasjpfan (Member) left a comment:

Thanks for the fix with closures!

I ran the tests locally with an Ampere GPU and everything passes. I also adapted notebooks/MNIST.ipynb to use Accelerator, and fp16 works as expected.

LGTM

@BenjaminBossan (Collaborator, Author):

Thanks @thomasjpfan for doing another round of testing. So now we know it works on Ampere, Turing, and CPU :D That's probably sufficient testing for an experimental feature.

@BenjaminBossan deleted the support-huggingface-accelerate branch April 18, 2022 11:03