
Experimental support for accelerate #826

Merged — 10 commits, Mar 4, 2022

Conversation

@BenjaminBossan (Collaborator) commented Dec 23, 2021

Description

Proposal to implement huggingface accelerate (#760). If this is adopted, we would probably no longer need to implement AMP support, since that is a subset of the functionality offered by accelerate.

For lack of appropriate hardware, I could not test whether this actually works.

Implementation

The first idea was to implement this as a callback, but that would require adding many new on_ methods that would, for now, serve only this callback.

Therefore, this feature is implemented as a mixin class, serving as a proof of concept of how this could look if implemented in the NeuralNet class itself.
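For illustration, usage of the mixin would look roughly like this (a sketch only; MyModule stands in for any torch.nn.Module, and the exact API may still change since this is experimental):

from accelerate import Accelerator
from skorch import NeuralNetClassifier
from skorch.helper import AccelerateMixin

class AcceleratedNet(AccelerateMixin, NeuralNetClassifier):
    """NeuralNetClassifier with support for accelerate."""

accelerator = Accelerator()  # AMP etc. are configured on the accelerator
net = AcceleratedNet(
    MyModule,                # placeholder for your torch.nn.Module
    accelerator=accelerator,
    device=None,             # let the accelerator handle device placement
)
net.fit(X, y)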

I'm not sure if we would need to use accelerator.utils.save.

The failing tests seem to be unrelated to this change.

@thomasjpfan (Member) left a comment:

Thank you for working on this! I tested on hardware and can confirm this works.

Using a mixin adds another design pattern, which can signal a limitation of the callbacks. The most user-friendly way would be to add a kwarg to NeuralNet, but that adds more complexity to NeuralNet. It's all trade-offs now.

I am happy to try mixins out.

skorch/helper.py (outdated):
class AccelerateMixin:
    """Mixin class to add support for huggingface's accelerate

    *Experimental*
@thomasjpfan (Member):

We can include a link to https://huggingface.co/docs/accelerate/index.html and installation instructions in the docstring.

@BenjaminBossan (Collaborator, Author):

If we agree to proceed with this design, I would add links and more information.


def test_fit_fp16(self, net, data):
    X, y = data
    net.fit(X, y)  # does not raise
@thomasjpfan (Member):

I do not see a public way to check that accelerate did something without mocking. All module and optimizer state in Accelerator is stored as a private attribute.

@BenjaminBossan (Collaborator, Author):

I will probably add tests with a fake accelerator that sets an attribute on the object (module.accelerated = True) or something like this for testing.
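A hypothetical sketch of such a test (names invented for illustration; the fake mimics Accelerator.prepare returning the prepared objects):

class FakeAccelerator:
    def prepare(self, *args):
        # mark everything that was passed through prepare
        for obj in args:
            obj.accelerated = True
        return args[0] if len(args) == 1 else args

    def backward(self, loss):
        loss.backward()

def test_module_is_prepared(net_cls, module_cls, data):
    X, y = data
    net = net_cls(module_cls, accelerator=FakeAccelerator())
    net.fit(X, y)
    assert net.module_.accelerated  # prepare was called on the module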


@pytest.fixture
def net(self, net_cls, module_cls, accelerator):
    return net_cls(
@thomasjpfan (Member):

As a sanity check, it would be good to verify that the NeuralNet can be pickled and unpickled with an accelerator.

@BenjaminBossan (Collaborator, Author):

Good point.
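A minimal sketch of such a check, assuming the net and data fixtures from this PR:

import pickle

def test_pickle_roundtrip(self, net, data):
    X, y = data
    net.fit(X, y)
    loaded = pickle.loads(pickle.dumps(net))
    loaded.predict(X)  # does not raise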

@BenjaminBossan (Collaborator, Author):

> I tested on hardware and can confirm this works.

Nice, thanks for testing this out. I would assume that certain combinations of settings might cause problems, but it would be impossible to check all of them (especially with the existing CI). I would be happy if we could cover the most common use cases; covering everything beyond that would be almost unfeasible, given the multitude of options available.

> The most user-friendly way would be to add a kwarg to NeuralNet, but that adds more complexity to NeuralNet. It's all trade-offs now.

This is what I meant when I wrote:

> this feature is implemented as a mixin class, serving as a proof of concept of how this could look if implemented in the NeuralNet class itself.

but I could have written that more clearly. I agree; the reason I started like this was just to get something out quickly and collect feedback (especially since it might not have worked at all). If we agree on the design, I could move the functionality to NeuralNet itself.

If we decide to go this way, what do you think would be better: having a couple of if self.accelerator is not None checks (verbose but more transparent), or having a fake accelerator object that just performs a no-op when self.accelerator.prepare is called (less verbose but less transparent)?

@thomasjpfan (Member):

> Having a couple of if self.accelerator is not None checks (verbose but more transparent) or having a fake accelerator object that just performs a no-op when self.accelerator.prepare is called (less verbose but less transparent).

If we place the logic into NeuralNet, I would go with the None check.
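Roughly, the None-check variant inside NeuralNet could look like this (just a sketch of the idea, not actual code from this PR):

def initialize_module(self):
    # existing skorch logic creates self.module_ first
    super().initialize_module()
    if self.accelerator is not None:
        # only hand the module over when an accelerator was passed
        self.module_ = self.accelerator.prepare(self.module_)
    return self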

@BenjaminBossan assigned ottonemo and unassigned ottonemo Jan 5, 2022
@BenjaminBossan (Collaborator, Author):

@thomasjpfan I discussed this with @ottonemo and we decided to leave the feature as a mixin class for now and not to integrate it into NeuralNet. The reason is that accelerate is still quite young and its API might change in the future. By not integrating it into NeuralNet yet, it becomes clear that this is an experimental feature, with possibly backwards-incompatible changes coming.

@thomasjpfan (Member):

Okay, let's go with the mixin.

@BenjaminBossan marked this pull request as ready for review February 13, 2022 17:29
- Add installation instructions to the docstring
- Add section to helper.rst
- Minor fixes
@BenjaminBossan (Collaborator, Author):

@thomasjpfan I added more documentation for this new feature. This should hopefully help users discover the feature (and provide feedback if there are any issues).

    MyModule,
    accelerator=accelerator,
    device=None,
    callbacks__print_log__sink=accelerator.print)
@thomasjpfan (Member) commented Feb 16, 2022:

Should we set callbacks__print_log__sink automatically? It feels like more boilerplate for the user to write.

@BenjaminBossan (Collaborator, Author):

Yes, true. On the other hand, since the accelerator does not exist when we define AccelerateMixin, we cannot set it as a default argument. We would need to do something else, like:

def initialize(self):
    super().initialize()
    self.set_params(callbacks__print_log__sink=self.accelerator.print)
    return self

Not sure if the user would have any reason not to set this, but if they had, this hard-coding would prove annoying.

@thomasjpfan (Member):

I think having good defaults is generally better. If a user does not want to set it like this, I'm guessing they are more advanced and would need to override initialize to undo it.

It's a toss-up for me.

@BenjaminBossan (Collaborator, Author):

I tried a different solution which hopefully lets us have our cake and eat it too. Please take a look: 9d3e5e0

This is achieved by setting the default callbacks__print_log__sink to 'auto'. When this is detected later, it is replaced with self.accelerator.print if available.

This way, we can use the sane default even though we cannot set it directly as a default argument (because the accelerator instance does not exist yet), while still giving the user the option to pass a different argument.
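In outline, the mechanism described in this commit message could be implemented like this (a simplified sketch, not the actual diff in 9d3e5e0):

def initialize_callbacks(self, *args, **kwargs):
    super().initialize_callbacks(*args, **kwargs)
    for _, cb in self.callbacks_:
        # replace the 'auto' sentinel with the accelerator's print;
        # an explicitly set sink is left untouched
        if getattr(cb, 'sink', None) == 'auto':
            cb.sink = self.accelerator.print
    return self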
skorch/helper.py (outdated):
>>> net = AcceleratedNet(
...     MyModule,
...     accelerator=accelerator,
...     device=None,
@thomasjpfan (Member):

We can likely make device="auto" too.

@BenjaminBossan (Collaborator, Author):

I'm not so sure about this one, as it's not as clear-cut as the other default. Also, changing device is not as obscure as the print log sink.

@thomasjpfan (Member):

As with print, I cannot really think of a case where device is not None when using Accelerator.

@BenjaminBossan (Collaborator, Author):

Okay, you convinced me; I changed the default to None.

It's necessary to assign the prepared components; pure in-place mutation seems to create bugs.
@thomasjpfan (Member) left a comment:

Thanks for the update! I think we can install accelerate on the CI and run the CPU tests.

When using AMP, accelerate applies grad scaling under the hood using GradScaler. That does not support passing the train step as a closure to optimizer.step. Therefore, we need to step explicitly.

We could use a more sophisticated approach of trying to figure out whether the grad scaler is actually being used and only stepping explicitly if needed. However, the need for the closure is quite rare, and we want to treat accelerate as a black box instead of relying on implementation details (which we would have to in order to figure out when grad scaling is applied).
This way, it is tested during CI (as far as possible with only a CPU).
@BenjaminBossan (Collaborator, Author):

@thomasjpfan I ran into one more bug when I got my hands on an NVIDIA T4 GPU. Namely, when using fp16, accelerate uses GradScaler under the hood, which does not support passing the train step as a 'closure'. However, this is what we do by default, so I had to sidestep that.

I think in the future, we might consider making the 'closure' pattern an explicit option that the user can opt into via a parameter, instead of doing it implicitly. Then we could also give a proper error when a user tries to use a closure with accelerate.
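In simplified form, the workaround amounts to stepping the optimizer explicitly instead of passing a closure (a sketch, not the exact merged code):

def train_step(self, batch, **fit_params):
    self.module_.train()
    self.optimizer_.zero_grad()
    # train_step_single computes the loss and calls backward
    step = self.train_step_single(batch, **fit_params)
    # GradScaler rejects closures, so step explicitly instead of
    # passing a step function to optimizer_.step
    self.optimizer_.step()
    return step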

@thomasjpfan (Member) left a comment:

Thanks for the fix with closures!

I ran the tests locally with an Ampere GPU and everything passes. I also adapted notebooks/MNIST.ipynb to use Accelerator, and fp16 works as expected.

LGTM

@BenjaminBossan (Collaborator, Author):

Thanks @thomasjpfan for doing another round of testing. So now we know it works on Ampere, Turing, and CPU :D That's probably sufficient testing for an experimental feature.

@BenjaminBossan deleted the support-huggingface-accelerate branch April 18, 2022 11:03