
add 10b experiment to flava native and fix checkpoint wrapper #309

Closed · wants to merge 10 commits

Conversation


@edward-io (Contributor) commented Sep 12, 2022

Summary:

Adds 10B experiment config to FLAVA native training script and fixes issues with checkpointing due to kwargs and re-entrant.

Test plan:

torchrun --nproc_per_node=8 -m flava.native.train config=flava/native/configs/10b.yaml

Fixes #{issue number}
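
For reviewers without prior context, here is a rough sketch (not the actual diff) of how the activation-checkpoint wrapper is typically applied in a script like this, assuming PyTorch's torch.distributed checkpoint_wrapper API; the TransformerEncoderLayer import path is inferred from the Codecov file list, and build_flava_model is a placeholder:

from functools import partial

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing_wrapper,
    checkpoint_wrapper,
)
from torchmultimodal.modules.layers.transformer import TransformerEncoderLayer

model = build_flava_model()  # placeholder for the actual FLAVA model construction

# Wrap each transformer layer so its activations are recomputed during backward
# instead of being kept in memory (re-entrant implementation shown here).
wrapper_fn = partial(
    checkpoint_wrapper,
    offload_to_cpu=False,
    checkpoint_impl=CheckpointImpl.REENTRANT,
)
apply_activation_checkpointing_wrapper(
    model,
    checkpoint_wrapper_fn=wrapper_fn,
    check_fn=lambda module: isinstance(module, TransformerEncoderLayer),
)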

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Sep 12, 2022
codecov-commenter commented Sep 12, 2022

Codecov Report

Base: 93.63% // Head: 93.63% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (a6fff47) compared to base (b1c6813).
Patch has no changes to coverable lines.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #309   +/-   ##
=======================================
  Coverage   93.63%   93.63%           
=======================================
  Files          54       54           
  Lines        3235     3236    +1     
=======================================
+ Hits         3029     3030    +1     
  Misses        206      206           
Impacted Files                                    Coverage Δ
torchmultimodal/modules/layers/transformer.py     99.18% <ø> (ø)
torchmultimodal/models/gpt.py                      98.96% <0.00%> (ø)
torchmultimodal/utils/generate.py                  100.00% <0.00%> (ø)


@langong347
Contributor

I know this is still a draft. Would you mind adding a proper summary and test plan, so team members who don't have prior context on this workstream can follow? Thanks :)

@edward-io changed the title from "[draft] flava native 10b manual checkpoint" to "add 10b experiment to flava native and fix checkpoint wrapper" Sep 16, 2022
@facebook-github-bot
Contributor

@edward-io has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

examples/flava/native/train.py (outdated)
    offload_to_cpu=False,
    checkpoint_impl=checkpoint_impl,
)
apply_activation_checkpointing_wrapper(
Contributor

I don't like that this is the state we're in, but can we apply apply_ac_wrapper after the FSDP wrap when using non-reentrant, and before the FSDP wrap when using reentrant? That way both of them would work (the difference is not good).

Contributor Author

Do we want to keep non-reentrant if its memory usage is higher than reentrant's?

Contributor

Good point, let's just remove non-reentrant.
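
To make the ordering from this thread concrete, a hedged sketch (illustrative, not the PR's final code; wrapper_fn and check_fn are the same placeholders as in the sketch in the PR description):

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing_wrapper,
)
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Re-entrant: apply the activation-checkpoint wrapper to the unwrapped model first...
apply_activation_checkpointing_wrapper(
    model, checkpoint_wrapper_fn=wrapper_fn, check_fn=check_fn
)
# ...and only then wrap in FSDP (auto-wrap policy and other kwargs omitted for brevity).
model = FSDP(model)
# Non-reentrant would need the opposite order (checkpoint wrapper applied after the
# FSDP wrap), which is the asymmetry that motivates dropping non-reentrant entirely.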

@edward-io marked this pull request as ready for review September 16, 2022 00:38
examples/flava/native/train.py (resolved)
examples/flava/native/train.py (outdated, resolved)
examples/flava/native/configs/10b.yaml (resolved)
examples/flava/native/configs/10b.yaml (outdated, resolved)
enable_half_reduce_in_fsdp: True # handles the reduction across devices in half precision

activation_checkpointing: True
activation_checkpointing_reentrant: True # false for non-reentrant
Contributor

Can you make this True in the other configs, or just remove it per the other comment?
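
For reference, a hedged guess at how a flag like this maps onto the wrapper implementation inside the training script (the config attribute name simply mirrors the YAML key; the actual code is not shown in this excerpt):

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import CheckpointImpl

# Translate the YAML flag into the checkpoint implementation enum.
checkpoint_impl = (
    CheckpointImpl.REENTRANT
    if config.activation_checkpointing_reentrant
    else CheckpointImpl.NO_REENTRANT
)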

auto_wrap_policy=partial(
    transformer_auto_wrap_policy,
-   transformer_layer_cls={TransformerEncoderLayer},
+   transformer_layer_cls={
+       TransformerEncoderLayer,
Contributor

remove

Contributor Author

@ankitade I thought we wanted to keep this for the ImageNet validation?
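
For context on this hunk: transformer_auto_wrap_policy accepts a set of module classes, so extra types can sit alongside TransformerEncoderLayer. A minimal sketch under that assumption (the TransformerEncoderLayer import path is inferred from the Codecov file list; the additional entry from the diff is truncated in this excerpt, so it is only indicated by a comment):

from functools import partial

from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torchmultimodal.modules.layers.transformer import TransformerEncoderLayer

wrap_policy = partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={
        TransformerEncoderLayer,
        # plus whichever extra layer class the diff adds for the ImageNet
        # validation path (not visible in this excerpt)
    },
)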

@facebook-github-bot
Contributor

@edward-io has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

strategy: fsdp # can be changed to ddp or fsdp
seed: 1337

batch_size: 12
Contributor

Suggested change:
- batch_size: 12
+ batch_size: 8
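
For scale: with the 8-process torchrun command from the test plan and no gradient accumulation (an assumption), this suggestion lowers the effective global batch size from 12 × 8 = 96 to 8 × 8 = 64.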
