add 10b experiment to flava native and fix checkpoint wrapper #309
Conversation
Codecov Report — Base: 93.63% // Head: 93.63% // Project coverage is effectively unchanged (+1 covered line on +1 line).
Additional details and impacted files:
@@            Coverage Diff             @@
##              main     #309   +/-   ##
=========================================
  Coverage    93.63%   93.63%
=========================================
  Files           54       54
  Lines         3235     3236     +1
=========================================
+ Hits          3029     3030     +1
  Misses         206      206
I know this is still a draft. Would you mind adding a proper summary and a test plan so that members of the team who do not have prior context on this work stream can understand? Thanks :)
@edward-io has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
    offload_to_cpu=False,
    checkpoint_impl=checkpoint_impl,
)
apply_activation_checkpointing_wrapper(
I don't like that this is the state we're in, but can we apply apply_ac_wrapper after the FSDP wrap when using non-reentrant, and before the FSDP wrap when using reentrant? That way both would work (having this difference is not good).
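For context, a minimal sketch of the two orderings being discussed, assuming the PyTorch 1.13-era checkpoint_wrapper API; this is illustrative only, and `model` and `layer_cls` are placeholders rather than names from this diff:

from functools import partial

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing_wrapper,
    checkpoint_wrapper,
)


def wrap_with_ac_and_fsdp(model, layer_cls, reentrant: bool):
    # Checkpoint only the transformer blocks (layer_cls is a placeholder class).
    check_fn = lambda submodule: isinstance(submodule, layer_cls)
    wrapper_fn = partial(
        checkpoint_wrapper,
        offload_to_cpu=False,
        checkpoint_impl=(
            CheckpointImpl.REENTRANT if reentrant else CheckpointImpl.NO_REENTRANT
        ),
    )
    if reentrant:
        # Reentrant: apply the activation-checkpointing wrapper first, then FSDP.
        apply_activation_checkpointing_wrapper(
            model, checkpoint_wrapper_fn=wrapper_fn, check_fn=check_fn
        )
        model = FSDP(model)
    else:
        # Non-reentrant: wrap with FSDP first, then apply the checkpointing wrapper.
        model = FSDP(model)
        apply_activation_checkpointing_wrapper(
            model, checkpoint_wrapper_fn=wrapper_fn, check_fn=check_fn
        )
    return model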
Do we want to keep non-reentrant if its memory usage is greater than reentrant's?
Good point, let's just remove non-reentrant.
enable_half_reduce_in_fsdp: True # handles the reduction across devices in half precision

activation_checkpointing: True
activation_checkpointing_reentrant: True # false for non-reentrant
Can you make this True in the other configs, or just remove it as per the other comment?
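For reference, a hypothetical sketch of how a flag like `activation_checkpointing_reentrant` could be mapped to a checkpoint implementation; this is not the trainer's actual config-handling code, and `config` here is assumed to be a plain dict parsed from the YAML:

from functools import partial

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    checkpoint_wrapper,
)


def make_checkpoint_wrapper_fn(config: dict):
    # Map the YAML flag to the corresponding CheckpointImpl variant.
    reentrant = config.get("activation_checkpointing_reentrant", True)
    return partial(
        checkpoint_wrapper,
        offload_to_cpu=False,
        checkpoint_impl=(
            CheckpointImpl.REENTRANT if reentrant else CheckpointImpl.NO_REENTRANT
        ),
    )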
  auto_wrap_policy=partial(
      transformer_auto_wrap_policy,
-     transformer_layer_cls={TransformerEncoderLayer},
+     transformer_layer_cls={
+         TransformerEncoderLayer,
remove
@ankitade I thought we wanted to keep this for the ImageNet validation?
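For background on this hunk, a minimal sketch of how transformer_auto_wrap_policy is typically combined with FSDP so that each listed layer class becomes its own FSDP unit; `layer_classes` stands in for the set being edited above (e.g. {TransformerEncoderLayer}), and the helper name is hypothetical:

from functools import partial

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


def fsdp_wrap(model, layer_classes):
    auto_wrap_policy = partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls=set(layer_classes),
    )
    # Each submodule whose class is in layer_classes is wrapped in its own
    # FSDP instance; remaining parameters fall into the root wrapper.
    return FSDP(model, auto_wrap_policy=auto_wrap_policy)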
strategy: fsdp # can be changed to ddp or fsdp
seed: 1337

batch_size: 12
Suggested change:
- batch_size: 12
+ batch_size: 8
Summary:
Adds a 10B experiment config to the FLAVA native training script and fixes issues with the checkpoint wrapper caused by kwargs and re-entrant checkpointing.
Test plan:
torchrun --nproc_per_node=8 -m flava.native.train config=flava/native/configs/10b.yaml
Fixes #{issue number}