add 10b experiment to flava native and fix checkpoint wrapper #309
Conversation
Codecov Report — Base: 93.63% // Head: 93.63% // Project coverage is effectively unchanged (+1 covered line on +1 line).
Additional details and impacted files:
@@            Coverage Diff             @@
##              main     #309   +/-   ##
=========================================
  Coverage    93.63%   93.63%
=========================================
  Files           54       54
  Lines         3235     3236     +1
=========================================
+ Hits          3029     3030     +1
  Misses         206      206
I know this is still a draft. Would you mind adding a proper summary and a test plan so that members of the team who do not have prior context on this work stream can understand? Thanks :)
@edward-io has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
    offload_to_cpu=False,
    checkpoint_impl=checkpoint_impl,
)
apply_activation_checkpointing_wrapper(
I don't like that this is the state we're in, but can we apply apply_ac_wrapper after the FSDP wrap when using non-reentrant, and before the FSDP wrap when using reentrant? That way both would work (having this difference is not good).
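For context, a minimal sketch of the two orderings being discussed, assuming the PyTorch 1.13-era checkpoint_wrapper API; this is illustrative only, and `model` and `layer_cls` are placeholders rather than names from this diff:

from functools import partial

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing_wrapper,
    checkpoint_wrapper,
)


def wrap_with_ac_and_fsdp(model, layer_cls, reentrant: bool):
    # Checkpoint only the transformer blocks (layer_cls is a placeholder class).
    check_fn = lambda submodule: isinstance(submodule, layer_cls)
    wrapper_fn = partial(
        checkpoint_wrapper,
        offload_to_cpu=False,
        checkpoint_impl=(
            CheckpointImpl.REENTRANT if reentrant else CheckpointImpl.NO_REENTRANT
        ),
    )
    if reentrant:
        # Reentrant: apply the activation-checkpointing wrapper first, then FSDP.
        apply_activation_checkpointing_wrapper(
            model, checkpoint_wrapper_fn=wrapper_fn, check_fn=check_fn
        )
        model = FSDP(model)
    else:
        # Non-reentrant: wrap with FSDP first, then apply the checkpointing wrapper.
        model = FSDP(model)
        apply_activation_checkpointing_wrapper(
            model, checkpoint_wrapper_fn=wrapper_fn, check_fn=check_fn
        )
    return model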
Do we want to keep non-reentrant if its memory usage is greater than reentrant's?
Good point, let's just remove non-reentrant.
enable_half_reduce_in_fsdp: True # handles the reduction across devices in half precision

activation_checkpointing: True
activation_checkpointing_reentrant: True # false for non-reentrant
Can you make this True in the other configs, or just remove it as per the other comment?
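For reference, a hypothetical sketch of how a flag like `activation_checkpointing_reentrant` could be mapped to a checkpoint implementation; this is not the trainer's actual config-handling code, and `config` here is assumed to be a plain dict parsed from the YAML:

from functools import partial

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    checkpoint_wrapper,
)


def make_checkpoint_wrapper_fn(config: dict):
    # Map the YAML flag to the corresponding CheckpointImpl variant.
    reentrant = config.get("activation_checkpointing_reentrant", True)
    return partial(
        checkpoint_wrapper,
        offload_to_cpu=False,
        checkpoint_impl=(
            CheckpointImpl.REENTRANT if reentrant else CheckpointImpl.NO_REENTRANT
        ),
    )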
  auto_wrap_policy=partial(
      transformer_auto_wrap_policy,
-     transformer_layer_cls={TransformerEncoderLayer},
+     transformer_layer_cls={
+         TransformerEncoderLayer,
remove
@ankitade I thought we wanted to keep this for the ImageNet validation?
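For background on this hunk, a minimal sketch of how transformer_auto_wrap_policy is typically combined with FSDP so that each listed layer class becomes its own FSDP unit; `layer_classes` stands in for the set being edited above (e.g. {TransformerEncoderLayer}), and the helper name is hypothetical:

from functools import partial

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


def fsdp_wrap(model, layer_classes):
    auto_wrap_policy = partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls=set(layer_classes),
    )
    # Each submodule whose class is in layer_classes is wrapped in its own
    # FSDP instance; remaining parameters fall into the root wrapper.
    return FSDP(model, auto_wrap_policy=auto_wrap_policy)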
strategy: fsdp # can be changed to ddp or fsdp
seed: 1337

batch_size: 12
Suggested change:
- batch_size: 12
+ batch_size: 8
Summary:
Adds a 10B experiment config to the FLAVA native training script and fixes issues with the checkpoint wrapper caused by kwargs and re-entrant checkpointing.
Test plan:
torchrun --nproc_per_node=8 -m flava.native.train config=flava/native/configs/10b.yaml
Fixes #{issue number}