
Update vit.py #274

Closed · wants to merge 1 commit

Conversation

LuYuchenOrRobert

In the official code of Vision Transformer, the final LayerNorm is applied after `self.transformer(x)` but before taking the mean or the class token from `x`. It is revised here to be consistent with the official version, since changing the position of the LayerNorm does affect the computed values (whether it affects model performance is unclear).

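For illustration, here is a minimal runnable sketch of the difference described above. The shapes are assumed for the example, with `x` standing in for the output of `self.transformer(x)`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 64
norm = nn.LayerNorm(dim)
x = torch.randn(2, 65, dim)   # stand-in for the output of self.transformer(x): (batch, tokens, dim)

# ordering before this PR: mean pool first, then apply the final LayerNorm
pool_then_norm = norm(x.mean(dim=1))

# ordering in the official ViT code: apply the final LayerNorm to every token, then mean pool
norm_then_pool = norm(x).mean(dim=1)

# the two orderings generally give different values
print(torch.allclose(pool_then_norm, norm_then_pool))  # False in general
```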
lucidrains (Owner) commented Aug 5, 2023

@LuYuchenOrRobert oh yes, this would affect mean pooling (but not cls token)

i should change all the ViTs (whichever ones are using global mean pooling) to have this 'correct' order
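A quick numerical check of this claim, with assumed shapes: `LayerNorm` normalizes each token independently, so it commutes with selecting the cls token but not with mean pooling.

```python
import torch
import torch.nn as nn

dim = 64
norm = nn.LayerNorm(dim)
x = torch.randn(2, 65, dim)   # assumed (batch, tokens, dim) token sequence

# cls token: LayerNorm acts per token, so the order does not matter
print(torch.allclose(norm(x)[:, 0], norm(x[:, 0])))              # True

# mean pooling: the order does matter
print(torch.allclose(norm(x).mean(dim=1), norm(x.mean(dim=1))))  # False in general
```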

lucidrains added a commit that referenced this pull request Aug 9, 2023
lucidrains (Owner)

@LuYuchenOrRobert you want to see if 3e5d1be looks right to you? i'll finish converting the rest of the vision transformers using global mean pooling later this week

LuYuchenOrRobert (Author)

@lucidrains Sorry for my late response. It looks right now. Thank you for your contribution!

vivekh2000 commented Apr 21, 2024

Hi, @lucidrains. If you look at the forward function of the Transformer class in vit.py, it applies the layer norm after the Transformer block.
[screenshot of the Transformer class's forward function]
This means the norm operation is performed after every Transformer block, but it is only required after the last one, i.e., after this line,

x = self.transformer(x)

because the Attention and FeedForward classes already have their own LayerNorm.

I also think the call to to_latent(x) does nothing, since it is just nn.Identity, so it could also be removed. Thank you for the simple implementation. :)
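For reference, here is a simplified sketch of the structure under discussion. This is not the actual vit.py code; `TinyTransformer`, `PreNormAttention`, and `PreNormFeedForward` are hypothetical names. It shows each sub-block carrying its own pre-norm, a single `LayerNorm` after the last block, and `to_latent` as an `nn.Identity` pass-through.

```python
import torch
import torch.nn as nn

class PreNormAttention(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.norm = nn.LayerNorm(dim)   # the block's own pre-norm
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        x = self.norm(x)
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

class PreNormFeedForward(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),          # the block's own pre-norm
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)

class TinyTransformer(nn.Module):
    def __init__(self, dim, depth, heads=4, mlp_dim=128):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.ModuleList([PreNormAttention(dim, heads), PreNormFeedForward(dim, mlp_dim)])
            for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(dim)   # applied once, after the last block
        self.to_latent = nn.Identity()  # no-op pass-through, as noted above

    def forward(self, x):
        for attn, ff in self.layers:
            x = attn(x) + x
            x = ff(x) + x
        x = self.norm(x)                # single final norm, not one per block
        return self.to_latent(x)

x = torch.randn(2, 65, 64)
print(TinyTransformer(dim=64, depth=2)(x).shape)   # torch.Size([2, 65, 64])
```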
