A PyTorch implementation of the Transformer from the paper "Attention Is All You Need", supporting both the Post-LN (Post-LayerNorm) and Pre-LN (Pre-LayerNorm) variants.
Pre-LN applies LayerNorm to the input of every sublayer, whereas Post-LN applies it after the residual connection. The architecture proposed in the paper is Post-LN; however, the official implementation has since been changed to the Pre-LN version. Experimental results show that the Pre-LN Transformer converges faster, does not need learning-rate warm-up, and is less sensitive to hyperparameters. For more detail on the differences between the two, see the paper "On Layer Normalization in the Transformer Architecture".
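The difference between the two variants comes down to where LayerNorm sits relative to the residual connection. The snippet below is a minimal sketch of that idea, not the code from `transformer.ipynb`: the class name `ResidualSublayer` and the helper `feed_forward` are hypothetical, and only the flag name `pre_lnorm` is taken from this repository.

```python
import torch
import torch.nn as nn


class ResidualSublayer(nn.Module):
    """Hypothetical wrapper: residual connection + LayerNorm around one sublayer."""

    def __init__(self, d_model, sublayer, dropout=0.1, pre_lnorm=False):
        super().__init__()
        self.sublayer = sublayer            # e.g. self-attention or feed-forward
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.pre_lnorm = pre_lnorm

    def forward(self, x):
        if self.pre_lnorm:
            # Pre-LN: normalize the sublayer input, leave the residual path untouched.
            return x + self.dropout(self.sublayer(self.norm(x)))
        # Post-LN: add the residual first, then normalize the sum.
        return self.norm(x + self.dropout(self.sublayer(x)))


def feed_forward(d_model=512, d_ff=2048):
    """Position-wise feed-forward sublayer, used here only as a stand-in."""
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))


torch.manual_seed(0)
x = torch.randn(2, 5, 512)  # (batch, sequence length, hidden size)

post_ln = ResidualSublayer(512, feed_forward(), pre_lnorm=False).eval()
pre_ln = ResidualSublayer(512, feed_forward(), pre_lnorm=True).eval()

with torch.no_grad():
    y_post = post_ln(x)
    y_pre = pre_ln(x)
```

In the Post-LN case the output of every sublayer is normalized, while in the Pre-LN case the residual path carries the unnormalized input straight through, which is the property the paper above links to more stable training.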
The model is trained on the small English-German dataset from the WMT 2016 multimodal task, loaded through torchtext.
- Python3
- PyTorch >= 1.2.0
- torchtext
- spacy
- nltk
- tqdm
- Beam search is not supported.
- Label smoothing is not implemented.
- BPE (byte-pair encoding) is not used.
- Run `transformer.ipynb` to download the dataset and train the model.
- Change the flag `pre_lnorm` to choose between Pre-LN and Post-LN.
- Parameter settings
  - hidden size: 512
  - feed-forward size: 2048
  - number of heads: 8
  - layers: 6
  - warm-up steps: 2000
  - batch size: 128
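To make the warm-up setting concrete, here is a hedged sketch of how these values could feed a learning-rate schedule. It assumes the inverse-square-root schedule from "Attention Is All You Need"; the notebook may use a different one. The names `config` and `noam_lr`, the Adam hyperparameters, and the `nn.Linear` stand-in model are all illustrative assumptions.

```python
import torch

# Settings listed above, gathered in a hypothetical config dict.
config = {
    "d_model": 512,       # hidden size
    "d_ff": 2048,         # feed-forward size
    "n_heads": 8,         # 512 / 8 = 64 dimensions per head
    "n_layers": 6,
    "warmup_steps": 2000,
    "batch_size": 128,
}


def noam_lr(step, d_model=512, warmup_steps=2000):
    """Assumed schedule: linear warm-up, then inverse-square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Stand-in model; a real run would use the Transformer instead.
model = torch.nn.Linear(config["d_model"], config["d_model"])

# Base lr of 1.0 so that the scheduler's factor is the actual learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda s: noam_lr(s + 1, config["d_model"], config["warmup_steps"]),
)

# In a training loop, call optimizer.step() and then scheduler.step() once per batch.
peak_lr = noam_lr(config["warmup_steps"])
```

Under this assumed schedule the learning rate rises linearly for the first 2000 steps, peaks at `peak_lr`, and decays afterwards. With `pre_lnorm` enabled, the paper cited above suggests the warm-up stage can be dropped, which would amount to using a plain optimizer without the scheduler.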
Here's an example from the test data:
- source
eine frau verwendet eine bohrmaschine während ein mann sie fotografiert .
- gold
a woman uses a drill while another man takes her picture .
- inference
a woman uses an electric drill as a man takes a picture .
- Label smoothing
- Attention visualization