Commit

Merge branch 'master' of github.com:jxbz/fromage
jxbz committed Jun 9, 2020
2 parents 13d114f + 14091e5 commit 6457498
Showing 4 changed files with 33 additions and 88 deletions.
29 changes: 14 additions & 15 deletions README.md
@@ -18,9 +18,11 @@ Fromage 🧀 optimiser
To get started with Fromage in your Pytorch code, copy the file `fromage.py` into your project directory, then write:
```
from fromage import Fromage
- optimizer = Fromage(net.parameters(), lr=0.01)
+ optimizer = Fromage(net.parameters(), lr=0.01, p_bound=None)
```
- We found an initial learning rate of 0.01 worked well in all experiments except model fine-tuning, where we used 0.001. You may want to experiment with learning rate decay schedules.
+ An initial learning rate of 0.01 has worked well in all our experiments except model fine-tuning, where 0.001 worked well. Decaying the learning rate when the loss plateaus is a good idea.

+ On some benchmarks, Fromage heavily overfit the training set. We were able to control this behaviour by setting the `p_bound` regularisation flag. This constrains the norm of each layer's weights to lie within a factor of `p_bound` times its initial value.
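
As a minimal sketch of how the new `p_bound` option and the plateau-based decay advice above might be used together (not taken from the diff itself): `net`, `loader` and `criterion` below are placeholders, `p_bound=1.0` simply mirrors the transformer command later in this commit, and we assume Fromage follows the standard `torch.optim.Optimizer` interface so that a stock scheduler such as `ReduceLROnPlateau` composes with it.
```
import torch
from fromage import Fromage

# Placeholders for illustration only: swap in your own model, data loader and loss.
net = torch.nn.Linear(10, 2)
loader = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(10)]
criterion = torch.nn.CrossEntropyLoss()

# p_bound=1.0 constrains each layer's weight norm to within 1x its initial value;
# this mirrors the --p_bound 1.0 setting in the transformer command further down.
optimizer = Fromage(net.parameters(), lr=0.01, p_bound=1.0)

# One way to decay the learning rate when the loss plateaus is a stock scheduler.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=5)

for epoch in range(20):
    epoch_loss = 0.0
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(net(x), y)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    scheduler.step(epoch_loss)  # reduce the lr if the loss has stopped improving
```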

## About this repository

@@ -37,24 +39,21 @@ If something isn't clear or isn't working, let us know in the *Issues section* o
Here is the structure of this repository.

.
- ├── classify-cifar/ # CIFAR-10 classification experiments.
- ├── classify-imagenet/ # Imagenet classification experiments. Coming soon! 🕒
- ├── classify-mnist/ # MNIST classification experiments.
- ├── finetune-transformer/ # Transformer fine-tuning experiments.
- ├── generate-cifar/ # CIFAR-10 class-conditional GAN experiments.
- ├── make-plots/ # Code to reproduce the figures in the paper.
- ├── LICENSE # The license on our algorithm.
- ├── README.md # The very page you're reading now.
- └── fromage.py # Pytorch code for the Fromage optimiser.
+ ├── classify-cifar/ # CIFAR-10 classification experiments.
+ ├── classify-imagenet/ # Imagenet classification experiments.
+ ├── classify-mnist/ # MNIST classification experiments.
+ ├── transformer-wikitext2/ # Transformer training experiments.
+ ├── generate-cifar/ # CIFAR-10 class-conditional GAN experiments.
+ ├── make-plots/ # Code to reproduce the figures in the paper.
+ ├── LICENSE # The license on our algorithm.
+ ├── README.md # The very page you're reading now.
+ └── fromage.py # Pytorch code for the Fromage optimiser.

- Check back in a few days if the code you're after is missing. We're currently cleaning and posting it.

## Acknowledgements

- This research was supported by [Caltech](https://www.caltech.edu/) and [NVIDIA](https://www.nvidia.com/).
- Our code is written in [Pytorch](https://pytorch.org/).
- Our GAN implementation is based on a codebase by [Jiahui Yu](http://jiahuiyu.com/).
- - Our Transformer code is from [🤗 Transformers](https://github.com/huggingface/transformers).
+ - Our Transformer code is from the [Pytorch example](https://github.com/pytorch/examples/tree/master/word_language_model).
- Our CIFAR-10 classification code is originally by [kuangliu](https://github.com/kuangliu/pytorch-cifar).
- Our MNIST code was originally forked from the [Pytorch example](https://github.com/pytorch/examples/tree/master/mnist).
- See [here](https://arxiv.org/abs/1708.03888) and [here](https://people.eecs.berkeley.edu/~youyang/publications/batch32k.pdf) for closely related work by [Yang You](https://people.eecs.berkeley.edu/~youyang/), [Igor Gitman](https://scholar.google.com/citations?user=8r9aWLIAAAAJ&hl=en) and [Boris Ginsburg](https://scholar.google.com/citations?user=7BRYaGcAAAAJ&hl=nl).
14 changes: 10 additions & 4 deletions classify-imagenet/README.md
@@ -1,8 +1,14 @@
- ## Requirements
+ <h1 align="center">
+ Fromage 🧀 optimiser
+ </h1>

+ ## Imagenet classification experiments

+ ### Requirements
- [PyTorch](http://pytorch.org)
- [NVIDIA APEX](https://github.com/NVIDIA/apex#quick-start)

- ## Data Preparation
+ ### Data Preparation
Download the ImageNet 2012 dataset and structure the dataset under
train and val subfolders. You can follow [this page](https://github.com/pytorch/examples/tree/master/imagenet#requirements)
to structure the dataset. The data directory should be in the form:
@@ -17,11 +23,11 @@ to structure the dataset. The data directory should be in the form:
├── n01443537/
├── ...

- ## COMMANDS
+ ### Commands
```
cd classify-imagenet
python -m torch.distributed.launch --nproc_per_node=8 train_imagenet.py --data $DATA_DIR --results_dir $RESULTS_DIR \
--save $EXPR_NAME --optimizer fromage --learning_rate 1e-2 --seed 0
```
Above, `$DATA_DIR` refers to the dataset directory path, `$RESULTS_DIR` is the results directory, and `$EXPR_NAME` gives
a name for the experiment.
14 changes: 2 additions & 12 deletions generate-cifar/README.md
@@ -8,19 +8,9 @@ The following Python packages are required: numpy, torch, torchvision, tqdm.

An example job is
```
- python main.py --optim fromage --lrG 0.01 --lrD 0.01 --epochs 121 --seed 0
+ python main.py --seed 0 --optim fromage --initial_lr 0.01
```
- See inside `main.py` for additional command line arguments.

- ## Results

- Running `sh batch.sh`, we obtain the following results:

- | | train FID | test FID |
- |---------|------------|------------|
- | Fromage | 16.4 ± 0.5 | 16.3 ± 0.8 |
- | Adam | 19.1 ± 0.9 | 19.4 ± 1.1 |
- | SGD | 36.4 ± 2.5 | 36.7 ± 2.7 |
+ See inside `batch.sh` for the settings used in the paper.

## Acknowledgements
- The self-attention block implementation is originally by https://github.com/zhaoyuzhi.
64 changes: 7 additions & 57 deletions transformer-wikitext2/README.md
@@ -1,61 +1,11 @@
- # Word-level language modeling RNN
+ <h1 align="center">
+ Fromage 🧀 optimiser
+ </h1>

- This example trains a multi-layer RNN (Elman, GRU, or LSTM) on a language modeling task.
- By default, the training script uses the Wikitext-2 dataset, provided.
- The trained model can then be used by the generate script to generate new text.
+ ## Transformer training on Wikitext-2

- ```bash
- python main.py --cuda --epochs 6 # Train a LSTM on Wikitext-2 with CUDA
- python main.py --cuda --epochs 6 --tied # Train a tied LSTM on Wikitext-2 with CUDA
- python main.py --cuda --epochs 6 --model Transformer --lr 5
- # Train a Transformer model on Wikitext-2 with CUDA
- python main.py --cuda --tied # Train a tied LSTM on Wikitext-2 with CUDA for 40 epochs
- python generate.py # Generate samples from the trained LSTM model.
- python generate.py --cuda --model Transformer
- # Generate samples from the trained Transformer model.
+ This codebase is from the [Pytorch example](https://github.com/pytorch/examples/tree/master/word_language_model). To run the training script, use a command like:
```

- The model uses the `nn.RNN` module (and its sister modules `nn.GRU` and `nn.LSTM`)
- which will automatically use the cuDNN backend if run on CUDA with cuDNN installed.

- During training, if a keyboard interrupt (Ctrl-C) is received,
- training is stopped and the current model is evaluated against the test dataset.

- The `main.py` script accepts the following arguments:

- ```bash
- optional arguments:
- -h, --help show this help message and exit
- --data DATA location of the data corpus
- --model MODEL type of recurrent net (RNN_TANH, RNN_RELU, LSTM, GRU)
- --emsize EMSIZE size of word embeddings
- --nhid NHID number of hidden units per layer
- --nlayers NLAYERS number of layers
- --lr LR initial learning rate
- --clip CLIP gradient clipping
- --epochs EPOCHS upper epoch limit
- --batch_size N batch size
- --bptt BPTT sequence length
- --dropout DROPOUT dropout applied to layers (0 = no dropout)
- --decay DECAY learning rate decay per epoch
- --tied tie the word embedding and softmax weights
- --seed SEED random seed
- --cuda use CUDA
- --log-interval N report interval
- --save SAVE path to save the final model
- --onnx-export path to export the final model in onnx format
- --transformer_head N the number of heads in the encoder/decoder of the transformer model
- --transformer_encoder_layers N the number of layers in the encoder of the transformer model
- --transformer_decoder_layers N the number of layers in the decoder of the transformer model
- --transformer_d_ff N the number of nodes on the hidden layer in feed forward nn
- ```

- With these arguments, a variety of models can be tested.
- As an example, the following arguments produce slower but better models:

- ```bash
- python main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40
- python main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied
- python main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --epochs 40
- python main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --epochs 40 --tied
+ python main.py --cuda --epochs 20 --model Transformer --optim fromage --lr 0.01 --p_bound 1.0
```
+ We provide the shell script `batch.sh` to run multiple experiments.
