diff --git a/README.md b/README.md
index c73de25..8cf386e 100755
--- a/README.md
+++ b/README.md
@@ -18,9 +18,11 @@ Fromage 🧀 optimiser
To get started with Fromage in your Pytorch code, copy the file `fromage.py` into your project directory, then write:
```
from fromage import Fromage
-optimizer = Fromage(net.parameters(), lr=0.01)
+optimizer = Fromage(net.parameters(), lr=0.01, p_bound=None)
```
-We found an initial learning rate of 0.01 worked well in all experiments except model fine-tuning, where we used 0.001. You may want to experiment with learning rate decay schedules.
+An initial learning rate of 0.01 has worked well in all our experiments except model fine-tuning, where we used 0.001. Decaying the learning rate when the loss plateaus is a good idea.
+
+On some benchmarks, Fromage heavily overfit the training set. We were able to control this behaviour by setting the `p_bound` regularisation flag, which constrains the norm of each layer's weights to lie within a factor of `p_bound` times its initial value.
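+
+Putting this together, here is a minimal sketch of Fromage with `p_bound` and a plateau-based learning rate schedule. The toy model, the data, the `p_bound=1.0` value and the `ReduceLROnPlateau` settings are purely illustrative, not the settings from our experiments:
+```
+import torch
+from fromage import Fromage
+
+# Toy model and data, purely for illustration.
+model = torch.nn.Linear(20, 2)
+criterion = torch.nn.CrossEntropyLoss()
+inputs = torch.randn(64, 20)
+targets = torch.randint(0, 2, (64,))
+
+# p_bound constrains each layer's weight norm to a multiple of its initial value;
+# the value 1.0 is illustrative -- tune it for your own problem.
+optimizer = Fromage(model.parameters(), lr=0.01, p_bound=1.0)
+
+# One way to decay the learning rate when the loss plateaus.
+scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=5)
+
+for step in range(100):
+    optimizer.zero_grad()
+    loss = criterion(model(inputs), targets)
+    loss.backward()
+    optimizer.step()
+    scheduler.step(loss.item())  # reduce the lr if the loss stops improving
+```
+Any standard Pytorch learning rate scheduler should work with Fromage in the same way, since Fromage exposes its learning rate through `param_groups` like other Pytorch optimisers.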
## About this repository
@@ -37,24 +39,21 @@ If something isn't clear or isn't working, let us know in the *Issues section* o
Here is the structure of this repository.
.
- ├── classify-cifar/ # CIFAR-10 classification experiments. ✅
- ├── classify-imagenet/ # Imagenet classification experiments. Coming soon! 🕒
- ├── classify-mnist/ # MNIST classification experiments. ✅
- ├── finetune-transformer/ # Transformer fine-tuning experiments. ✅
- ├── generate-cifar/ # CIFAR-10 class-conditional GAN experiments. ✅
- ├── make-plots/ # Code to reproduce the figures in the paper. ✅
- ├── LICENSE # The license on our algorithm. ✅
- ├── README.md # The very page you're reading now. ✅
- └── fromage.py # Pytorch code for the Fromage optimiser. ✅
+ ├── classify-cifar/ # CIFAR-10 classification experiments.
+ ├── classify-imagenet/ # Imagenet classification experiments.
+ ├── classify-mnist/ # MNIST classification experiments.
+ ├── transformer-wikitext2/ # Transformer training experiments.
+ ├── generate-cifar/ # CIFAR-10 class-conditional GAN experiments.
+ ├── make-plots/ # Code to reproduce the figures in the paper.
+ ├── LICENSE # The license on our algorithm.
+ ├── README.md # The very page you're reading now.
+ └── fromage.py # Pytorch code for the Fromage optimiser.
-Check back in a few days if the code you're after is missing. We're currently cleaning and posting it.
-
## Acknowledgements
- This research was supported by [Caltech](https://www.caltech.edu/) and [NVIDIA](https://www.nvidia.com/).
-- Our code is written in [Pytorch](https://pytorch.org/).
- Our GAN implementation is based on a codebase by [Jiahui Yu](http://jiahuiyu.com/).
-- Our Transformer code is from [🤗 Transformers](https://github.com/huggingface/transformers).
+- Our Transformer code is from the [Pytorch example](https://github.com/pytorch/examples/tree/master/word_language_model).
- Our CIFAR-10 classification code is originally by [kuangliu](https://github.com/kuangliu/pytorch-cifar).
- Our MNIST code was originally forked from the [Pytorch example](https://github.com/pytorch/examples/tree/master/mnist).
- See [here](https://arxiv.org/abs/1708.03888) and [here](https://people.eecs.berkeley.edu/~youyang/publications/batch32k.pdf) for closely related work by [Yang You](https://people.eecs.berkeley.edu/~youyang/), [Igor Gitman](https://scholar.google.com/citations?user=8r9aWLIAAAAJ&hl=en) and [Boris Ginsburg](https://scholar.google.com/citations?user=7BRYaGcAAAAJ&hl=nl).
diff --git a/classify-imagenet/README.md b/classify-imagenet/README.md
index ccb8c84..44dce8a 100644
--- a/classify-imagenet/README.md
+++ b/classify-imagenet/README.md
@@ -1,8 +1,14 @@
- ## Requirements
+
+Fromage 🧀 optimiser
+
+
+## Imagenet classification experiments
+
+### Requirements
- [PyTorch](http://pytorch.org)
- [NVIDIA APEX](https://github.com/NVIDIA/apex#quick-start)
-## Data Preparation
+### Data Preparation
Download the ImageNet 2012 dataset and structure the dataset under
train and val subfolders. You can follow [this page](https://github.com/pytorch/examples/tree/master/imagenet#requirements)
to structure the dataset. The data directory should be in the form:
@@ -17,11 +23,11 @@ to structure the dataset. The data directory should be in the form:
├── n01443537/
├── ...
- ## COMMANDS
+### Commands
```
cd classify-imagenet
python -m torch.distributed.launch --nproc_per_node=8 train_imagenet.py --data $DATA_DIR --results_dir $RESULTS_DIR \
--save $EXPR_NAME --optimizer fromage --learning_rate 1e-2 --seed 0
```
Above `$DATA_DIR` refers to the dataset directory path, `$RESULTS_DIR` is the results directory with `$EXPR_NAME` giving
-a name for the experiment.
\ No newline at end of file
+a name for the experiment.
diff --git a/generate-cifar/README.md b/generate-cifar/README.md
index 576a6b7..52fcee9 100755
--- a/generate-cifar/README.md
+++ b/generate-cifar/README.md
@@ -8,19 +8,9 @@ The following Python packages are required: numpy, torch, torchvision, tqdm.
An example job is
```
-python main.py --optim fromage --lrG 0.01 --lrD 0.01 --epochs 121 --seed 0
+python main.py --seed 0 --optim fromage --initial_lr 0.01
```
-See inside `main.py` for additional command line arguments.
-
-## Results
-
-Runnning `sh batch.sh`, we obtain the following results:
-
-| | train FID | test FID |
-|---------|------------|------------|
-| Fromage | 16.4 ± 0.5 | 16.3 ± 0.8 |
-| Adam | 19.1 ± 0.9 | 19.4 ± 1.1 |
-| SGD | 36.4 ± 2.5 | 36.7 ± 2.7 |
+See inside `batch.sh` for the settings used in the paper.
## Acknowledgements
- The self attention block implementation is originally by https://github.com/zhaoyuzhi.
diff --git a/transformer-wikitext2/README.md b/transformer-wikitext2/README.md
index db23265..e94417a 100644
--- a/transformer-wikitext2/README.md
+++ b/transformer-wikitext2/README.md
@@ -1,61 +1,11 @@
-# Word-level language modeling RNN
+
+Fromage 🧀 optimiser
+
-This example trains a multi-layer RNN (Elman, GRU, or LSTM) on a language modeling task.
-By default, the training script uses the Wikitext-2 dataset, provided.
-The trained model can then be used by the generate script to generate new text.
+## Transformer training on Wikitext-2
-```bash
-python main.py --cuda --epochs 6 # Train a LSTM on Wikitext-2 with CUDA
-python main.py --cuda --epochs 6 --tied # Train a tied LSTM on Wikitext-2 with CUDA
-python main.py --cuda --epochs 6 --model Transformer --lr 5
- # Train a Transformer model on Wikitext-2 with CUDA
-python main.py --cuda --tied # Train a tied LSTM on Wikitext-2 with CUDA for 40 epochs
-python generate.py # Generate samples from the trained LSTM model.
-python generate.py --cuda --model Transformer
- # Generate samples from the trained Transformer model.
+This codebase is from the [Pytorch example](https://github.com/pytorch/examples/tree/master/word_language_model). To run the training script, use a command like:
```
-
-The model uses the `nn.RNN` module (and its sister modules `nn.GRU` and `nn.LSTM`)
-which will automatically use the cuDNN backend if run on CUDA with cuDNN installed.
-
-During training, if a keyboard interrupt (Ctrl-C) is received,
-training is stopped and the current model is evaluated against the test dataset.
-
-The `main.py` script accepts the following arguments:
-
-```bash
-optional arguments:
- -h, --help show this help message and exit
- --data DATA location of the data corpus
- --model MODEL type of recurrent net (RNN_TANH, RNN_RELU, LSTM, GRU)
- --emsize EMSIZE size of word embeddings
- --nhid NHID number of hidden units per layer
- --nlayers NLAYERS number of layers
- --lr LR initial learning rate
- --clip CLIP gradient clipping
- --epochs EPOCHS upper epoch limit
- --batch_size N batch size
- --bptt BPTT sequence length
- --dropout DROPOUT dropout applied to layers (0 = no dropout)
- --decay DECAY learning rate decay per epoch
- --tied tie the word embedding and softmax weights
- --seed SEED random seed
- --cuda use CUDA
- --log-interval N report interval
- --save SAVE path to save the final model
- --onnx-export path to export the final model in onnx format
- --transformer_head N the number of heads in the encoder/decoder of the transformer model
- --transformer_encoder_layers N the number of layers in the encoder of the transformer model
- --transformer_decoder_layers N the number of layers in the decoder of the transformer model
- --transformer_d_ff N the number of nodes on the hidden layer in feed forward nn
-```
-
-With these arguments, a variety of models can be tested.
-As an example, the following arguments produce slower but better models:
-
-```bash
-python main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40
-python main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied
-python main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --epochs 40
-python main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --epochs 40 --tied
+python main.py --cuda --epochs 20 --model Transformer --optim fromage --lr 0.01 --p_bound 1.0
```
+We provide the shell script `batch.sh` to run multiple experiments.