diff --git a/README.md b/README.md
index c73de25..8cf386e 100755
--- a/README.md
+++ b/README.md
@@ -18,9 +18,11 @@ Fromage 🧀 optimiser
 To get started with Fromage in your Pytorch code, copy the file `fromage.py` into your project directory, then write:
 ```
 from fromage import Fromage
-optimizer = Fromage(net.parameters(), lr=0.01)
+optimizer = Fromage(net.parameters(), lr=0.01, p_bound=None)
 ```
-We found an initial learning rate of 0.01 worked well in all experiments except model fine-tuning, where we used 0.001. You may want to experiment with learning rate decay schedules.
+An initial learning rate of 0.01 has worked well in all our experiments except model fine-tuning, where we used 0.001. Decaying the learning rate when the loss plateaus is a good idea.
+
+On some benchmarks, Fromage heavily overfit the training set. We were able to control this behaviour by setting the `p_bound` regularisation flag, which constrains the norm of each layer's weights to lie within a factor of `p_bound` of its initial value.
 
 ## About this repository
 
@@ -37,24 +39,21 @@ If something isn't clear or isn't working, let us know in the *Issues section* o
 Here is the structure of this repository.
 
     .
-    ├── classify-cifar/          # CIFAR-10 classification experiments. ✅
-    ├── classify-imagenet/       # Imagenet classification experiments. Coming soon! 🕒
-    ├── classify-mnist/          # MNIST classification experiments. ✅
-    ├── finetune-transformer/    # Transformer fine-tuning experiments. ✅
-    ├── generate-cifar/          # CIFAR-10 class-conditional GAN experiments. ✅
-    ├── make-plots/              # Code to reproduce the figures in the paper. ✅
-    ├── LICENSE                  # The license on our algorithm. ✅
-    ├── README.md                # The very page you're reading now. ✅
-    └── fromage.py               # Pytorch code for the Fromage optimiser. ✅
+    ├── classify-cifar/          # CIFAR-10 classification experiments.
+    ├── classify-imagenet/       # Imagenet classification experiments.
+    ├── classify-mnist/          # MNIST classification experiments.
+    ├── transformer-wikitext2/   # Transformer training experiments.
+    ├── generate-cifar/          # CIFAR-10 class-conditional GAN experiments.
+    ├── make-plots/              # Code to reproduce the figures in the paper.
+    ├── LICENSE                  # The license on our algorithm.
+    ├── README.md                # The very page you're reading now.
+    └── fromage.py               # Pytorch code for the Fromage optimiser.
 
-Check back in a few days if the code you're after is missing. We're currently cleaning and posting it.
-
 ## Acknowledgements
 
 - This research was supported by [Caltech](https://www.caltech.edu/) and [NVIDIA](https://www.nvidia.com/).
-- Our code is written in [Pytorch](https://pytorch.org/).
 - Our GAN implementation is based on a codebase by [Jiahui Yu](http://jiahuiyu.com/).
-- Our Transformer code is from [🤗 Transformers](https://github.com/huggingface/transformers).
+- Our Transformer code is from the [Pytorch example](https://github.com/pytorch/examples/tree/master/word_language_model).
 - Our CIFAR-10 classification code is orginally by [kuangliu](https://github.com/kuangliu/pytorch-cifar).
 - Our MNIST code was originally forked from the [Pytorch example](https://github.com/pytorch/examples/tree/master/mnist).
 - See [here](https://arxiv.org/abs/1708.03888) and [here](https://people.eecs.berkeley.edu/~youyang/publications/batch32k.pdf) for closely related work by [Yang You](https://people.eecs.berkeley.edu/~youyang/), [Igor Gitman](https://scholar.google.com/citations?user=8r9aWLIAAAAJ&hl=en) and [Boris Ginsburg](https://scholar.google.com/citations?user=7BRYaGcAAAAJ&hl=nl).
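Editor's note: to make the two new options from the hunk above concrete, here is a minimal, hypothetical usage sketch. It is not part of this patch or the repository; the toy model, the synthetic data, and the choice `p_bound=1.0` (borrowed from the transformer command later in this patch) are illustrative assumptions, and it simply combines the documented `Fromage(..., p_bound=...)` constructor with PyTorch's standard `ReduceLROnPlateau` scheduler to decay the learning rate when the loss plateaus.

```python
import torch
import torch.nn as nn
from fromage import Fromage  # assumes fromage.py has been copied into the project

torch.manual_seed(0)

# Placeholder model and synthetic data, purely to exercise the optimiser.
net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()
inputs = torch.randn(128, 20)
targets = torch.randint(0, 2, (128,))

# p_bound=1.0 caps each layer's weight norm at 1x its initial value;
# p_bound=None (the default) disables the constraint.
optimizer = Fromage(net.parameters(), lr=0.01, p_bound=1.0)

# Decay the learning rate when the monitored loss stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=5)

for epoch in range(50):
    optimizer.zero_grad()
    loss = criterion(net(inputs), targets)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())  # here the training loss stands in for a validation metric
```

The scheduler hyperparameters and the metric being monitored are arbitrary placeholders; for fine-tuning, the README above suggests starting from a learning rate of 0.001 instead.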
diff --git a/classify-imagenet/README.md b/classify-imagenet/README.md
index ccb8c84..44dce8a 100644
--- a/classify-imagenet/README.md
+++ b/classify-imagenet/README.md
@@ -1,8 +1,14 @@
- ## Requirements
+
+Fromage 🧀 optimiser
+
+## Imagenet classification experiments
+
+### Requirements
 - [PyTorch](http://pytorch.org)
 - [NVIDIA APEX](https://github.com/NVIDIA/apex#quick-start)
-## Data Preparation
+### Data Preparation
 Download the ImageNet 2012 dataset and structure the dataset under train and val subfloders. You can follow [this page](https://github.com/pytorch/examples/tree/master/imagenet#requirements)
 to structure the dataset. The data directory should be in the form:
@@ -17,11 +23,11 @@ to structure the dataset. The data directory should be in the form:
   ├── n01443537/
   ├── ...
 
- ## COMMANDS
+### Commands
 ```
 cd classify-imagenet
 python -m torch.distributed.launch --nproc_per_node=8 train_imagenet.py --data $DATA_DIR --results_dir $RESULTS_DIR \
 --save $EXPR_NAME --optimizer fromage --learning_rate 1e-2 --seed 0
 ```
 Above `$DATA_DIR` refers to the dataset directory path, `$RESULTS_DIR` is the results directory with `$EXPR_NAME` giving
-a name for the experiment.
\ No newline at end of file
+a name for the experiment.
diff --git a/generate-cifar/README.md b/generate-cifar/README.md
index 576a6b7..52fcee9 100755
--- a/generate-cifar/README.md
+++ b/generate-cifar/README.md
@@ -8,19 +8,9 @@ The following Python packages are required: numpy, torch, torchvision, tqdm.
 An example job is
 ```
-python main.py --optim fromage --lrG 0.01 --lrD 0.01 --epochs 121 --seed 0
+python main.py --seed 0 --optim fromage --initial_lr 0.01
 ```
-See inside `main.py` for additional command line arguments.
-
-## Results
-
-Runnning `sh batch.sh`, we obtain the following results:
-
-|         | train FID  | test FID   |
-|---------|------------|------------|
-| Fromage | 16.4 ± 0.5 | 16.3 ± 0.8 |
-| Adam    | 19.1 ± 0.9 | 19.4 ± 1.1 |
-| SGD     | 36.4 ± 2.5 | 36.7 ± 2.7 |
+See inside `batch.sh` for the settings used in the paper.
 
 ## Acknowledgements
 - The self attention block implementation is originally by https://github.com/zhaoyuzhi.
diff --git a/transformer-wikitext2/README.md b/transformer-wikitext2/README.md
index db23265..e94417a 100644
--- a/transformer-wikitext2/README.md
+++ b/transformer-wikitext2/README.md
@@ -1,61 +1,11 @@
-# Word-level language modeling RNN
+
+Fromage 🧀 optimiser
+
 
-This example trains a multi-layer RNN (Elman, GRU, or LSTM) on a language modeling task.
-By default, the training script uses the Wikitext-2 dataset, provided.
-The trained model can then be used by the generate script to generate new text.
+## Transformer training on Wikitext-2
 
-```bash
-python main.py --cuda --epochs 6                      # Train a LSTM on Wikitext-2 with CUDA
-python main.py --cuda --epochs 6 --tied               # Train a tied LSTM on Wikitext-2 with CUDA
-python main.py --cuda --epochs 6 --model Transformer --lr 5
-                                                      # Train a Transformer model on Wikitext-2 with CUDA
-python main.py --cuda --tied                          # Train a tied LSTM on Wikitext-2 with CUDA for 40 epochs
-python generate.py                                    # Generate samples from the trained LSTM model.
-python generate.py --cuda --model Transformer
-                                                      # Generate samples from the trained Transformer model.
+This codebase is from the [Pytorch example](https://github.com/pytorch/examples/tree/master/word_language_model). To run the training script, use a command like:
 ```
-
-The model uses the `nn.RNN` module (and its sister modules `nn.GRU` and `nn.LSTM`)
-which will automatically use the cuDNN backend if run on CUDA with cuDNN installed.
-
-During training, if a keyboard interrupt (Ctrl-C) is received,
-training is stopped and the current model is evaluated against the test dataset.
-
-The `main.py` script accepts the following arguments:
-
-```bash
-optional arguments:
-  -h, --help                      show this help message and exit
-  --data DATA                     location of the data corpus
-  --model MODEL                   type of recurrent net (RNN_TANH, RNN_RELU, LSTM, GRU)
-  --emsize EMSIZE                 size of word embeddings
-  --nhid NHID                     number of hidden units per layer
-  --nlayers NLAYERS               number of layers
-  --lr LR                         initial learning rate
-  --clip CLIP                     gradient clipping
-  --epochs EPOCHS                 upper epoch limit
-  --batch_size N                  batch size
-  --bptt BPTT                     sequence length
-  --dropout DROPOUT               dropout applied to layers (0 = no dropout)
-  --decay DECAY                   learning rate decay per epoch
-  --tied                          tie the word embedding and softmax weights
-  --seed SEED                     random seed
-  --cuda                          use CUDA
-  --log-interval N                report interval
-  --save SAVE                     path to save the final model
-  --onnx-export                   path to export the final model in onnx format
-  --transformer_head N            the number of heads in the encoder/decoder of the transformer model
-  --transformer_encoder_layers N  the number of layers in the encoder of the transformer model
-  --transformer_decoder_layers N  the number of layers in the decoder of the transformer model
-  --transformer_d_ff N            the number of nodes on the hidden layer in feed forward nn
-```
-
-With these arguments, a variety of models can be tested.
-As an example, the following arguments produce slower but better models:
-
-```bash
-python main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40
-python main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied
-python main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --epochs 40
-python main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --epochs 40 --tied
+python main.py --cuda --epochs 20 --model Transformer --optim fromage --lr 0.01 --p_bound 1.0
 ```
+We provide the shell script `batch.sh` to run multiple experiments.