
iclr2020-notes

personal notes from ICLR2020

General Notes

Main links (official)

  • Main page: https://iclr.cc/

  • Virtual format notes (medium)

  • All contents: https://iclr.cc/virtual_2020/

  • Reflections after the conference (medium)

Comments about format

Due to well-known circumstances, ICLR'20 was held online worldwide from 26 to 30 April. A week after the conference ended, the organizers made all of the materials public: https://iclr.cc/virtual_2020/

All of the papers, both posters and orals, were presented with 5-minute videos; selected works got up to 15 minutes. There were five poster sessions every day, and each work was presented in exactly two of them. During their presentation time the authors held meetings in Zoom rooms and could be asked questions directly. But for me a much simpler way to ask was to write in Rocket.Chat (each poster had its own channel there) without waiting for a particular time.

Oral presentations were cool and useful (not always, but often), but I missed the posters in the form of 'virtual pieces of paper': it is much simpler to understand the idea of a work by seeing it entirely in one image instead of scrolling through slides.

Me at the conference

Of the ~650 accepted papers I looked at ~100, summarizing them below in this document (in most cases with just a minor comment or mention). I tried to group them by content where possible.

This time I was unable to look at all the interesting papers I wanted to, which means that each section of this document can miss important works in the domain. I still have ~70 important papers in the queue to watch later... Maybe one day I will take a look and summarize them, maybe...

In general I tried to broaden my horizons in many topics like GANs, NAS, optimization, DL theory, adversarial examples, and vision/NLP/audio/video processing. I entirely skipped pruning/quantization and overly theoretical works.

If a paper has ❓, it means the paper looks good, but I didn't understand something and I am not confident in the comment I wrote here.

Key takeaways (TL;DR)

Attack & Defence

  • You can now steal a model through an API which returns only its predictions (and save a lot of money $$$) - this was checked for BERT-like NLP models; could the same be done for CV tasks?

  • There is a way to defend against physically-realizable attacks (adversarial attacks carried out in the real world)

  • Proper usage of misclassified examples can improve robustness to adversarial examples
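The model-stealing point above can be sketched at toy scale: the attacker only ever sees the API's output probabilities, yet distilling a student on those soft labels recovers the victim. Everything here (the linear victim, the query distribution, all constants) is my own made-up illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical victim behind a prediction-only API (a linear model for the toy)
w_victim = np.array([2.0, -1.0])

def api_predict(X):
    """All the attacker can see: the victim's output probabilities."""
    return 1.0 / (1.0 + np.exp(-X @ w_victim))

# The attacker sends unlabeled queries...
X = rng.normal(size=(2000, 2))
soft_labels = api_predict(X)

# ...and distills a student on the returned probabilities (soft labels)
w_student = np.zeros(2)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w_student))
    w_student -= 0.5 * X.T @ (p - soft_labels) / len(X)  # cross-entropy gradient

print(np.round(w_student, 1))  # ≈ the victim's weights [ 2. -1.]
```

Matching soft labels is a much stronger training signal than hard 0/1 labels, which is why prediction-only APIs leak so much.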

Optimization

  • There are methods to estimate the generalization gap knowing only the train error (metric). The formula is test_loss < train_loss + G, where G is the generalization gap, and G can be estimated on train data only. In theory this means you can use more training data by merging in the validation set, as it is no longer needed to select the best epoch or decide when to stop training. Merging validation into training may be useful for critical applications where each point of accuracy really matters (increasing your data by 20% does not give you much in most cases - you would need a 10x increase to do significantly better). Anyway, some works estimating G are in this report, see below.

  • Exponential learning rate schedules also work! Yes, you can increase the lr to, say, 1e22 and you will still converge. Moreover, the authors of that work found a way to train with an exponential lr with the same quality as a constant/cosine schedule. The key result: there are always many good lr schedules which lead to the same results. However, the questions "so which one of them is the best?" and "which lr schedule generalizes well across problems?" remain open.

  • The ICLR community seems to embrace tiny datasets (hello, MNIST and CIFAR) and the AlexNet model, which makes research simpler and faster, but often useless for applications, due to lack of generalization to larger datasets
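A minimal sketch of why an exponentially growing learning rate can still converge: for a scale-invariant loss (the kind induced by normalization layers), scaling the weights by c scales the gradient by 1/c, so the growing weight norm offsets the growing lr. The toy loss below is my own illustration, not the paper's construction.

```python
import numpy as np

u = np.array([0.0, 1.0])  # target direction (made-up toy problem)

def loss(w):
    # scale-invariant: depends only on the direction of w, like the
    # effective loss of weights sitting in front of a normalization layer
    return 1.0 - w @ u / np.linalg.norm(w)

def grad(w):
    n = np.linalg.norm(w)
    return -(u - (w @ u) * w / n**2) / n

w0 = np.array([3.0, 4.0])
print(np.isclose(loss(10 * w0), loss(w0)))  # True: scaling w changes nothing

# gradient descent with an exponentially GROWING learning rate:
# the weight norm grows too, and the two effects offset each other
w, lr = np.array([1.0, 0.0]), 0.1
for _ in range(100):
    w = w - lr * grad(w)
    lr *= 1.05
print(loss(w) < 0.5)  # True: the direction of w still converged toward u
```

The gradient of a scale-invariant loss is always orthogonal to w, so every step also inflates ||w||, which self-stabilizes the effective step size on the sphere.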

DL Theory

  • Backpropaganda: several conventional myths disproved empirically: 1) suboptimal local minima DO exist (that's why initialization matters); 2) l2 norm regularization DECREASES performance; 3) rank minimization DECREASES performance

  • Vanilla gradient descent is optimal under certain assumptions about the problem (which means that no other gradient-based method can converge faster!), but in practice these assumptions are not (never? =)) satisfied, so your Adam will work better. One of the works proves that under a weaker set of assumptions clipped SGD is optimal (most likely it is still useless for real problems).

GANs

  • (insane) Faces can be synthesized from audio - it really works to some extent

  • DeepMind proved that high-fidelity speech can be generated with GANs (which is faster than the autoregressive WaveNet)

  • Stable Rank Normalization: Spectral Norm + Rank Norm to improve generalization (decrease generalization gap)

NAS

  • Robust DARTS - super simple approach to train DARTS much better by early stopping + continuation with more regularization (see details below)

Layers

  • Network deconvolution - a modification of vanilla convolution which is free at inference time and improves robustness for all conventional models where it was applied. The only cost is increased training time (up to 10x in my test). See details below

NLP

  • Just see NLP section, everything there is good:)

Summaries and comments

Attack and Defence

Attribution

  • A method which explains how and why particular regions of the input image are responsible for the classifier's prediction. The approach is to train an external generator which produces a degraded image similar to the input (e.g. if the classifier outputs probability 0.9, the generator receives the input image and the desire to make the classifier output probability 0.1), plus a discriminator, trained with a KL-divergence loss against the actual classifier output and an L1 distance loss to the original input. The method is applicable to cases where the how and why of the classifier matter (e.g. medical applications). However, it requires training an additional model (a GAN, so the training won't be easy), unlike most attribution methods, which can be applied directly to the classifier without any additional training or external models. Explanation by Progressive Exaggeration

  • A new attribution method where unimportant features on the feature map are replaced with noise. It converges in ~10 iterations; the authors also approximated it with a single NN which does the same in a single pass. It seems to really explain NN predictions (not just exploit the structure of the image). Restricting the Flow: Information Bottlenecks for Attribution

Sanity Check for attribution methods

Alpha-Beta LRP fails the sanity check (which may actually be an important property??? since we do not need to train a network)

Robustness

Optimization

Generalization gap estimation

This part is about estimating the test loss knowing the train loss, i.e. $L_{test} = L_{train} + G$, where $G$ is the generalization gap. In theory, if the estimate is reliable, it allows merging the train and validation sets (so you have more data) and selecting the best model / best training epoch reliably without validation data.
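If some estimator of $G$ is trusted, epoch selection needs only train-side quantities; a toy sketch with entirely made-up numbers:

```python
# per-epoch train losses and estimated gaps G_hat (all numbers made up)
train_losses = [0.9, 0.5, 0.3, 0.1, 0.05]
g_hats       = [0.05, 0.1, 0.2, 0.45, 0.7]

# select the epoch by the *estimated* test loss, no validation set needed
est_test = [t + g for t, g in zip(train_losses, g_hats)]
best_epoch = min(range(len(est_test)), key=est_test.__getitem__)
print(best_epoch)  # → 2: lowest train loss alone would wrongly pick epoch 4
```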

[Generalization gap ends]

exponential lr schedules really work...

what about learning rates

DL theory

myths outline: suboptimal local minima exist; l2 norm regularization decreases performance; rank minimization decreases performance

how much position information is encoded?

  • higher depth is beneficial (slide from this paper's presentation) - the paper seems to provide some intuition explaining the examples, but I haven't checked it yet

Depth vs Width results

  • Vanilla gradient descent is theoretically optimal (surprise!) in speed of convergence against all other gradient-based methods. But in practice it is not (no surprise), because the theoretical constraints are almost never satisfied. In this work the authors do some analysis and derive that for less-constrained functions gradient descent with clipping is optimal. But again, they didn't compare with Adam and the others...

Vanilla Grad Descent is theoretically optimal (surprise!)

gradient clipping does not help against bad labels, but 'partial' clipping does
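The clipped update discussed above can be sketched on a toy quartic, whose gradient grows faster than any fixed smoothness constant allows (the function and constants are illustrative only):

```python
def clipped_gd_step(w, g, lr, clip):
    """One gradient-descent step with gradient-norm clipping."""
    if abs(g) > clip:
        g = g * (clip / abs(g))
    return w - lr * g

# f(w) = w**4: the gradient 4*w**3 grows cubically, so fixed-lr GD from a
# steep start can overshoot, while the clipped step length is at most lr*clip
w = 3.0
for _ in range(200):
    w = clipped_gd_step(w, 4 * w**3, lr=0.05, clip=1.0)
print(abs(w) < 0.3)  # True: converged near the minimum at 0
```

Clipping caps the step in the steep region (where the update moves at a constant rate toward the valley) and leaves the well-behaved flat region untouched.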

parameter-equivalent networks (for ReLU activation)


Audio


  • DeepMind made speech synthesis via GAN (and proved that high-fidelity speech synthesis with GANs is possible). The paper has several tricks: 1) G and D conditioned on linguistic features; 2) 44h of training data; 3) residual blocks with progressive dilation in G; 4) several discriminators; 5) additional unconditioned discriminators (checking realism only); 6) FID and KID computed from speech recognition model features to track training progress; 7) padding masks to generate longer samples (see the paper for details).

key contributions to success
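One way to see why progressive dilation matters for raw audio: stacked dilated convolutions grow the receptive field geometrically rather than linearly with depth. The kernel size and dilation pattern below are illustrative, not GAN-TTS's actual configuration:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of stride-1 dilated 1-D convolutions:
    each layer extends it by (kernel_size - 1) * dilation samples."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

print(receptive_field(3, [1, 2, 4, 8]))  # → 31 samples of context
print(receptive_field(3, [1, 1, 1, 1]))  # → 9: same depth, no dilation
```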

Video

Generative Models

generalization gap upper bound

  • RealnessGAN - instead of hard labels 0 and 1 for the GAN, treat realness as random variables A0 and A1. This seems to stabilize training, as the authors were able to train DCGAN at 1024x1024. Theoretical guarantees of convergence are proven. A0 and A1 were taken as different discrete distributions (so D has N output probabilities instead of a single one). Real or Not Real, that is the Question

RealnessGAN
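The "realness as a distribution" idea can be sketched with a KL objective between D's N output probabilities and the two anchor distributions; the anchors and the N=4 outcomes below are made up for illustration, not the paper's choices:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical anchor distributions over N=4 "realness" outcomes
A1 = [0.0, 0.1, 0.3, 0.6]  # anchor for real images
A0 = [0.6, 0.3, 0.1, 0.0]  # anchor for fake images

d_out = [0.1, 0.2, 0.3, 0.4]  # discriminator's N output probabilities
# D is trained to pull real samples toward A1 and fakes toward A0
print(kl(A1, d_out) < kl(A0, d_out))  # True: this output is nearer the "real" anchor
```

Replacing two hard targets with full distributions gives the discriminator a richer signal than a single real/fake scalar.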

NAS

  • The authors show that the DARTS method overfits on validation data and does not generalize well to test data for several search spaces. Poor generalization is highly correlated with the sharpness of the minima. The sharpness can be efficiently estimated by computing the largest eigenvalue of the full Hessian w.r.t. alpha on a randomly taken validation batch. We can then track this metric and do early stopping (DARTS-ES) when it starts to grow too rapidly. DARTS-ES already increases performance, but it can be improved further. The authors show that regularization (e.g. l2 norm) stabilizes the largest-eigenvalue metric, so they propose the following method: whenever a sharp increase of the eigenvalue is observed, the model goes back to the last checkpoint before that behaviour started, and training continues with higher regularization. This approach improves generalization significantly in many settings, including the original DARTS search space. Understanding and Robustifying Differentiable Architecture Search

robust darts takeaways
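The sharpness metric itself is cheap: the dominant Hessian eigenvalue can be estimated by power iteration. A sketch with a small explicit matrix standing in for the Hessian w.r.t. alpha (in practice one would use Hessian-vector products instead of a dense matrix):

```python
import numpy as np

def largest_eigenvalue(H, iters=100, seed=0):
    """Estimate the dominant eigenvalue of a symmetric matrix by power
    iteration, returning the Rayleigh quotient of the converged vector."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=H.shape[0])
    for _ in range(iters):
        v = H @ v
        v /= np.linalg.norm(v)
    return float(v @ H @ v)

# a "flat" vs a "sharp" minimum: track this number and stop early on a spike
flat  = np.diag([0.1, 0.2, 0.3])
sharp = np.diag([0.1, 0.2, 9.0])
print(largest_eigenvalue(flat), largest_eigenvalue(sharp))  # ≈ 0.3 and 9.0
```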

Representation

Vision (surprise!)

Layers

  • The authors replaced batchnorm with network deconvolution in the 10 most popular architectures and gained superior results on all of them; the operation can be calculated efficiently at a fraction of a conv layer's cost. There is also a connection with the brain (a special kind of vision operation). The intuition: the layer decorrelates neighbouring pixels and channels, which makes learning much easier for the NN. Network Deconvolution

what if...

deconvolution operation

deconvolution applied to image

deconvolution algorithm
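The decorrelation intuition can be sketched with ZCA whitening, a standard operation closely related to what network deconvolution applies to pixel/channel patches (this is a generic whitening sketch, not the paper's exact algorithm):

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten rows of X (samples x features): afterwards the feature
    covariance is ~identity, i.e. the features are decorrelated."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W

rng = np.random.default_rng(0)
# three strongly correlated features, like neighbouring pixels of an image
base = rng.normal(size=(1000, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(1000, 1)) for _ in range(3)])
Xw = zca_whiten(X)
print(np.round(np.cov(Xw.T), 2))  # ≈ identity matrix
```

With redundant inputs decorrelated up front, gradient descent no longer has to untangle the correlations itself, which is the claimed source of the speedup.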

  • Useful if you somehow know that your network should learn exact multiplication/addition of tensor elements (e.g. physics or math problems). Neural Arithmetic Units
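If I read the paper correctly, its multiplication unit computes out = prod_i (w_i * x_i + 1 - w_i) with weights pushed toward {0, 1}; a forward-pass-only sketch (training and the addition unit are omitted):

```python
import numpy as np

def nmu_forward(x, w):
    """Neural-multiplication-unit style forward pass:
    out = prod_i (w_i * x_i + 1 - w_i).
    w_i = 1 includes x_i in the product, w_i = 0 ignores it."""
    return float(np.prod(w * x + 1 - w))

x = np.array([3.0, 5.0, 7.0])
print(nmu_forward(x, np.array([1.0, 1.0, 0.0])))  # → 15.0, selects x0 * x1
print(nmu_forward(x, np.array([1.0, 1.0, 1.0])))  # → 105.0, full product
```

Because the weights gate an exact product rather than approximate it, the unit extrapolates to values far outside the training range, which ordinary MLPs cannot do.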

NLP

GPUs go brrr

ELECTRA pretraining

Anomaly detection

Speed improvements

Other (random topics)

discrepancy formula


resource requirements
