personal notes from ICLR2020
Main page https://iclr.cc/
All contents https://iclr.cc/virtual_2020/
Reflections after the conference (medium)
ICLR'20, due to well-known circumstances, was held online world-wide from 26 to 30 April. A week after the conference ended, the organizers made all of the materials public - https://iclr.cc/virtual_2020/
All of the papers were presented as 5-minute videos - both posters and orals. Outstanding works got up to 15 minutes. There were five poster sessions every day, and each work was presented in exactly two of them. During their presentation slots authors held meetings in Zoom rooms and could be asked questions directly. But for me a much simpler way to ask was to write in Rocket.Chat (each poster had its own channel there) without waiting for a particular time slot.
Oral presentations were cool and useful (not always, but often), but I missed posters in the form of 'virtual pieces of paper', because it is much simpler to grasp the idea of a work seeing it entirely in one image instead of scrolling through slides.
Of the ~650 accepted papers I looked at ~100, summarizing them (in most cases with just a minor comment or mention) below in this document. I tried to group them by topic where possible.
This time I was unable to look at all the interesting papers I wanted to, which means that each section of this document may miss important works in its domain. I still have ~70 important papers in the queue to watch later... Maybe one day I will take a look and summarize them, maybe...
In general I tried to broaden my horizons in many topics like GANs, NAS, optimization, DL theory, adversarial examples, and vision/NLP/audio/video processing. I entirely skipped pruning/quantization and overly theoretical works.
If a paper is marked with β, it means the paper looks good, but I didn't understand something and I am not confident in the comment I wrote here.
-
Now you can steal a model through an API that returns only its predictions (and save a lot of money $$$) - demonstrated for BERT-like NLP models; could the same be done for CV tasks?
-
There is a way to defend against physically realizable attacks (adversarial attacks carried out in the real world)
-
Proper usage of misclassified examples can improve robustness to adversarial examples
-
There are methods to estimate the generalization gap knowing only the train error (metric). The formula is
test_loss < train_loss + G
where G is the generalization gap. And this G can be estimated on train data only, which means that in theory you can use more training data by merging in the validation set, as it is no longer needed to select the best epoch or decide when to stop training. Merging validation into training may be useful for critical applications where each point of accuracy really matters (because increasing your data by 20% does not give you much in most cases - you need a ~10x increase to improve significantly). Anyway, some works estimating G are in this report, see below.
-
Exponential learning rate also works! Yes, you can increase the lr to, say, 1e22 and you will still converge. Moreover, the authors of that work found a way to train with an exponential lr schedule matching the quality of constant/cosine schedules. The key result: there are always many good lr schedules which lead to the same results. However, the questions "so which one of them is the best?" and "which lr schedule generalizes well across problems?" still remain.
-
The ICLR community seems to embrace tiny datasets (hello, MNIST and CIFAR) and the AlexNet model, which makes research simpler and faster, but often useless for applications due to a lack of generalization to larger datasets
-
Backpropaganda: several conventional myths disproved empirically. 1) suboptimal local minima DO exist (that's why initialization matters); 2) l2 norm regularization DECREASES performance; 3) rank minimization DECREASES performance
-
Vanilla gradient descent, under certain assumptions about the problem, is optimal (which means that no other gradient-based method can converge faster!), but in practice these assumptions are not (never? =)) satisfied, so your Adam will work better. One of the works proves that under a weaker set of assumptions clipped SGD is optimal (most likely it is still useless for real problems).
-
(insane) Faces can be synthesized from audio - it really works to some extent
-
DeepMind proved that high-fidelity speech can be generated with GANs (which is faster than the autoregressive WaveNet)
-
Stable Rank Normalization: Spectral Norm + Rank Norm to improve generalization (decrease generalization gap)
- Robust DARTS - super simple approach to train DARTS much better by early stopping + continuation with more regularization (see details below)
- Deconvolution - a modification of vanilla convolution which is free at inference time and improves robustness for all conventional models it was applied to. The only cost is increased training time (up to 10x in my test). See details below
- Just see NLP section, everything there is good:)
-
an optimal strategy for both adversarial attack and defence. Done with GAN training, showing that the found generator-attacker really outperforms other attack approaches. Optimal Strategies Against Generative Attacks
-
makes use of (originally) misclassified examples. The approach achieved SOTA on MNIST and CIFAR10 adversarial defence... Improving Adversarial Robustness Requires Revisiting Misclassified Examples
-
poisoning network predictions to fool the attacker and increase the number of attacks needed until success. Prediction Poisoning: Towards Defenses Against DNN Model Stealing Attacks
-
mixed-precision DNNs are more robust to adversarial attacks (than the original non-quantized nets). EMPIR: Ensembles of Mixed Precision Deep Networks for Increased Robustness Against Adversarial Attacks
-
an amazing presentation of how to attack an NLP model with a ridiculously simple strategy and get an only slightly inferior model (attack cost ~a few hundred dollars according to the authors). Thieves on Sesame Street! Model Extraction of BERT-based APIs - now the question is how we can do the same for computer vision tasks? :)
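A toy sketch of the extraction idea (my own illustration, not the authors' setup - the victim here is a hypothetical linear scorer, not a real BERT API): query the black box, record its soft predictions, fit a student to imitate them.

```python
# Toy sketch of model extraction: query a black-box "API" with inputs,
# collect its predictions, and fit a student model to imitate them.
# The victim here is a hypothetical linear scorer, not a real BERT API.
import numpy as np

rng = np.random.default_rng(0)
W_victim = rng.normal(size=(5, 3))          # secret weights of the victim

def victim_api(X):
    """Black box: returns only class probabilities, like a paid API."""
    logits = X @ W_victim
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Attacker: sample queries, record soft labels, then fit a student by
# least squares on logit-like targets (a crude stand-in for training).
X_query = rng.normal(size=(2000, 5))
probs = victim_api(X_query)
targets = np.log(probs)                      # logits up to a per-row shift
W_student, *_ = np.linalg.lstsq(X_query, targets, rcond=None)

# The stolen student agrees with the victim on fresh queries.
X_test = rng.normal(size=(500, 5))
agree = np.mean(victim_api(X_test).argmax(1) == (X_test @ W_student).argmax(1))
```

Since log-probabilities differ from logits only by a per-row constant, the per-class least-squares fits all absorb the same shift, so the student's argmax matches the victim's almost everywhere.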
-
An efficient defence against physically realizable attacks is adversarial training following this recipe: Defending Against Physically Realizable Attacks on Image Classification
-
Skip connections improve the transferability of adversarial attacks to other networks. Skip Connections Matter: On the Transferability of Adversarial Examples Generated with ResNets
-
(attack-robust activation) k-winners-take-all defence Enhancing Adversarial Defense by k-Winners-Take-All
-
(comments from the sofa == unreliable) there is something called "network verification" - a direction aiming to verify (prove) certain properties of a model (e.g. robustness to certain types of perturbations - rotations, noise, adversarial, etc.). Work 1 gives some theoretical proofs (= verification) that adversarial training improves adversarial robustness, and work 2 is a fast and efficient verification approach
-
a method which explains how and why particular regions of the input image are responsible for a classifier's prediction. The method trains an external Generator (which generates a degraded image similar to the input; e.g. if the classifier outputs probability 0.9, the generator receives the input image and the request to make the classifier output probability 0.1) and a Discriminator, trained with a KL divergence to the actual classifier output and an L1 distance loss to the original input. The method is applicable to cases where the how and why of the classifier matter (e.g. medical applications). However, it requires training an additional model (a GAN, so the training won't be easy), unlike most attribution methods which can be applied directly to the classifier without any additional training or external models. Explanation by Progressive Exaggeration
-
A new attribution method where unimportant features of the feature map are replaced with noise. Converges in ~10 iterations; the authors also approximated it with a single NN which does the same in a single pass. Seems to really explain NN predictions (not just exploit the structure of the image). Restricting the Flow: Information Bottlenecks for Attribution
- β deformable kernels - seems to be more robust Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation
This part is about estimating test loss knowing train loss, i.e. test_loss < train_loss + G
-
GSNR = mean^2 / variance of gradients gives an estimate of generalization on the test set (the higher, the better). Understanding Why Neural Networks Generalize Well Through GSNR of Parameters
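A minimal sketch of the GSNR computation over per-sample gradients (toy numbers of my own, not the paper's code):

```python
# Sketch of GSNR (gradient signal-to-noise ratio) for each parameter:
# mean(grad)**2 / var(grad) over per-sample gradients. A higher GSNR is
# the paper's indicator of better generalization.
import numpy as np

def gsnr(per_sample_grads, eps=1e-12):
    """per_sample_grads: array of shape (num_samples, num_params)."""
    g = np.asarray(per_sample_grads, dtype=float)
    return g.mean(axis=0) ** 2 / (g.var(axis=0) + eps)

# Consistent gradients across samples -> high GSNR;
# noisy, sign-flipping gradients -> low GSNR.
consistent = gsnr([[1.0], [1.1], [0.9]])
noisy = gsnr([[1.0], [-1.0], [0.0]])
```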
-
another work on generalization gap estimation (an upper bound). It turns out that if we take a trained network and replace one layer with its fixed initial value, the loss does not increase drastically for most choices of layer. In fact only a small number of layers (called 'critical' in the paper) severely affect performance under such a transformation. The method could predict, e.g., that a trained resnet18 has a larger generalization gap than resnet34. The intriguing role of module criticality in the generalization of deep networks
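A toy sketch of how such a bound could replace a validation set when choosing the stopping epoch (the gap estimates below are made-up placeholders; in the papers they come from train-set statistics such as GSNR or module criticality):

```python
# Sketch: picking the best epoch from train loss plus an estimated
# generalization gap G, so no held-out validation split is needed.
# The gap values here are hypothetical placeholders.

def bound_on_test_loss(train_loss, gap_estimate):
    """Upper bound of the form test_loss <= train_loss + G."""
    return train_loss + gap_estimate

def pick_best_epoch(train_losses, gap_estimates):
    """Return the epoch index where train_loss + G is smallest."""
    bounds = [bound_on_test_loss(l, g)
              for l, g in zip(train_losses, gap_estimates)]
    return min(range(len(bounds)), key=bounds.__getitem__)

# Toy numbers: train loss keeps falling, but the estimated gap grows
# (overfitting), so the bound is minimized at an intermediate epoch.
train = [1.0, 0.5, 0.2, 0.05]
gaps = [0.1, 0.1, 0.3, 0.80]
best = pick_best_epoch(train, gaps)
```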
- exponential lr also converges (to comparable results) for a wide range of DL models with normalization, which raises questions about learning rate schedules in general - is it even worth trying to vary them? An Exponential Learning Rate Schedule for Deep Learning
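For reference, the schedule itself is trivial - lr_t = lr_0 * growth^t with growth > 1 (the growth factor below is purely illustrative, not from the paper):

```python
# Sketch of an exponential learning-rate schedule: the lr GROWS every
# step instead of decaying; the paper shows that for networks with
# normalization layers this can match constant/cosine schedules.

def exponential_lr(lr0, growth, step):
    return lr0 * growth ** step

schedule = [exponential_lr(0.1, 1.1, t) for t in range(3)]
# 0.1, 0.11, 0.121 - increasing every step
```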
-
most likely the best high dimensional optimization without gradients; by Google Brain; maybe useful for hyperparameter search/NAS? Gradientless Descent: High-Dimensional Zeroth-Order Optimization
-
double descent (test loss decreases, then increases, then decreases again as the number of parameters grows) is a frequent phenomenon; an OpenAI paper describes it well, BUT(!) it is NOT ALWAYS present according to this paper
-
Prox-SGD - SGD with explicit regularization which allows producing sparser networks with no accuracy loss. ProxSGD: Training Structured Neural Networks under Regularization and Constraints
- disproving false claims (see slides below): suboptimal local minima DO exist (that's why initialization matters), l2 norm regularization DECREASES performance, rank minimization DECREASES performance Truth or backpropaganda? An empirical investigation of deep learning theory
- (see answers below) How much Position Information Do Convolutional Neural Networks Encode?
- higher depth is beneficial (slide from this paper's presentation) - the paper seems to provide some intuition explaining the examples, but I haven't checked it yet
- vanilla gradient descent is theoretically optimal (surprise!) in convergence speed against all other gradient-based methods. But in practice it is not (no surprise), because the theoretical constraints are almost never satisfied. In this work the authors did some analysis and derived that for less-constrained functions gradient descent with clipping is optimal. But they again didn't compare with Adam and the others...
- another work on gradient clipping proved that it does not fight label noise, but 'partial' clipping (see the paper for details) does. Can gradient clipping mitigate label noise?
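A minimal sketch of a clipped gradient step, the primitive both of these works analyze (my own toy version):

```python
# Sketch of a clipped SGD step: rescale the gradient when its norm
# exceeds a threshold, then take a plain gradient step.
import numpy as np

def clipped_sgd_step(w, grad, lr=0.1, clip=1.0):
    norm = np.linalg.norm(grad)
    if norm > clip:
        grad = grad * (clip / norm)   # keep direction, cap magnitude
    return w - lr * grad

w = np.array([1.0, 1.0])
w2 = clipped_sgd_step(w, np.array([30.0, 40.0]), lr=0.1, clip=1.0)
# the gradient norm 50 is rescaled to 1 before the step is taken
```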
- the image below can give some intuition why compression/quantization works. The paper proves that permutation and rescaling (see below) are the only function-preserving transformations. Functional vs. parametric equivalence of ReLU networks
- Face from audio via GAN (insane but works) From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face from Speech
- DeepMind made speech synthesis via GAN (and proved that high-fidelity speech synthesis with GANs is possible). The paper has several tricks: 1) G and D conditioned on linguistic features; 2) 44h of training data; 3) residual blocks with progressive dilation in G; 4) several discriminators; 5) additional unconditional discriminators (checking realism only); 6) FID and KID computed on speech recognition model features to track training progress; 7) padding masks to generate longer samples (see paper for details).
-
the authors implemented conventional audio filters and the results are of just fantastic quality - they used tiny models and achieved high-quality results. paper github
-
harmonic convolution Deep Audio Priors Emerge From Harmonic Convolutional Networks
-
NLP-inspired pretraining for speech, representing it as a discrete vocabulary. vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
- a "baseline" for video continuation generation which somehow works; based on VideoBERT + an autoregressive model. Scaling Autoregressive Video Models
-
theoretical analysis of GANs leads to combining several discriminator objectives into a 'supergan' which should converge more stably in theory. Smoothness and Stability in GANs
-
Spectral Norm + Rank Norm to improve generalization (decrease generalization gap). Experiments show that this joint normalization improves both classification and GAN performance. Stable Rank Normalization for Improved Generalization in Neural Networks and GANs
- RealnessGAN - instead of hard labels 0 and 1 for the GAN, treat them as random variables A0 and A1. Seems to stabilize training, as the authors were able to train DCGAN at 1024x1024. Comes with proven theoretical guarantees of convergence. A0 and A1 were taken as different discrete distributions (so D had N output probabilities instead of a single one). Real or Not Real, that is the Question
-
iclr.video use the classifier as an energy function; helps improve applications of generative models to downstream tasks (OOD detection, adversarial robustness, etc.). Your classifier is secretly an energy based model and you should treat it like one
-
To what extent can we manipulate generated image features (zoom in/out, shift, ...)? On the "steerability" of generative adversarial networks
-
a physics-motivated model for videos (beautiful motivation, but it works only with very simple systems of objects so far). The idea is to learn an encoder from pixel space, a Hamiltonian network (which governs the system state) and a decoder from latent space back to pixel space. The system evolves by adjusting the state by alpha * dt, where alpha is the speed. In practice it is useless, but I liked the idea and motivation. Hamiltonian Generative Networks
-
visualization tool and (new) metrics to monitor/estimate convergence of GAN A Closer Look at the Optimization Landscapes of Generative Adversarial Networks
- The authors show that the DARTS method overfits the validation data and does not generalize well to test data for several search spaces. Poor generalization is highly correlated with the sharpness of the minima. The sharpness of a minimum can be efficiently estimated by computing the largest eigenvalue of the full Hessian w.r.t. alpha on a randomly taken validation batch. We can then track this metric (the largest eigenvalue) and do early stopping (DARTS-ES) when it starts to grow too rapidly. DARTS-ES already increases performance, but this can be further improved. The authors show that regularization (e.g. the L2 norm) stabilizes the largest-eigenvalue metric, so they propose the following method: whenever a sharp increase of the eigenvalue is observed, the model goes back to the last checkpoint before that behaviour started, and training continues with higher regularization. This approach improves generalization significantly in many settings, including the original DARTS search space. Understanding and Robustifying Differentiable Architecture Search
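The sharpness metric itself is cheap to sketch: the largest eigenvalue can be estimated with power iteration using only Hessian-vector products (below a dense toy matrix stands in for the real Hessian w.r.t. alpha, which in practice is accessed only through autograd Hessian-vector products):

```python
# Sketch of the sharpness metric tracked by DARTS-ES: the largest
# eigenvalue of a Hessian, estimated by power iteration. A toy diagonal
# matrix with a known spectrum stands in for the real Hessian.
import numpy as np

def largest_eigenvalue(hess_vec_product, dim, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hess_vec_product(v)
        v = hv / np.linalg.norm(hv)   # converges to the top eigenvector
    return v @ hess_vec_product(v)    # Rayleigh quotient = top eigenvalue

H = np.diag([5.0, 2.0, 1.0])          # toy "Hessian" with known spectrum
lam = largest_eigenvalue(lambda v: H @ v, dim=3)
# lam is ~5.0; track this metric over epochs and stop when it spikes
```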
-
1.5 GPU hours of search on CIFAR (4-10x faster than DARTS) - results better than DARTS. PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search
-
Early research on how to initialize meta-networks (networks which emit other networks, e.g. for NAS). The authors propose an efficient initialization, but there are some restrictions on the search space. This is very early research and, according to the authors, there is a lot of low-hanging fruit in this direction. In the presentation the authors very clearly explain that initialization is crucial, as with bad initialization models will not converge at all. Principled Weight Initialization for Hypernetworks
-
FasterSeg: Searching for Faster Real-time Semantic Segmentation
-
10 NAS algorithms evaluated on a search space of 15k architectures (all 15k trained). NAS-Bench-201: Extending the Scope of Reproducible Neural Architecture Search
-
NAS-Bench-1Shot1: Benchmarking and Dissecting One-shot Neural Architecture Search
-
How to make model better generalize on unseen (related but different) domains by plugging special feature transformation layers into model Cross-Domain Few-Shot Classification via Learned Feature-Wise Transformation
-
Supervised learning still performs (much) better, but here are major improvements for unsupervised learning Self-labelling via simultaneous clustering and representation learning
-
Representations learned progressively from the last NN layer to the first - reasonable results and the ability to control generation. Progressive Learning and Disentanglement of Hierarchical Representations
-
reasonable experiments on EMNIST for varying symbols with large amount of control Disentanglement by Nonlinear ICA with General Incompressible-flow Networks (GIN)
- β Space2Vec - an embedding for spatial locations (as far as I understood, only for geolocations, not for (x, y) on an image) -> used in classification. Multi-Scale Representation Learning for Spatial Feature Distributions using Grid Cells
- superior results on the 10 most popular architectures when replacing batchnorm with deconvolution, which can be computed efficiently at a fraction of a conv layer's cost. There is also a connection with the brain (a special kind of vision operation). The intuition: the layer decorrelates neighbouring pixels and channels, so it is much easier for the NN to learn. Network Deconvolution
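A toy illustration of the decorrelation intuition (ZCA-style whitening on two correlated features; this is not the paper's exact layer, just the underlying idea):

```python
# Sketch of the decorrelation idea behind network deconvolution:
# ZCA-style whitening that removes correlations between input features
# before the layer sees them.
import numpy as np

def zca_whiten(X, eps=1e-5):
    """Whiten rows of X so the feature covariance becomes ~identity."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 1))
X = np.hstack([base, base + 0.1 * rng.normal(size=(1000, 1))])  # correlated
Xw = zca_whiten(X)
cov_after = Xw.T @ Xw / len(Xw)   # ~identity: features are decorrelated
```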
- if you somehow know that your network should compute exact multiplication/addition of tensor elements (e.g. physics, math problems): Neural Arithmetic Units
- BLEU is finally sentenced to death. BERTScore: Evaluating Text Generation with BERT
-
Nucleus sampling: instead of the top-k most probable words, take the top-p (the smallest set of m words whose summed probability >= p). The Curious Case of Neural Text Degeneration
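A minimal toy implementation of top-p sampling (my own sketch, not the authors' code):

```python
# Sketch of nucleus (top-p) sampling: keep the smallest set of tokens
# whose probabilities sum to at least p, renormalize, and sample only
# from that set.
import numpy as np

def nucleus_sample(probs, p, rng):
    order = np.argsort(probs)[::-1]                 # most probable first
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    kept = order[:cutoff]                           # the "nucleus"
    kept_probs = probs[kept] / probs[kept].sum()    # renormalize
    return rng.choice(kept, p=kept_probs)

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.15, 0.05])
samples = {int(nucleus_sample(probs, p=0.7, rng=rng)) for _ in range(200)}
# with p=0.7 the nucleus is tokens 0 and 1 (cumulative 0.8 >= 0.7),
# so the two least probable tokens are never sampled
```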
-
Unlikelihood training (which significantly outperforms nucleus sampling above) - the idea is to penalize unlikely situations (negative examples which are either 1) random or 2) repeated words). Neural Text Generation With Unlikelihood Training
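The token-level loss is easy to sketch (my own toy version; the paper also has a sequence-level variant):

```python
# Sketch of token-level unlikelihood training: the usual likelihood term
# for the target token plus a penalty pushing probability away from
# negative candidates (e.g. already-repeated words).
import numpy as np

def unlikelihood_loss(probs, target, negatives, eps=1e-12):
    like = -np.log(probs[target] + eps)                        # standard NLL
    unlike = -sum(np.log(1.0 - probs[c] + eps) for c in negatives)
    return like + unlike

probs = np.array([0.7, 0.2, 0.1])
base = unlikelihood_loss(probs, target=0, negatives=[])
penalized = unlikelihood_loss(probs, target=0, negatives=[1])
# marking token 1 as a negative adds -log(1 - 0.2) to the loss
```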
-
β Controllable generation with language models Plug and Play Language Models: A Simple Approach to Controlled Text Generation
-
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
-
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
-
a tiny change to the LSTM which makes it better than your Transformer - die, BERT! (ok, it won't, but the idea is simple and it works). Mogrifier LSTM
-
yes Are Transformers universal approximators of sequence-to-sequence functions?
-
a Transformer solves math problems much better than Wolfram Alpha (with a pretty straightforward approach). Deep Learning For Symbolic Mathematics
-
The main problem with text GANs (according to the authors) is that the Discriminator easily overpowers the Generator. To improve Generator training, the Generator is rewarded when the currently generated sentence is better than the previously generated one. Self-Adversarial Learning with Comparative Discrimination for Text Generation
- the main idea is to detect anomalous regions as regions with a high difference between the original and the AE-reconstructed image; solved by gradient minimization of reconstruction_loss(x_i) + ||x_i - x_orig||; anomaly localization seems to improve. Iterative energy-based projection on a normal data manifold for anomaly localization
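A toy sketch of the iterative projection (a fixed linear "autoencoder" stands in for the trained one; all names and constants below are made up for illustration):

```python
# Sketch of iterative projection onto the "normal" manifold: starting
# from the input, take gradient steps minimizing
#   reconstruction_error(x) + lam * ||x - x_orig||^2,
# then read the anomaly map off as |x_final - x_orig|.
import numpy as np

def project_to_normal(x_orig, recon, steps=200, lr=0.1, lam=0.1):
    x = x_orig.copy()
    for _ in range(steps):
        # gradient of ||x - recon(x)||^2, treating recon(x) as fixed
        grad_recon = 2 * (x - recon(x))
        grad_prox = 2 * lam * (x - x_orig)   # stay close to the input
        x -= lr * (grad_recon + grad_prox)
    return x

# Toy "autoencoder": normal data lives near zero, so reconstruction is
# all-zeros; an anomalous spike in one coordinate gets pulled down.
recon = lambda x: np.zeros_like(x)
x_orig = np.array([0.0, 0.0, 5.0])           # third pixel is anomalous
x_proj = project_to_normal(x_orig, recon)
anomaly_map = np.abs(x_proj - x_orig)        # largest at the anomaly
```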
-
Automatic search for equivalent (but simpler or faster) set of operations, which is useful for NN inference speed optimization Deep Symbolic Superoptimization Without Human Knowledge
-
speeding up autoregressive decoders. Decoding As Dynamic Programming For Recurrent Autoregressive Models
-
BackPACK - a wrapper for PyTorch to estimate several gradient statistics; works reasonably fast, but some features do not support branching or custom forward implementations. BackPACK: Packing more into Backprop
-
how to learn from rule-based, automatically generated labels - the method can significantly improve performance. Learning from Rules Generalizing Labeled Exemplars
-
how to measure quality on a test set with noisy labels? (exact formula in the paper) Discrepancy Ratio: Evaluating Model Performance When Even Experts Disagree on the Truth
-
a new NN+tree end2end-trained module for tabular data, outperforming XGBoost and CatBoost on several datasets. Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data
-
decision trees where the criterion at each node is a linear model, with NN approximation. More robust (but most likely slower). Locally Constant Networks
-
Neural Tangents - a library for infinite-width NNs. paper
-
exploring continuous game of life Intrinsically Motivated Discovery of Diverse Patterns in Self-Organizing Systems
-
an amazing prank presentation - I hated the presenter very much while sitting at home =)