Notes on the application of the Normalized Gradient Descent approach to Deep Learning

Normalized gradient descent decouples the gradient norm from its direction. In the simplest setup it just normalizes the gradient and multiplies it by some step size.

The simplest optimizer of that form is the following:

import tensorflow as tf

def l2_normalize(grad: tf.Tensor) -> tf.Tensor:
    # Rescale the gradient to unit L2 norm.
    return tf.nn.l2_normalize(grad)

class NormalizedSGD(tf.train.Optimizer):
    ...
    def _apply_dense(self, grad: tf.Tensor, var: tf.Variable) -> tf.Operation:
        # Ignore the gradient magnitude: step by a fixed length lr along its direction.
        update_grad = self.lr * l2_normalize(grad)
        return tf.assign_sub(var, update_grad)

Despite its simplicity, this method seems to work surprisingly well on different benchmarks, in many cases beating the Adam optimizer.

My notes:

Here are my notes about this problem: Notes

Related works:

It is worth checking these papers:

Results

Simple Quadratic problem:

Notebook Solving_quadratic_equation

The aim is to minimize a quadratic objective of the form:

L(x) = 0.5 x^T Q x + b^T x

where Q is a random positive-definite matrix and b is a random vector.
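For concreteness, here is a minimal sketch of how such a benchmark can be set up with the normalized update in TensorFlow 1.x (variable names, dimensions and the learning rate are illustrative and not taken from the notebook):

import numpy as np
import tensorflow as tf

dim = 10
A = np.random.randn(dim, dim)
Q = A @ A.T + dim * np.eye(dim)   # random positive-definite matrix
b = np.random.randn(dim, 1)       # random vector

Q_t = tf.constant(Q)
b_t = tf.constant(b)
x = tf.Variable(np.random.randn(dim, 1))

# Quadratic objective L(x) = 0.5 x^T Q x + b^T x
loss = 0.5 * tf.matmul(x, tf.matmul(Q_t, x), transpose_a=True) + tf.matmul(b_t, x, transpose_a=True)

# Normalized-gradient step: move by a fixed length lr along the gradient direction.
lr = 0.1
grad = tf.gradients(loss, [x])[0]
train_op = tf.assign_sub(x, lr * tf.nn.l2_normalize(grad))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)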

Notes:

  • the adaptive learning rate works in this case.
  • momentum works too.

Example result:

  • one may note that the initial value of the learning rate was too small.
  • the momentum was decreasing slowly.
  • the gradient correlation drops to zero (see my notes).

img

Comparison with Adam and SGD (momentum=0.9) for different initial values of the learning rate.

When using an adaptive method, the initial value of the learning rate should not matter as much. See fig. a), where NSGD denotes the normalized gradient descent method with the adaptive learning rate discussed in the notes.

img

Training MLP on MNIST dataset.

Notebook Training simple multilayer perceptron

Model:

  • 30 dense layers of size 128 (a Keras sketch of this architecture follows the list).
  • After each layer, batch normalization is applied, followed by dropout at rate 0.2.
  • A small l2 regularization is added to the weights of the network.
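A minimal Keras sketch of this architecture (the activation function, the output layer, and the exact l2 strength are assumptions, not taken from the notebook):

from tensorflow import keras

l2_reg = keras.regularizers.l2(1e-5)   # "small l2 regularization" -- the value is assumed

inputs = keras.layers.Input(shape=(784,))
x = inputs
for _ in range(30):
    x = keras.layers.Dense(128, activation="relu", kernel_regularizer=l2_reg)(x)
    x = keras.layers.BatchNormalization()(x)
    x = keras.layers.Dropout(0.2)(x)
outputs = keras.layers.Dense(10, activation="softmax")(x)

model = keras.Model(inputs, outputs)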

Notes:

  • Adaptive learning rate was not working, as discussed in my notes.
  • Momentum was not helping.
  • Plain max normalization with manual learning rate scheduling is used to achieve the best result (a sketch of max normalization follows this list).
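For reference, a sketch of what the max normalization could look like, assuming it scales the gradient by its largest absolute entry (the infinity norm); the exact formulation used in the notebook may differ:

def max_normalize(grad: tf.Tensor, eps: float = 1e-12) -> tf.Tensor:
    # Divide by the largest absolute entry, so each coordinate of the update
    # is bounded by the learning rate.
    return grad / (tf.reduce_max(tf.abs(grad)) + eps)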

Comparison with Adam and SGD with momentum=0.9

img

img

Training CNN on CIFAR10 dataset.

Notebook Training simple CNN

The implementation of the CNN is taken from: keras-demo

Notes:

  • Adam and normalized gradients train much faster than the RMSprop optimizer used in the Keras demo. The authors report 79% accuracy after 60 epochs; here we reach 82% in 45 epochs.
  • Adaptive learning rate was not working, as discussed in my notes.
  • Momentum was not working.
  • Plain std normalization with manual learning rate scheduling is used to achieve the best result (a sketch of std normalization follows this list). Max normalization performed worse in this case; maybe a lower learning rate is required to get the best result.
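Analogously, a sketch of the std normalization, assuming it scales the gradient by the standard deviation of its entries; again, the notebook's exact formulation may differ:

def std_normalize(grad: tf.Tensor, eps: float = 1e-12) -> tf.Tensor:
    # Divide by the standard deviation of the gradient entries.
    _, variance = tf.nn.moments(grad, axes=list(range(grad.shape.ndims)))
    return grad / (tf.sqrt(variance) + eps)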

Comparison with Adam and SGD with momentum=0.9

img

img