@shagunsodhani
Last active July 25, 2023 18:07
Notes for "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" paper

The Batch Normalization paper describes a method to address the various issues related to training of Deep Neural Networks. It makes normalization a part of the architecture itself and reports significant improvements in terms of the number of iterations required to train the network.

Issues With Training Deep Neural Networks

Internal Covariate shift

Covariate shift refers to a change in the input distribution of a learning system. In a deep network, the input to each layer depends on the parameters of all the preceding layers, so even small parameter changes get amplified as the network grows deeper. The resulting change in the input distribution of the internal layers is known as internal covariate shift.

Networks are known to converge faster when their inputs are whitened (i.e. transformed to have zero mean and unit variance, and decorrelated). Internal covariate shift works against this, since the distribution of each layer's inputs keeps changing during training.

Vanishing Gradient

Saturating nonlinearities (like tanh or sigmoid) are hard to use in deep networks: once the activations drift into the saturation region, the gradients become very small and training gets stuck, and the problem worsens as the network grows deeper. Some ways around this are to use:

  • Nonlinearities like ReLU which do not saturate
  • Smaller learning rates
  • Careful initializations
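As a rough illustration of why saturation is a problem, the sketch below evaluates the derivative of the sigmoid at a few hand-picked pre-activation values. The values and helper names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

# Hypothetical pre-activation values, moving from the linear region into saturation.
pre_activations = np.array([0.0, 2.0, 5.0, 10.0])
grads = sigmoid_grad(pre_activations)

for x, g in zip(pre_activations, grads):
    print(f"x = {x:5.1f}   sigmoid'(x) = {g:.6f}")
```

The derivative peaks at 0.25 for x = 0 and shrinks quickly as |x| grows, so a unit whose inputs sit far from zero passes back almost no gradient.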

Normalization

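Let us say that the layer we want to normalize has a d-dimensional input x = (x1, ..., xd). Then, we can normalize the kth dimension as follows:

$$\hat{x}_k = \frac{x_k - E[x_k]}{\sqrt{\mathrm{Var}[x_k]}}$$

where the expectation and variance are computed over the training data.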

We also need to scale and shift the normalized values, otherwise just normalizing a layer would limit what the layer can represent. For example, if we normalize the inputs to a sigmoid function, the outputs would be constrained to the nearly linear region of the sigmoid.

So the normalized input $\hat{x}_k$ is transformed to:

$$y_k = \gamma_k \hat{x}_k + \beta_k$$

where γ and β are parameters to be learned.
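To see why the scale and shift restore representational power, here is a short worked step. It uses the subscript notation of these notes, with $\hat{x}_k = (x_k - E[x_k]) / \sqrt{\mathrm{Var}[x_k]}$ the normalized value and $y_k = \gamma_k \hat{x}_k + \beta_k$ the transformed value. If the network were to learn $\gamma_k = \sqrt{\mathrm{Var}[x_k]}$ and $\beta_k = E[x_k]$, the transform would reduce to the identity:

$$y_k = \sqrt{\mathrm{Var}[x_k]} \cdot \frac{x_k - E[x_k]}{\sqrt{\mathrm{Var}[x_k]}} + E[x_k] = x_k$$

So the original activations remain recoverable whenever that is the optimal thing to do. This assumes exact statistics and ignores the small constant ε that the paper adds to the variance for numerical stability.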

Moreover, just like we use mini-batches in Stochastic Gradient Descent (SGD), we can use mini-batches with normalization: the mean and variance of each activation are estimated from the current mini-batch.

The transformation from x to y as described above is called the Batch Normalizing Transform. The BN transform is differentiable, so gradients can be backpropagated through it. It ensures that, as the model trains, the layers keep learning on input distributions that exhibit less internal covariate shift, which accelerates training.
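A minimal NumPy sketch of the training-time BN transform is given below, assuming a 2-D input of shape (batch size, features). The function name `batch_norm_train`, the toy data, and the value of `eps` (the small constant added to the variance for numerical stability) are assumptions made for illustration.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Apply the BN transform to a mini-batch x of shape (m, d)."""
    mu = x.mean(axis=0)                    # per-activation mini-batch mean
    var = x.var(axis=0)                    # per-activation (biased) mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    y = gamma * x_hat + beta               # scale and shift
    return y, mu, var

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))  # toy mini-batch
gamma = np.ones(4)   # learned in practice; fixed here for the demo
beta = np.zeros(4)   # learned in practice; fixed here for the demo

y, mu, var = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0))  # close to 0 for every activation
print(y.std(axis=0))   # close to 1 for every activation
```

With γ = 1 and β = 0 the output of each activation has approximately zero mean and unit variance over the mini-batch; during training γ and β would be updated by gradient descent along with the other parameters.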

At training time, a subset of activations is specified and the BN transform is applied to each of them.

During test time, the normalization is done using the population statistics instead of mini-batch statistics to ensure that the output deterministically depends on the input.
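The sketch below illustrates the test-time behaviour under some simplifying assumptions: the population mean is taken as the average of the mini-batch means, and the population variance as the average of the mini-batch variances with the unbiased m/(m-1) correction. Frameworks often track these with moving averages instead; the function and variable names here are hypothetical.

```python
import numpy as np

def batch_norm_inference(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Normalize with fixed population statistics instead of batch statistics."""
    x_hat = (x - pop_mean) / np.sqrt(pop_var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
m, d = 32, 3
# Toy stand-in for the mini-batches seen during training.
batches = [rng.normal(loc=1.0, scale=2.0, size=(m, d)) for _ in range(50)]

batch_means = np.stack([b.mean(axis=0) for b in batches])
batch_vars = np.stack([b.var(axis=0) for b in batches])
pop_mean = batch_means.mean(axis=0)
pop_var = m / (m - 1) * batch_vars.mean(axis=0)  # unbiased variance estimate

gamma, beta = np.ones(d), np.zeros(d)
sample = rng.normal(loc=1.0, scale=2.0, size=(1, d))
others = rng.normal(size=(7, d))

# The output for `sample` does not depend on what else is in the batch.
out_alone = batch_norm_inference(sample, gamma, beta, pop_mean, pop_var)
out_in_batch = batch_norm_inference(
    np.vstack([sample, others]), gamma, beta, pop_mean, pop_var
)[:1]
print(np.allclose(out_alone, out_in_batch))
```

Because the statistics are fixed, the same input always produces the same output, whether it is processed alone or alongside other examples.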

Batch Normalized Convolutional Networks

Let us say that z = g(Wu+b) is the operation performed by the layer, where W and b are the parameters to be learned, g is a nonlinearity, and u is the input from the previous layer.

The BN transform is added just before the nonlinearity, by normalizing x = Wu+b. An alternative would have been to normalize u itself, but u is likely the output of another nonlinearity, so the shape of its distribution is likely to change during training, and constraining just its first and second moments would not eliminate the covariate shift.

When normalizing Wu+b, we can ignore the b term, as it would be canceled by the mean subtraction in the normalization step (b's role is subsumed by β), and we have

z = g( BN(Wu) )
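The claim that the bias is redundant can be checked numerically. The sketch below is an illustration under assumed shapes, using the row-vector convention `u @ W` and ReLU as a stand-in for the nonlinearity g.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
u = rng.normal(size=(16, 5))   # mini-batch of inputs from the previous layer
W = rng.normal(size=(5, 3))    # weights (row-vector convention: u @ W)
b = rng.normal(size=3)         # bias that BN makes redundant
gamma, beta = np.ones(3), np.zeros(3)

with_bias = batch_norm(u @ W + b, gamma, beta)
without_bias = batch_norm(u @ W, gamma, beta)
max_diff = np.abs(with_bias - without_bias).max()
print(max_diff)  # zero up to floating-point error

# z = g(BN(Wu)), with ReLU assumed as the nonlinearity g.
z = np.maximum(0.0, without_bias)
```

Adding b shifts every example in the mini-batch by the same amount, so it shifts the mini-batch mean by that amount too and disappears after the subtraction; any constant offset the layer needs is supplied by β.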

For convolutional layers, normalization should obey the convolutional property as well, i.e. different elements of the same feature map, at different locations, are normalized in the same way. So all the activations in a mini-batch are jointly normalized over all the locations, and the parameters (γ and β) are learned per feature map instead of per activation.
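For the convolutional case, a sketch under the assumption of a (batch, channels, height, width) layout looks as follows. The statistics are computed over the batch and both spatial axes, leaving one mean, one variance, one γ and one β per feature map. The function name and toy shapes are illustrative.

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """x has shape (N, C, H, W); gamma and beta have shape (C,)."""
    # One mean and variance per feature map, shared across batch and locations.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Reshape per-channel parameters so they broadcast over N, H and W.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 3, 8, 8))  # e.g. 32 RGB-like inputs
y = batch_norm_conv(x, gamma=np.ones(3), beta=np.zeros(3))

per_channel_mean = y.mean(axis=(0, 2, 3))
per_channel_std = y.std(axis=(0, 2, 3))
print(per_channel_mean, per_channel_std)  # close to 0 and 1 for each channel
```

With a mini-batch of size m and feature maps of size p × q, each channel is effectively normalized over m · p · q values.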

Advantages Of Batch Normalization

  1. Reduces internal covariate shift.
  2. Reduces the dependence of gradients on the scale of the parameters or their initial values.
  3. Regularizes the model and reduces the need for dropout, photometric distortions, local response normalization and other regularization techniques.
  4. Allows use of saturating nonlinearities and higher learning rates.
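The reduced dependence on parameter scale (point 2) can be illustrated with a small check that normalizing Wu and (aW)u gives the same result for a positive scalar a. This is a sketch with assumed shapes; ε is set to 0 so that the equality holds up to floating-point error rather than only approximately.

```python
import numpy as np

def normalize(x, eps=0.0):
    # Normalization step of BN (gamma = 1, beta = 0); eps = 0 for an exact comparison.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(32, 6))   # mini-batch of layer inputs
W = rng.normal(size=(6, 4))    # layer weights (row-vector convention)
a = 10.0                       # hypothetical positive rescaling of the weights

out_original = normalize(u @ W)
out_scaled = normalize(u @ (a * W))
scale_invariant = np.allclose(out_original, out_scaled)
print(scale_invariant)
```

Scaling the weights scales both the centered activations and their standard deviation by the same factor, which cancels in the ratio. This is why larger weights do not blow up the activations flowing through a batch-normalized layer.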

Batch Normalization was applied to models trained on MNIST and to the Inception network trained on ImageNet. All the above-mentioned advantages were validated in the experiments. Interestingly, the Batch Normalized Inception model with sigmoid nonlinearity achieved an accuracy of 69.8% (the overall best, using any nonlinearity, was 74.8%), while the Inception model with sigmoid nonlinearity but without Batch Normalization performed no better than a random guess.

Future Work

While the BN Transform does improve the training of deep networks, its precise effect on gradient propagation is still not well understood. A future extension of Batch Normalization would be to Recurrent Neural Networks, where internal covariate shift and vanishing gradients are more severe. It also remains to be explored whether it can help with domain adaptation by easily generalizing to new data distributions.

@yenchenlin

Hello @shagunsodhani, may I ask a question about this paper here?

In section 2 of this paper, there's an example that says that if

x_hat = x - E[x], where x = u + b, X = {X1...N} ... if a gradient descent step ignores the dependence of E[x] on b, then it will update ...

I get the authors' point. However, how can I derive the correct gradient for the bias b in the above example?
It seems that the last equation in the example, u + (b + delta_b) - E[u + (b + delta_b)] = u + b - E[u + b], always holds no matter what delta_b is.

@danstowell

Typo: "whitelisted" -> "whitened"

@sandeep-ipk

Can someone please shed some light on how the mean and variance over the entire batch are calculated? At any one time only one input vector is passed through the network, right? So how can we calculate the mean and variance of the entire batch? I'm confused. :/

@tlkahn

tlkahn commented Apr 8, 2017

Cannot see images. Is it just me?

@sunshineatnoon

@sandeep-ipk, a whole batch of samples (vectors) passes through the network at a time. For instance, if you train a CNN to classify images, then each layer gets a tensor of size (batchSize, number of feature maps, feature map height, feature map width). You can then calculate the mean and variance over these batchSize feature maps.

@shagunsodhani
Author

@tlkahn I have updated the link to the images. Sorry, I did not get any notifications for your comment.

@jfhauris

In the keras doc for Batch Normalization it says: "axis: Integer, the axis that should be normalized (typically the features axis). "
I am really confused about which dimension is being normalized. For example, if you have a batch of 32 RGB images that you want to normalize, it seems to me that you want to normalize all of the R channels with respect to each other, then separately all of the G channels with respect to each other, then B. In other words, you do not want to take the mean and std of the RGB channels together but of each channel separately, and then apply the respective normalization to each channel.
More explicitly: for each channel = feature map = i = R, G, B (for this example),
mu_i = mean(all 32 images, channel i) = sum_j[ image_j_i ] / 32. Here you are summing over the j = 1..32 images for the ith channel.
Then std_i = std(all 32 images, channel i).
Finally roughly: image_norm_ch_i = (image_ch_i - mu_i) / std_i

Is this correct or am I totally missing the boat? Thanks, Jon

@Nitinsiwach

@yenchenlin - If you were able to find a resolution, would you mind taking a look at the exact same question I have posted here?

@coldgemini

why doesn't the batch norm has to scale back to original distribution? then why do you even need this layer if everything is scaled back after all?

@JoelRunevic

JoelRunevic commented Jun 23, 2023

why doesn't the batch norm has to scale back to original distribution? then why do you even need this layer if everything is scaled back after all?

We have that $y = \gamma \hat{x} + \beta$, but this does not imply that $y = x$, where $x$ is the unnormalized value of $\hat{x}$. The reason we normalize is to ensure that the gradient step is more "well-behaved" and does not fall victim to issues such as vanishing gradients and saturated nonlinearities. However, we do not want to force the distribution to have zero mean and unit variance in every case, as this may not be useful in some cases. Hence, we allow the $\gamma$ and $\beta$ parameters to scale and shift the distribution, but this does not imply that the distribution will be scaled back to the unnormalized distribution in every case (this will only happen if the network deems it optimal). It simply adds flexibility to the network.
