shagunsodhani · October 9, 2016 13:22
diff --git a/PixelRNN.md b/PixelRNN.md
@@ -0,0 +1,78 @@
+# Pixel Recurrent Neural Network
+
+## Introduction
+
+* Problem: Building an expressive, tractable and scalable image model which can be used in downstream tasks like image generation, reconstruction, compression etc.
+* [Link to the paper](https://arxiv.org/abs/1601.06759)
+
+## Model
+
+* Scan the image, one row at a time and one pixel at a time (within each row).
+* Given the scanned content, predict the distribution over the possible values for the next pixel.
+* Joint distribution over the pixel values is factorised into a product of conditional distributions, thus casting the problem as a sequence problem.
+* Parameters used in prediction are shared across all the pixel positions.
+* Since each pixel is jointly determined by 3 values (3 colour channels), each channel is also conditioned on the channels of the same pixel that have already been generated (R, then G, then B).
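+
+The factorisation described above can be written out explicitly. This is a sketch that assumes an n x n image flattened in raster-scan order into pixels x_1, ..., x_{n^2}, with the colour channels of each pixel ordered R, G, B.
+
+```latex
+% Joint distribution as a product of per-pixel conditionals (raster-scan order)
+p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})
+
+% Each per-pixel conditional is split further across the colour channels
+p(x_i \mid \mathbf{x}_{<i}) =
+    p(x_{i,R} \mid \mathbf{x}_{<i}) \,
+    p(x_{i,G} \mid \mathbf{x}_{<i}, x_{i,R}) \,
+    p(x_{i,B} \mid \mathbf{x}_{<i}, x_{i,R}, x_{i,G})
+```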
+
+### Pixel as discrete value
+
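+* The conditional distributions are multinomial (with each channel variable taking 1 of 256 discrete values) and are modelled with a softmax layer.
+* This discrete representation is simple, makes no assumption about the shape of the distribution (it can be arbitrarily multimodal) and is easier to learn.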
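+
+As a sketch of what the discrete output looks like, each channel value can be read off a 256-way softmax. The logits z are notation assumed here for illustration; they stand for the network output for channel c of pixel i.
+
+```latex
+% 256-way softmax over the possible values v of channel c at pixel i
+p(x_{i,c} = v \mid \text{context}) =
+    \frac{\exp(z_{i,c,v})}{\sum_{u=0}^{255} \exp(z_{i,c,u})},
+    \qquad v \in \{0, 1, \ldots, 255\}
+```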
+
+## Pixel RNN
+
+### Row LSTM
+
+* Unidirectional layer that processes the image row by row, from top to bottom.
+* Uses one-dimensional convolution (kernel of size k x 1, k >= 3).
+* Refer to figure 2 in the [paper](https://arxiv.org/abs/1601.06759).
+* Weight sharing in the convolution ensures translation invariance of the computed features along each row.
+* For the LSTM, the input-to-state component is computed for the entire 2-d input map and is then masked to include only the valid context.
+* For the equations of the state-to-state component, refer to equation 3 in the [paper](https://arxiv.org/abs/1601.06759).
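+
+For convenience, the row-wise update has roughly the following form. This is a sketch under assumed notation and should be checked against equation 3 in the paper: h_{i-1} and c_{i-1} are the hidden and cell states of the previous row, x_i is row i of the input map, K^{ss} and K^{is} are the state-to-state and input-to-state kernels, \circledast is convolution and \odot is elementwise multiplication. The activation \sigma is taken to be the logistic sigmoid for the gates o, f, i and tanh for the content gate g.
+
+```latex
+% All four gates for a whole row are computed at once
+[\mathbf{o}_i, \mathbf{f}_i, \mathbf{i}_i, \mathbf{g}_i] =
+    \sigma\left(K^{ss} \circledast \mathbf{h}_{i-1} + K^{is} \circledast \mathbf{x}_i\right)
+
+% Standard LSTM cell and hidden state updates, applied row by row
+\mathbf{c}_i = \mathbf{f}_i \odot \mathbf{c}_{i-1} + \mathbf{i}_i \odot \mathbf{g}_i
+\mathbf{h}_i = \mathbf{o}_i \odot \tanh(\mathbf{c}_i)
+```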
+
+### Diagonal BiLSTM
+
+* Bidirectional layer that processes the image in a diagonal fashion.
+* The input map is skewed by offsetting each row of the image by one position with respect to the previous row.
+* Refer to figure 3 in the [paper](https://arxiv.org/abs/1601.06759).
+* For both directions, the input-to-state component is a 1 x 1 convolution while the state-to-state recurrent component is computed with a column-wise convolution using a kernel of size 2 x 1.
+* A kernel size of 2 x 1 processes a minimal amount of information at each step, yielding a highly non-linear computation.
+* The output map is skewed back by removing the offset positions.
+* To prevent the layer from seeing future pixels, the right output map is shifted down by one row and added to the left output map.
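+
+The skewing step can be made concrete with some index arithmetic. The zero-based indices i (row), j (column) and the symbol \tilde{x} for the skewed map are notation assumed here for illustration.
+
+```latex
+% Skew: row i of an n x n map is shifted right by i positions,
+% which gives a map of size n x (2n - 1)
+\tilde{x}_{i,\, i + j} = x_{i,j}, \qquad 0 \le i, j \le n - 1
+
+% Left-to-right direction: the left and upper neighbours of pixel (i, j)
+% both land in the previous column of the skewed map
+(i,\, j - 1) \mapsto (i,\; i + j - 1), \qquad (i - 1,\, j) \mapsto (i - 1,\; i + j - 1)
+```
+
+Under this mapping, a 2 x 1 column-wise kernel applied to the previous column covers exactly these two neighbours, and all pixels with the same value of i + j sit in one column and can be processed together.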
+
+### Residual Connections
+
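+* Residual connections from one LSTM layer to the next are used to increase convergence speed and to propagate signals more directly through the network; learnable skip connections from each layer to the output can be used in addition.
+* Refer to figure 4 in the [paper](https://arxiv.org/abs/1601.06759).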
+
+### Masked Convolutions
+
+* Masks are used to enforce certain restrictions on the connections in the network (e.g. when predicting values for the R channel, values of the B channel cannot be used).
+* Mask A is applied only to the first convolution layer and restricts connections to those neighbouring pixels and colour channels that have already been seen.
+* Mask B is applied to all subsequent input-to-state convolutional transitions and relaxes mask A by also allowing connections from a colour channel to itself.
+* Refer to figure 4 in the [paper](https://arxiv.org/abs/1601.06759).
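+
+The two masks can be sketched as indicator functions. The notation is assumed here for illustration: the kernel is k x k with k odd and centre c = (k - 1) / 2, (u, v) is a zero-based kernel offset, and a, b in {0, 1, 2} index the output and input colour (R, G, B). For layers after the first, the feature maps are assumed to be split into three groups, one per colour.
+
+```latex
+% Mask A: only previously generated pixels, and at the centre only earlier colours
+m^{A}_{a,b}(u, v) =
+\begin{cases}
+1 & \text{if } u < c \text{ or } (u = c \text{ and } v < c) \\
+1 & \text{if } (u, v) = (c, c) \text{ and } b < a \\
+0 & \text{otherwise}
+\end{cases}
+
+% Mask B: same as mask A, but a colour may also connect to itself at the centre
+m^{B}_{a,b}(u, v) =
+\begin{cases}
+1 & \text{if } u < c \text{ or } (u = c \text{ and } v < c) \\
+1 & \text{if } (u, v) = (c, c) \text{ and } b \le a \\
+0 & \text{otherwise}
+\end{cases}
+```
+
+The masked kernel is the elementwise product of the mask with the kernel weights, so the only difference between the two masks is the b < a versus b <= a condition at the centre position.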
+
+### PixelCNN
+
+* Uses multiple convolution layers that preserve spatial resolution.
+* Makes receptive field large but not unbounded.
+* Masks are used to avoid seeing the future context.
+* Faster than PixelRNN at training or evaluation time (as convolutions can be parallelized easily); image generation remains sequential for both.
+
+### Multi-Scale PixelRNN
+
+* Composed of one unconditional PixelRNN and multiple conditional PixelRNNs.
+* The unconditional network generates a smaller s x s image, which is fed as input to the conditional PixelRNN that generates the larger n x n image (n is a multiple of s).
+* Conditional PixelRNN is a standard PixelRNN with layers biased with an upsampled version of the s x s image.
+* For upsampling, a convolution network with deconvolution layers constructs an enlarged feature map of size c x n x n.
+* For biasing, the c x n x n map is mapped to a 4h x n x n map (using a 1 x 1 unmasked convolution) and added to the input-to-state map of the corresponding layer.
+
+## Training and Evaluation
+
+* To compare with continuous models, pixel values are dequantized by adding real-valued noise; with uniform noise in [0, 1], the log-likelihoods of continuous and discrete models are directly comparable.
+* Update rule - RMSProp
+* Batch size - 16 for MNIST and CIFAR-10 and 32 (or 64) for ImageNet.
+* Residual connections are as effective as skip connections; in fact, the two can be used together as well.
+* PixelRNN outperforms other models on Binary MNIST and CIFAR-10.
+* For CIFAR-10, Diagonal BiLSTM > Row LSTM > PixelCNN. This is also the order of receptive field size for the 3 architectures, and the observation underlines the importance of having a large receptive field.
+* The paper also provides new benchmarks for generative image modelling on the ImageNet dataset.
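+
+The dequantization point above can be spelled out with a short derivation. The notation is assumed here: P is a discrete model over D-dimensional images with values in {0, ..., 255}, q is any continuous density, and u is uniform noise on [0, 1)^D.
+
+```latex
+% Piecewise-uniform density induced by the discrete model P
+p(\mathbf{y}) = P(\lfloor \mathbf{y} \rfloor), \qquad \mathbf{y} \in [0, 256)^D
+
+% For this density, the dequantized log-likelihood equals the discrete one
+\log p(\mathbf{x} + \mathbf{u}) = \log P(\mathbf{x})
+
+% For a general continuous density q, Jensen's inequality gives a lower bound
+% on the log-probability of the discrete model that q induces
+\mathbb{E}_{\mathbf{u}}\left[\log q(\mathbf{x} + \mathbf{u})\right]
+    \le \log \int_{[0,1)^D} q(\mathbf{x} + \mathbf{u}) \, d\mathbf{u}
+```
+
+This is why a continuous model evaluated on uniformly dequantized data and a discrete model evaluated on the raw pixel values can be placed in the same table.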
+