@shagunsodhani
Created October 9, 2016 13:22

    # Pixel Recurrent Neural Network

    ## Introduction

    * Problem: Building an expressive, tractable and scalable image model which can be used in downstream tasks like image generation, reconstruction, compression etc.
    * [Link to the paper](https://arxiv.org/abs/1601.06759)

    ## Model

    * Scan the image, one row at a time and one pixel at a time (within each row).
    * Given the scanned content, predict the distribution over the possible values for the next pixel.
    * The joint distribution over the pixel values is factorised into a product of conditional distributions, thus casting the problem as a sequence problem.
    * Parameters used in prediction are shared across all the pixel positions.
    * Since each pixel is jointly determined by 3 values (3 colour channels), each channel may be conditioned on other channels as well.
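
The factorisation described above can be written out as follows. The notation is an assumption made here for illustration: $x_i$ is the $i$-th pixel of an $n \times n$ image in row-by-row scan order, $\mathbf{x}_{<i}$ denotes all pixels scanned before it, and $x_{i,R}$, $x_{i,G}$, $x_{i,B}$ are its three colour channels.

$$
p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})
$$

$$
p(x_i \mid \mathbf{x}_{<i}) = p(x_{i,R} \mid \mathbf{x}_{<i}) \; p(x_{i,G} \mid \mathbf{x}_{<i}, x_{i,R}) \; p(x_{i,B} \mid \mathbf{x}_{<i}, x_{i,R}, x_{i,G})
$$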

    ### Pixel as discrete value

    * The conditional distributions are multinomial (with channel variable taking 1 of 256 discrete values).
    * This discrete representation is simpler and easier to learn.
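
As a rough illustration of the discrete formulation, the sketch below turns 256 unnormalised scores per channel into a log-probability with a softmax and reports the negative log-likelihood in bits per dimension. The function names and the toy inputs are hypothetical and not taken from the paper's code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def bits_per_dim(logits_per_channel, targets):
    """Average negative log2-likelihood of the target channel values.

    logits_per_channel: one list of 256 scores per channel value to predict.
    targets: the observed integer value (0..255) for each of those channels.
    """
    total_nats = 0.0
    for logits, target in zip(logits_per_channel, targets):
        total_nats -= log_softmax(logits)[target]
    return total_nats / (len(targets) * math.log(2))

# A model that knows nothing (all scores equal) spends about 8 bits
# per channel value, since log2(256) = 8.
uniform_logits = [[0.0] * 256 for _ in range(3)]
print(bits_per_dim(uniform_logits, [12, 200, 255]))
```

Any trained model should score below this uniform baseline.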

    ## Pixel RNN

    ### Row LSTM

    * Unidirectional layer that processes the image row by row.
    * Uses one-dimensional convolution (kernel of size kx1, k>=3).
    * Refer to image 2 in the [paper](https://arxiv.org/abs/1601.06759).
    * Weight sharing in the convolution ensures translation invariance of the computed features along each row.
    * For LSTM, the input-to-state component is computed for the entire 2-d input map and then is masked to include only the valid context.
    * For the equations of the state-to-state component, refer to equation 3 in the [paper](https://arxiv.org/abs/1601.06759).

    ### Diagonal BiLSTM

    * Bidirectional layer that processes the image in a diagonal fashion.
    * The input map is skewed by offsetting each row of the image by one position with respect to the previous row.
    * Refer to image 3 in the [paper](https://arxiv.org/abs/1601.06759).
    * For both directions, the input-to-state component is a 1 x 1 convolution, while the state-to-state recurrent component is computed with a column-wise convolution using a kernel of size 2 x 1.
    * Kernel size of 2x1 processes minimal information yielding a highly non-linear computation.
    * Output map is skewed back by removing the offset positions.
    * To prevent the layer from seeing future pixels, the right output map is shifted down by one row and then added to the left output map.
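
The skewing and un-skewing steps can be sketched on plain nested lists as below. This is a minimal illustration only: the helper names `skew` and `unskew` and the zero padding value are assumptions, and the recurrent computation itself is omitted.

```python
def skew(image, pad=0):
    """Offset row r by r positions to the right.

    An n x m map becomes n x (m + n - 1); each column of the result
    then holds one diagonal of the original image.
    """
    n, m = len(image), len(image[0])
    width = m + n - 1
    return [[pad] * r + list(row) + [pad] * (width - m - r)
            for r, row in enumerate(image)]

def unskew(skewed, m):
    """Remove the per-row offsets again, recovering an n x m map."""
    return [row[r:r + m] for r, row in enumerate(skewed)]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
skewed = skew(image)
for row in skewed:
    print(row)
# [1, 2, 3, 0, 0]
# [0, 4, 5, 6, 0]
# [0, 0, 7, 8, 9]
```

Reading the skewed map column by column visits the image one diagonal at a time (column 2 holds 3, 5, 7), which is what lets a column-wise convolution process a whole diagonal in one step.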

    ### Residual Connections

    * Residual connections around the recurrent layers are used to increase convergence speed and to propagate signals more directly through the network.
    * Refer to image 4 in the [paper](https://arxiv.org/abs/1601.06759).

    ### Masked Convolutions

    * Masks are used to enforce certain restrictions on the connections in the network (e.g. when predicting values for the R channel, values of the B channel cannot be used).
    * Mask A is applied only to the first convolution layer and restricts connections to those neighbouring pixels and colour channels that have already been seen.
    * Mask B is applied to all subsequent input-to-state convolution transitions and relaxes mask A by also allowing connections from a colour channel to itself.
    * Refer to image 4 in the [paper](https://arxiv.org/abs/1601.06759).
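
One way to build the two masks is sketched below. It assumes a square kernel of odd size, the channel order R, G, B, and a single feature map per colour; the function name `build_mask` and the `mask[out][in][row][col]` layout are hypothetical choices made for this example.

```python
def build_mask(kernel_size, mask_type, n_colours=3):
    """Return mask[out_colour][in_colour][row][col] with entries 0 or 1."""
    assert kernel_size % 2 == 1 and mask_type in ("A", "B")
    centre = kernel_size // 2
    mask = []
    for out_c in range(n_colours):
        per_out = []
        for in_c in range(n_colours):
            plane = [[0] * kernel_size for _ in range(kernel_size)]
            for r in range(kernel_size):
                for c in range(kernel_size):
                    # Pixels above, or to the left in the same row,
                    # have already been scanned.
                    if r < centre or (r == centre and c < centre):
                        plane[r][c] = 1
            # At the current pixel only earlier colours are visible;
            # mask B additionally lets a colour see itself.
            if in_c < out_c or (mask_type == "B" and in_c == out_c):
                plane[centre][centre] = 1
            per_out.append(plane)
        mask.append(per_out)
    return mask

mask_a = build_mask(3, "A")
mask_b = build_mask(3, "B")
R, G, B = 0, 1, 2
print(mask_a[R][B][1][1])  # 0: R never sees B of the current pixel
print(mask_a[B][R][1][1])  # 1: B may see R of the current pixel
print(mask_b[G][G][1][1])  # 1: mask B allows a colour to see itself
```

Multiplying the convolution weights elementwise by such a mask before applying them zeroes out every connection to not-yet-generated values.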

    ### PixelCNN

    * Uses multiple convolution layers that preserve spatial resolution.
    * Makes receptive field large but not unbounded.
    * Masks are used to avoid seeing the future context.
    * Faster than PixelRNN at training or evaluation time (as convolutions can be parallelized easily).

    ### Multi-Scale PixelRNN

    * Composed of one unconditional PixelRNN and multiple conditional PixelRNNs.
    * The unconditional network generates a smaller s x s image, which is fed as an additional input to the conditional PixelRNN that generates the larger n x n image (n is a multiple of s).
    * Conditional PixelRNN is a standard PixelRNN with layers biased with an upsampled version of the s x s image.
    * For upsampling, a convolution network with deconvolution layers constructs an enlarged feature map of size c x n x n.
    * For biasing, the c x n x n map is mapped to a 4h x n x n map (using a 1 x 1 unmasked convolution) and added to the input-to-state map of the corresponding layer, where h is the number of hidden feature maps of the LSTM.
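
The biasing step can be illustrated with a toy, pure-Python 1 x 1 convolution. The helper names, the tiny sizes (c = 2, n = 2, h = 1) and the weight values are all made up for the example; the upsampling network that produces the c x n x n map is not shown.

```python
def conv1x1(feature_map, weights):
    """Unmasked 1 x 1 convolution.

    feature_map: [c][n][n] nested lists; weights: [out][c].
    Every output pixel is a weighted sum over the input channels
    at the same spatial position.
    """
    n_rows, n_cols = len(feature_map[0]), len(feature_map[0][0])
    return [[[sum(w * feature_map[ch][i][j] for ch, w in enumerate(w_row))
              for j in range(n_cols)]
             for i in range(n_rows)]
            for w_row in weights]

def add_bias_map(input_to_state, conditioning, weights):
    """Add the projected conditioning map to the input-to-state map."""
    bias = conv1x1(conditioning, weights)
    return [[[a + b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(plane_a, plane_b)]
            for plane_a, plane_b in zip(input_to_state, bias)]

conditioning = [[[1, 2], [3, 4]],        # c = 2 channels, n = 2
                [[10, 20], [30, 40]]]
weights = [[1, 0], [0, 1], [1, 1], [2, -1]]   # 4h = 4 outputs, h = 1
input_to_state = [[[0, 0], [0, 0]] for _ in range(4)]
biased = add_bias_map(input_to_state, conditioning, weights)
print(biased[2])  # [[11, 22], [33, 44]]
```

Because the convolution is 1 x 1 and unmasked, each position of the conditioning map influences only the same position of the input-to-state map.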

    ## Training and Evaluation

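    * The models are trained and evaluated on the discrete log-likelihood. Continuous density models are usually fit to pixel values dequantized with real-valued noise; with uniform noise in [0, 1], the log-likelihoods of continuous and discrete models can be compared directly.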
    * Update rule - RMSProp
    * Batch size - 16 for MNIST and CIFAR-10; 64 (for 32 x 32 images) or 32 (for 64 x 64 images) for ImageNet.
    * Residual connections are as effective as skip connections, and the two can also be used together.
    * PixelRNN outperforms the other models it is compared against on Binary MNIST and CIFAR-10.
    * For CIFAR-10, Diagonal BiLSTM > Row LSTM > PixelCNN. This is also the order of receptive field size for the three architectures, which underlines the importance of having a large receptive field.
    * The paper also provides new benchmarks for generative image modelling on the ImageNet dataset.