The major innovation of recurrent neural networks (RNNs) is that each output is a function of both the previous output and new data. As a result, RNNs gain the ability to incorporate information on previous observations into the computation they perform on a new feature vector, effectively creating a model with memory. This recurrent formulation enables parameter sharing across a much deeper computational graph that includes cycles. Prominent architectures include Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), which aim to overcome the challenge of vanishing gradients associated with learning long-range dependencies, where errors need to be propagated over many connections.
RNNs have been successfully applied to various tasks that require mapping one or more input sequences to one or more output sequences and are particularly well suited to natural language. They can also be applied to univariate and multivariate time series to predict market or fundamental data. This chapter covers how RNNs can model alternative text data using the word embeddings that we covered in Chapter 15 to classify the sentiment expressed in documents. More specifically, this chapter addresses:
- How to unroll and analyze the computational graph for an RNN
- How gated units learn to regulate an RNN’s memory from data to enable long-range dependencies
- How to design and train RNNs for univariate and multivariate time series in Python
- How to leverage word embeddings for sentiment analysis with RNNs
RNNs assume that data is sequential so that previous data points impact the current observation and are relevant for predictions of subsequent elements in the sequence. They allow for more complex and diverse input-output relationships than feedforward networks (FFNN) and convolutional nets (CNNs), which are designed to map one input to one output vector, usually of fixed size and using a given number of computational steps. RNNs, in contrast, can model data for tasks where the input, the output, or both are best represented as a sequence of vectors.
Note that input and output sequences can be of arbitrary length because the recurrent transformation, which is fixed but learned from the data, can be applied as many times as needed. Just as CNNs easily scale to large images and some CNNs can process images of variable size, RNNs scale to much longer sequences than networks not tailored to sequence-based tasks. Most RNNs can also process sequences of variable length.
RNNs are called recurrent because they apply the same transformation to every element of a sequence such that the output depends on the outcome of prior iterations. As a result, RNNs maintain an internal state that captures information about previous elements in the sequence, akin to a memory.
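To make the recurrence concrete, here is a minimal NumPy sketch of a plain RNN forward pass. The function name `rnn_forward`, the weight names, the `tanh` activation, and the layer sizes are our own illustrative assumptions and are not taken from the chapter's notebooks.

```python
import numpy as np


def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Apply the same recurrent transformation to every step of one sequence.

    inputs: array of shape (timesteps, n_features)
    returns: hidden states of shape (timesteps, n_hidden)
    """
    n_hidden = W_hh.shape[0]
    h = np.zeros(n_hidden)  # initial state: nothing remembered yet
    states = []
    for x_t in inputs:  # the same weights are reused at every step
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)


# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(seed=42)
n_features, n_hidden = 3, 4
W_xh = rng.normal(scale=0.5, size=(n_features, n_hidden))
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)

# Two sequences of different length share one set of parameters
states_short = rnn_forward(rng.normal(size=(5, n_features)), W_xh, W_hh, b_h)
states_long = rnn_forward(rng.normal(size=(12, n_features)), W_xh, W_hh, b_h)
print(states_short.shape, states_long.shape)  # (5, 4) (12, 4)
```

Because the loop body never changes, the number of parameters is independent of the sequence length, which is what allows the same model to process sequences of variable length.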
The backpropagation algorithm updates the weight parameters based on the gradient of the loss function with respect to the parameters. For an RNN, it involves a forward pass from left to right along the unrolled computational graph, followed by a backward pass in the opposite direction, a procedure known as backpropagation through time.
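The two passes can be sketched for a deliberately tiny case: a scalar RNN with a single recurrent weight `w`, a single input weight `u`, and a squared-error loss on the final state. This setup, the function names, and the numbers are assumptions chosen for illustration; the chapter's notebooks rely on Keras to compute gradients automatically.

```python
import math


def forward(w, u, xs):
    """Forward pass along the unrolled graph; returns hidden states h_0..h_T."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs


def loss_and_grads(w, u, xs, y):
    """Squared error on the final state and its gradients w.r.t. w and u."""
    hs = forward(w, u, xs)
    loss = 0.5 * (hs[-1] - y) ** 2
    dh = hs[-1] - y  # dL/dh_T
    dw = du = 0.0
    for t in range(len(xs), 0, -1):  # backward pass, right to left
        da = dh * (1.0 - hs[t] ** 2)  # derivative through tanh
        dw += da * hs[t - 1]  # shared weights: contributions add up over steps
        du += da * xs[t - 1]
        dh = da * w  # propagate the error to the previous state
    return loss, dw, du


# Made-up inputs, target, and weights
xs, y = [0.5, -0.1, 0.3, 0.8], 0.2
w, u = 0.9, 0.4
loss, dw, du = loss_and_grads(w, u, xs, y)

# Central finite differences as an independent check of the analytic gradients
eps = 1e-6
numeric_dw = (loss_and_grads(w + eps, u, xs, y)[0]
              - loss_and_grads(w - eps, u, xs, y)[0]) / (2 * eps)
numeric_du = (loss_and_grads(w, u + eps, xs, y)[0]
              - loss_and_grads(w, u - eps, xs, y)[0]) / (2 * eps)
print(abs(dw - numeric_dw) < 1e-6, abs(du - numeric_du) < 1e-6)
```

Note how the error is multiplied by `w` and by the `tanh` derivative at every step of the backward loop. Repeating this over many steps is what makes gradients shrink or grow, which motivates the gated architectures.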
- Sequence Modeling: Recurrent and Recursive Nets, Deep Learning Book, Chapter 10, Ian Goodfellow, Yoshua Bengio and Aaron Courville, MIT Press, 2016
- Supervised Sequence Labelling with Recurrent Neural Networks, Alex Graves, 2013
- Tutorial on LSTM Recurrent Networks, Juergen Schmidhuber, 2003
- The Unreasonable Effectiveness of Recurrent Neural Networks, Andrej Karpathy, 2015
RNNs can be designed in a variety of ways to best capture the functional relationship and dynamics between input and output data. In addition to the recurrent connections between the hidden states, there are several alternative approaches, including recurrent output relationships, bidirectional RNNs, and encoder-decoder architectures.
RNNs with an LSTM architecture have more complex units that maintain an internal state and contain gates to keep track of dependencies between elements of the input sequence and regulate the cell’s state accordingly. These gates recurrently connect to each other instead of the usual hidden units we encountered above. They aim to address the problem of vanishing and exploding gradients by letting gradients pass through unchanged.
A typical LSTM unit combines four parameterized layers that interact with each other and the cell state by transforming and passing along vectors. These layers usually involve an input gate, an output gate, and a forget gate, but there are variations that may have additional gates or lack some of these mechanisms.
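The following NumPy sketch shows one common formulation of a single LSTM step with the four parameterized layers (forget, input, and output gates plus the candidate values). Stacking the four layers into one weight matrix `W`, the function names, and all sizes are illustrative assumptions; the variations mentioned above would change this step.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step.

    W has shape (n_features + n_hidden, 4 * n_hidden): the four
    parameterized layers are stored side by side in a single matrix.
    """
    n_hidden = h_prev.shape[0]
    z = np.concatenate([x_t, h_prev]) @ W + b
    f = sigmoid(z[:n_hidden])                   # forget gate
    i = sigmoid(z[n_hidden:2 * n_hidden])       # input gate
    o = sigmoid(z[2 * n_hidden:3 * n_hidden])   # output gate
    g = np.tanh(z[3 * n_hidden:])               # candidate values
    c = f * c_prev + i * g                      # additive cell state update
    h = o * np.tanh(c)                          # exposed hidden state
    return h, c


# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(seed=0)
n_features, n_hidden = 3, 4
W = rng.normal(scale=0.1, size=(n_features + n_hidden, 4 * n_hidden))
b = np.zeros(4 * n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(6, n_features)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The cell state is updated additively and scaled only by the forget gate, so when that gate is close to one, information and gradients can flow across many steps largely unchanged.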
- Understanding LSTM Networks, Christopher Olah, 2015
- An Empirical Exploration of Recurrent Network Architectures, Rafal Jozefowicz, Ilya Sutskever, et al., 2015
Gated recurrent units (GRU) simplify LSTM units by omitting the output gate. They have been shown to achieve similar performance on certain language modeling tasks but do better on smaller datasets.
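For comparison, a single GRU step can be sketched in NumPy as follows. It uses an update gate and a reset gate and keeps no separate cell state or output gate. Conventions for the final interpolation differ between papers and libraries; the variant below, the function names, and the sizes are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x_t, h_prev, W, U, b):
    """One GRU step.

    W: (n_features, 3 * n_hidden) input weights
    U: (n_hidden, 3 * n_hidden) recurrent weights
    b: (3 * n_hidden,) biases
    """
    n = h_prev.shape[0]
    xw = x_t @ W + b
    z = sigmoid(xw[:n] + h_prev @ U[:, :n])             # update gate
    r = sigmoid(xw[n:2 * n] + h_prev @ U[:, n:2 * n])   # reset gate
    h_tilde = np.tanh(xw[2 * n:] + (r * h_prev) @ U[:, 2 * n:])  # candidate
    return (1.0 - z) * h_prev + z * h_tilde  # blend old state and candidate


# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(seed=1)
n_features, n_hidden = 3, 4
W = rng.normal(scale=0.1, size=(n_features, 3 * n_hidden))
U = rng.normal(scale=0.1, size=(n_hidden, 3 * n_hidden))
b = np.zeros(3 * n_hidden)

h = np.zeros(n_hidden)
for x_t in rng.normal(size=(6, n_features)):
    h = gru_step(x_t, h, W, U, b)
print(h.shape)  # (4,)
```

With three weight blocks instead of four, a GRU layer of a given size has fewer parameters than the corresponding LSTM layer.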
- Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Kyunghyun Cho, Yoshua Bengio, et al., 2014
- Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio, 2014
We illustrate how to build RNNs using the Keras library for various scenarios. The first set of models includes regression and classification of univariate and multivariate time series. The second set of tasks focuses on sentiment analysis using text data converted to word embeddings (see Chapter 15).
The notebook univariate_time_series_regression demonstrates how to get data into the requisite shape and how to forecast the S&P 500 index values using a Recurrent Neural Network.
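Keras recurrent layers expect input of shape `(samples, timesteps, features)`. The sketch below shows one way to turn a single series into overlapping windows of that shape, with the value following each window as the target. The helper `make_windows`, the window length, and the synthetic random walk are our own assumptions; the notebook works with actual S&P 500 data and may prepare it differently.

```python
import numpy as np


def make_windows(series, window_size):
    """Turn a 1D series into RNN inputs and one-step-ahead targets.

    Returns X of shape (n_samples, window_size, 1) and y of shape (n_samples,).
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - window_size
    if n_samples <= 0:
        raise ValueError("series must be longer than window_size")
    X = np.stack([series[i:i + window_size] for i in range(n_samples)])
    y = series[window_size:]  # the value right after each window
    return X[..., np.newaxis], y  # add the trailing feature axis


# Synthetic random walk as a stand-in for index values (not real data)
rng = np.random.default_rng(seed=7)
prices = 100 + np.cumsum(rng.normal(size=500))

X, y = make_windows(prices, window_size=63)
print(X.shape, y.shape)  # (437, 63, 1) (437,)
```

In practice, the series is typically scaled before windowing, and the train and test split should respect the time order to avoid lookahead bias.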
We'll now build a slightly deeper model by stacking two LSTM layers using the Quandl stock price data (see the stacked_lstm_with_feature_embeddings notebook for implementation details). Furthermore, we will include features that are not sequential in nature, namely indicator variables that identify the ticker and time periods like month and year.
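A model along these lines could be sketched with the Keras functional API as shown below. All layer sizes, the window length, the number of tickers, and the choice to embed only the ticker and month indicators are hypothetical; see the notebook for the architecture actually used.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical dimensions, for illustration only
window_size = 52   # number of past periods per sample
n_tickers = 100    # number of distinct tickers
n_months = 12

# Sequential input: one return per period
returns = keras.Input(shape=(window_size, 1), name="returns")
# Non-sequential inputs: integer-coded indicator variables
ticker = keras.Input(shape=(1,), dtype="int32", name="ticker")
month = keras.Input(shape=(1,), dtype="int32", name="month")

# Two stacked LSTM layers: the first must return the full sequence
# so that the second receives one vector per time step
x = layers.LSTM(25, return_sequences=True, name="lstm_1")(returns)
x = layers.LSTM(10, name="lstm_2")(x)

# Learn a small dense vector for each categorical value
ticker_emb = layers.Flatten()(layers.Embedding(n_tickers, 5)(ticker))
month_emb = layers.Flatten()(layers.Embedding(n_months, 2)(month))

# Combine the sequence summary with the embedded indicators
merged = layers.concatenate([x, ticker_emb, month_emb])
hidden = layers.Dense(10, activation="relu")(merged)
output = layers.Dense(1, name="prediction")(hidden)

model = keras.Model(inputs=[returns, ticker, month], outputs=output)
model.compile(loss="mse", optimizer="adam")
model.summary()
```

Setting `return_sequences=True` on the first LSTM layer is the key detail when stacking: without it, the layer would emit only its final state and the second LSTM would have no sequence to process.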
So far, we have limited our modeling efforts to single time series. RNNs are naturally well suited to multivariate time series and represent a non-linear alternative to the Vector Autoregressive (VAR) models we covered in Chapter 8, Time Series Models.
The notebook multivariate_timeseries demonstrates the application of RNNs to modeling and forecasting several time series using the same dataset we used for the VAR example, namely monthly data on consumer sentiment and industrial production from the Federal Reserve's FRED service.
RNNs are commonly applied to various natural language processing tasks. We've already encountered sentiment analysis using text data in part three of this book.
The notebook sentiment_analysis illustrates how to apply an RNN model to text data to detect positive or negative sentiment (which can easily be extended to a finer-grained sentiment scale). We are going to use word embeddings to represent the tokens in the documents. We covered word embeddings in Chapter 15, Word Embeddings. They are an excellent technique to convert text into a continuous vector representation such that the relative location of words in the latent space encodes useful semantic aspects based on the words' usage in context.
In this example, we again use Keras' built-in embedding layer that allows us to train vector representations specific to the task at hand. In the next example, we use pretrained vectors instead.
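A binary sentiment classifier with a trainable embedding layer might look like the following sketch. The vocabulary size, sequence length, embedding dimension, and the choice of a GRU layer are assumptions for illustration rather than the notebook's settings, and the random integers merely stand in for tokenized, padded reviews.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical settings, for illustration only
vocab_size = 20000    # number of distinct token ids
max_len = 100         # documents padded or truncated to this length
embedding_dim = 100

model = keras.Sequential([
    keras.Input(shape=(max_len,), dtype="int32"),
    # Maps each token id to a dense vector learned during training
    layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    # Summarizes the sequence of word vectors in a single state vector
    layers.GRU(32),
    # Probability that the document expresses positive sentiment
    layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Random token ids as a stand-in for real documents
fake_batch = np.random.randint(0, vocab_size, size=(4, max_len))
probabilities = model.predict(fake_batch, verbose=0)
print(probabilities.shape)  # (4, 1)
```

Extending the model to a finer-grained sentiment scale would mainly require replacing the final layer with a softmax over the desired number of classes and switching to a categorical loss.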
In Chapter 15, Word Embeddings, we showed how to learn domain-specific word embeddings. Word2vec and related learning algorithms produce high-quality word vectors but require large datasets. Hence, research groups commonly share word vectors trained on large datasets, similar to the weights for pretrained deep learning models that we encountered in the section on transfer learning in the previous chapter.
The notebook sentiment_analysis_pretrained_embeddings illustrates how to use pretrained Global Vectors for Word Representation (GloVe) provided by the Stanford NLP group with the IMDB review dataset.
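Pretrained vectors in the plain-text GloVe format (one token per line followed by its vector components) can be mapped onto a model's vocabulary roughly as sketched below. The helper names, the `word_index` mapping, and the three-dimensional toy vectors are made up for illustration and are not real GloVe values; with the actual download, `lines` would be an open file handle.

```python
import io

import numpy as np


def load_vectors(lines):
    """Parse lines of the form 'token v1 v2 ... vn' into a dict of arrays."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.array([float(v) for v in parts[1:]], dtype="float32")
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the pretrained vector for the token with id i.

    Row 0 is reserved for padding; tokens without a pretrained
    vector keep an all-zero row.
    """
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = vec
    return matrix


# Toy stand-in for a GloVe file (values are invented)
toy_file = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\nmovie 0.0 0.5 0.1\n")
vectors = load_vectors(toy_file)

# Hypothetical token-to-id mapping produced by a tokenizer
word_index = {"good": 1, "movie": 2, "unseen": 3}
embedding_matrix = build_embedding_matrix(word_index, vectors, dim=3)
print(embedding_matrix.shape)  # (4, 3)
```

The resulting matrix can then initialize a Keras `Embedding` layer, for example via `embeddings_initializer=keras.initializers.Constant(embedding_matrix)` together with `trainable=False` to keep the pretrained vectors fixed during training.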
- Large Movie Review Dataset, Stanford AI Group
- GloVe: Global Vectors for Word Representation, Stanford NLP