Recurrent Language Models with RNNs

In this article, we will first try to understand the basics of language models, what Recurrent Neural Networks are and how can we use them to solve the problem of language modeling.

7 years ago   •   13 min read

By Felipe Ducau

Building an intuition

There is a vast amount of data which is inherently sequential, such as speech, time series (weather, financial, etc.), sensor data, video, and text, just to mention some. Recurrent Neural Networks (RNNs) are a family of neural networks designed specifically for sequential data processing. In this article, we will learn about RNNs by exploring the particularities of text understanding, representation, and generation.

What does it mean for a machine to understand natural language? There are a number of different approaches to try to answer this question. One capability we would expect from such machine is the ability to tell us how likely a given sentence is. In this approach, we assume that the likeliness of a sentence can be determined from everyday use of language.

Neural language models attempt to solve the problem of determining the likelihood of a sentence in the real world. Even though it does not seem to be the most exciting task in the world on the surface, this type of modelling is an essential building block for understanding natural language and a fundamental task in natural language processing (NLP). Once we have such system in place, we can use it later to generate new text with some fantastic outcomes. Once we are armed with this model, we also gain important insights in to other tasks such as machine translation and dialogue generation.

Note on linguistic units:We can probably agree on the fact that a sentence can be understood as a sequence, but a sequence of what? characters? words? something in the middle? This is an active area of research and there is not consensus as to what constitutes the basic unit in written language. For now, we will simply use words as our linguistic unit.

In this article, we will first try to understand the basics of language models, what Recurrent Neural Networks are and how can we use them to solve the problem of language modeling. With that in place, we will try to understand the reason why and how this approach gets to work and generalize. Finally, we give a description on how we can put together all these tools in order to generate text automatically.

Being this article intended to be an introduction to Natural Language Understanding and text generation, we relegate most of the practical and implementation details for the next post of the series in which we will take a more hands-on approach to text generation in different forms.

Word level Language Models

We can now start formalizing our ideas, let's consider a sentence $S$ composed by $T$ words, such that

$$S = (w_1 , w _2 , ... , w _T)$$

At the same time, each symbol $w_i$ is part of a vocabulary $V$ which contains all the possible words,

$$V = \{ v_1, v _2, ..., v _{|V|} \}$$

where $|V|$ represents the size of the vocabulary.

If we want to compute the probability of a sentence, we can use the chain rule to get

$$p(S) = p(w_1 , w _2 , ... , w _T) = p(w _1) \cdot p(w _2 | w _1) \cdot p(w _3 | w _2, w _1) \cdot \cdot \cdot p(w _T | w _{T-1}, w _{T-2}, ..., w _1 )$$

or, more consisely,

$$p(S) = \prod_{t=1}^T p(w _t | w _{\lt t})$$

where $w _{\lt t}$ refers to all the words previous to time $t$.

The probability of the word at time $t$ depends on the previous words of the sentence. In order to compute the probability of a sentence we just need to compute the individual terms $p(w _t | w _{\lt t})$ and multiply them together.

A language model, in general, attempts to predict the next word $w_{t+1}$ given the preceding words $w _{\lt t}$ at each time step $t$.

Recurrent Neural Networks (RNNs)

Humans base much of their understanding from context. Let's consider the following sequence - Paris is the largest city of ______. It is easy to fill the blank with France. This means that there is information about the last word encoded in the previous elements of the sequence.

The idea behind this architecture is to exploit this sequential structure of the data. The name of this neural networks comes from the fact that they operate in a recurrent way. This means that the same operation is performed for every element of a sequence, with its output depending on the current input, and the previous operations.

This is achieved by looping an output of the network at time $t$ with the input of the network at time $t+1$. These loops allow persistence of information from one time step to the next one.

Figure 1. Left: Circuit diagram. The black square represents a time delay of a single time step. Right: Same network as an unfolded computational graph, each node is associated with a particular time.

This structure with loops might be a bit confusing, but it becomes intuitive when we look at the chain formed when we unroll the computational graph. Now we have an architecture which can receive different inputs at each time step $x_t$, has the capability of producing outputs at each time step $o_t$ and maintains a memory state $h_t$ which contains information about what happened in the network up to time $t$.

It is important to notice that we can unroll this network as many times as elements in the input sequence. Furthermore, the parameters of each "realization" of the RNN cell are the same, making the number of parameters of the model independent of the length of the sequence.

Essential to the success of a model like this is the operations we performed inside the RNN unit. In the next article of this series we will look inside this cell, particularly we will introduce the Gated Recurrent Unit (GRU) a particular type of RNN. For now, we can think of the RNN cell as a computation in which we update the memory vector $h$ deciding, at each timestep, which information we want to keep, which information is not relevant anymore and we would like to forget and which information to add from the new input. The RNN cell also creates an output vector which is tightly related to the current hidden state (or memory vector).

It is worth noticing that Recurrent Neural Networks can be used in a variety of scenarios depending in how the inputs are fed and the outputs are interpreted. These scenarios can be divided into three main different classes:

  • Sequential input to sequential output. Machine translation / part-of-speech tagging and language modeling tasks lie within this class.
  • Sequential input to single output. One task with this property is sentiment analysis, in which we fed a sentence and we want to classify it as positive, neutral or negative.
  • Single input to sequential output. This is, for example, the case of image captioning: where we fed a picture to the RNN and want to generate a description of it.

Recurrent Language Models

In this section we look at how we use this architecture for the task of language modelling as defined previously. Imagine we want to compute the likelihood of the sentence "And the boss is a cat named Joe". To do so we need to estimate the following probabilities,

$$ p(\text{And}), p(\text{the}| \text{And}), p(\text{boss}| \text{And the}) ..., p(\text{Joe}|\text{And the boss is a cat named}) $$

Word Embedding

The first thing we have to take care of is how we define the input to our language modeling function. It is clear that the input is a sequence of $T - 1$ words, but, how each of these words should be represented? As in most deep learning systems, we would like to constraint the model with prior knowledge as least as possible. One encoding scheme that meets this criterion is a 1-of-K encoding (one-hot-encoding).

Under this scheme, each word $i$ in the vocabulary $V$ is represented as a binary vector $\mathbf{w} _i$. To denote the $i$-th word with the vector $\mathbf{w} _i$, we set the $i$-th element of it to be 1 and all the other elements to be 0.

$$ w _i = [0,0,...,1 (i\text{-th element}), 0,...,0]^T \in \{ 0, 1\}^{|V|} $$

The prior knowledge embedded in this encoding scheme is minimal in the sense that the distance between any two word vectors is equal to 1 if the two words are different and 0 if the words are the same word.

Now, the input to our model is a sequence of $T-1$ one-hot vectors $(\mathbf{w} _1, \mathbf{w} _2, ..., \mathbf{w} _{T-1})$. Each of these vectors will be then multiplied by a weight matrix $\mathbf{E}$, to get a sequence of continuous vectors $(\mathbf{x} _1, \mathbf{x} _2, ..., \mathbf{x} _{T-1})$ such that,

$$ \mathbf{x} _j = \mathbf{E}^T \mathbf{w} _j $$

Performance note: Actually this matrix-vector multiplication is not performed. Since the vector $\mathbf{w} _j$ has only one element equal to 1 ($i$-th element) and the rest are zeros, the multiplication is equivalent to just taking the $i$-th row of $\mathbf{E}$. Because slicing a matrix is faster than performing the multiplication, that is the way this operation is performed in practice.

Forward path

Figure 2 illustrates how we approach this problem. First, we initialize the memory vector $h$ to zero. In the first time step (denoted by zero) the input to the RNN unit is special token <\s> which symbolizes the beginning of a sentence. As an output, we get the probability of every possible word in the vocabulary given the start of sentence token. The memory vector gets updated in this same operation and sent to the next time step. Now we repeat the procedure for time step 1 in which And is the input of the cell, $h_1$ is the memory state which contains information about the past and $p(w_2|\text{<\s> And})$ is the output.

Figure 2. RNN for language modelling.

In general, at each time step, we seek to estimate a probability distribution over all the possible next words in the vocabulary $V$ given the previous words. The output layer of the RNN is then a softmax layer which returns a vector of size $|V|$ whose $i$-th element indicates the predicted probability of the word $V_i$ being the next word to appear in the sentence.

For the case in which the output at time $t$ is an affine transformation of the computed memory state $h_t$, we have

$$ p(w_t = k | w_{\lt t}) = \frac{exp(\mathbf{v} _k^T h _t + b _k)}{\sum _{k'} exp(\mathbf{v} _{k'}^T h _t + b _{k'})}$$

Backward pass

We have already an architecture that can potentially learn to score sequences making use of recurrent neural networks. The only missing part is to define an appropriate loss function for the network to actually learn what we are expecting it to learn.

The loss for a given sequence is the negative log probability that the model assigns to the correct output. This is $L = - log(p(w _1, w _2, ..., w _T))$. Using the chain rule and the fact that the log of a product equals the sum of the logs, we obtain, for a sentence $\mathbf{x}$,

$$ L(\mathbf{x}) = - \sum_t log \ p _{model} (w _t = x _{t + 1}) = - \sum _t log \ \mathbf{o} _t [x _{t+1}] $$

where $\mathbf{o} _t [x _{t+1}]$ is the element of the output softmax corresponding to the real word $x _{t+1}$.

With the loss defined and given that the whole system is differentiable, we can backpropagate the loss through all the previous RNN units and embedding matrices and update its weights accordingly.

Generalization to unseen n-grams

Now that we described our language model, let us look into what happens inside to better understand why this approach is so powerful. In particular, we focus on how the model generalizes to unseen $n$-grams (sequence of $n$ consecutive words). This is one of the main advantages of neural language models compared with classical NLP approaches such as $n$-gram language model.

Our model can be thought as a composite of two functions ($f \circ g$). The first function $f$ maps a sequence of previous words (preceding $n-1$ words) onto a continuous vector space. The resulting vector $\mathbf{h}$ is the memory state or context vector.
$$ f: \{0,1\}^{|V| \times (n-1)} \rightarrow \mathbb R^d $$

The second stage, characterized by $g$ maps the continuous vector $h$ to the target word probability, by applying an affine transformation (multiplication by a matrix and addition to a bias vector) followed by a softmax normalization (to convert the output into a valid probability distribution).

$$ g(\mathbf{h}) = softmax(\mathbf{U}^T \mathbf{h} + \mathbf{c}) $$

Lets look a bit closer at the operation performed by $g$. We can ignore the effect of the bias term. We can express the matrix-vector multiplication as follows,


where each element of the output vector is the vector multiplication of $\mathbf{h}$ with the corresponding row of $\mathbf{U}^T$. You can use this visualization to convince yourself about this fact.

This means that the predicted probability of the model for the $i$-th word of the dictionary is proportional to how aligned the $i$-th column of $\mathbf{U}^T$ is with the context vector $\mathbf{h}$.

Now, if we have two sequences of context words which are usually followed by a similar set of words, then the context vectors $\mathbf{h} _1$ and $\mathbf{h} _2$ have to be similar. Why? because once we multiply them by $\mathbf{U}^T$ we need the same set of words to have high probability.

What this means, in the end, is that the neural language model must project $n$-grams that are followed by the same word to nearby points in the context vector space. If this property is not met and the two $n$-grams are mapped to completely different vectors, then when multiplying each one by $\mathbf{U}^T$ we would obtain very different probabilities for the next word, resulting in a bad language model.

Let's try to illustrate this with an example in which, without loss of generality, we will assume that every word only depends on the previous word that appears in the sentence. In other words, we consider a bigram language model where the context length is one. Our training corpus is composed by the following three sentences 1:

  • There are three teams left for the qualification.
  • four teams have passed the first round.
  • four groups are playing in the field.

We will focus on the bold-faced phrases. The first word of each of these bigrams is a context word and the language model has to predict the probability of the following word.

From our previous analysis, we notice that the model must map the context words "three" and "four" to nearby points in the context space. This is because we need them to give a similar probability to the word "teams" (first and second sentences of the training set).

The target word vectors $\mathbf{u} _{teams}$ and $\mathbf{u} _{groups}$ will necessarily be close to each other as well, because otherwise the probability of "teams" given "four" ($p(\text{teams}|\text{four}) \propto \mathbf{u} _{teams} \mathbf{h} _{four}$) and the probability of "groups" given "four" ($p(\text{groups}|\text{four}) \propto \mathbf{u} _{groups} \mathbf{h} _{four}$) will be different despite the fact that they are equally likely in the training corpus (second and third sentences of the training set).

Now, assuming that we trained our language model with this tiny corpus, we could ask for the probability of the bigram "three groups" which the model has never seen. Our language model will project the context word "three" to a point in the context space close to "four". From this context vector, the model will have to assign a high probability to the word "groups", because the context vector $\mathbf{h} _{three}$ and the target word vector $\mathbf{u} _{gropus}$ are well aligned. So, even without ever seeing the bigram "three groups", this approach can assign a reasonable probability.

In this property is where much of the power of this approach relies on. The key property here is the use of distributed representations of words and context, i.e. the fact that we use continuous vector representations for words and context. The approach is based on the distributional hypothesis of language:

The Distributional Hypothesis is that words that occur in the same contexts tend to have similar meanings. The underlying is that "a word is characterized by the company it keeps". ACLWeb

This approach not only exploits the probabilistic nature of language in two main ways. On the one hand, it learns the probability of one word given the context, and in the other, it represents similar context by similar vectors in the context space and similar words by similar vectors in the word space.

Text Generation

How can we use this language model now to generate new pieces of text? It is relatively straight forward. If we want to generate a new sentence we just need to initialize the context vector $\mathbf{h} _0$ randomly, then unroll the RNN sampling at each time step one word from the output word probability distribution and feeding this word back to the input of the next time RNN unit.

Let's make a concrete example. Assume do not want just to generate generic text but we want it to have certain characteristics. For instance, we want to generate sentences as if Sheldon Cooper was the one writing. One approach would be to get all Big Bang Theory scripts, filter by Sheldon and train our language model on that data.

The problem with this approach is that the number of observations might not be large enough to train our system. The alternative we have then is to use pre-training, i.e. we first train our language model with generic text (we can use the Gutenberg corpus for example). Then, once our model has a certain understanding of words, context and text structure, we can continue with the training procedure just using Sheldon's lines to learn the particular way he speaks.

In the next post of this series, we will look closer at the implementation of this kind of models and do some experiments. For now, if you're curious about the capabilities of this techniques I would recommend you looking at the Silicon Valley automatic script generation in the Deep Writing blog.

Next steps

In the next post, we will take this to the next level. We will dive deeper into the implementation details of these models and mention some common issues and how they get tackled in practice. We will also suggest some open datasets and give some ideas on which kind of training data we can use. Finally, we will also try to implement our first text generation software from scratch using PyTorch and run some experiments. Keep posted!

Further reading

  • Experimenting with text generation. Deep writing blog
  • Natural Language Understanding with Distributed Representation. Kyunghyun Cho. New York University, 2015. Lecture Note for DS-GA 3001
  • The Unreasonable Effectiveness of Recurrent Neural Networks. Link
  • Understanding LSTM Networks. Link

1: Example borrowed from Prof. Kyunghyun Cho lecture notes on on Natural Language Understanding with Distributed Representations. New York University, 2015. Link.

To get started with your own ML-in-a-box setup, sign up here.

Spread the word

Keep reading