.. _chapter_rnn_scratch:
Implementation of Recurrent Neural Networks from Scratch
========================================================
In this section we implement a language model introduce in
:numref:`chapter_rnn` from scratch. It is based on a character-level
recurrent neural network trained on H. G. Wells’ *The Time Machine*. As
before, we start by reading the data set first, which is introduced in
:numref:`chapter_language_model`.
.. code:: python
%matplotlib inline
import d2l
import math
from mxnet import autograd, nd, gluon
batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
One-hot Encoding
----------------
Remember that each token is presented as a numerical index in
``train_iter``. Feeding these indices directly to the neural network
might make it hard to learn. We often present each token as a more
expressive feature vector. The easiest presentation is called *one-hot
encoding*.
In a nutshell, we map each index to a different unit vector: assume that
the number of different toeksn in the vocabulary is :math:`N` (the
``len(vocab)``) and the token indices range from 0 to :math:`N-1`. If
the index of a token is the integer :math:`i`, then we create a vector
:math:`\mathbf{e}_i` of all 0s with a length of :math:`N` and set the
element at position :math:`i` to 1. This vector is the one-hot vector of
the original token. The one-hot vectors with indices 0 and 2 are shown
below.
.. code:: python
nd.one_hot(nd.array([0, 2]), len(vocab))
.. parsed-literal::
:class: output
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.]]
The shape of the mini-batch we sample each time is (batch size, time
step). The ``one_hot`` function transforms such a mini-batch into a 3-D
tensor with the last dimension equals to the vocabulary size. We often
transpose the input so that we will obtain a (time step, batch size,
vocabulary size) output that fits into a sequence model easier.
.. code:: python
X = nd.arange(10).reshape((2, 5))
nd.one_hot(X.T, 28).shape
.. parsed-literal::
:class: output
(5, 2, 28)
Initializing the Model Parameters
---------------------------------
Next, we initialize the model parameters for a RNN model. The number of
hidden units ``num_hiddens`` is a tunable parameter.
.. code:: python
def get_params(vocab_size, num_hiddens, ctx):
num_inputs = num_outputs = vocab_size
normal = lambda shape: nd.random.normal(
scale=0.01, shape=shape, ctx=ctx)
# Hidden layer parameters
W_xh = normal((num_inputs, num_hiddens))
W_hh = normal((num_hiddens, num_hiddens))
b_h = nd.zeros(num_hiddens, ctx=ctx)
# Output layer parameters
W_hq = normal((num_hiddens, num_outputs))
b_q = nd.zeros(num_outputs, ctx=ctx)
# Attach a gradient
params = [W_xh, W_hh, b_h, W_hq, b_q]
for param in params: param.attach_grad()
return params
RNN Model
---------
First, we need an ``init_rnn_state`` function to return the hidden state
at initialization. It returns a tuple consisting of an NDArray with a
value of 0 and a shape of (batch size, number of hidden units). Using
tuples makes it easier to handle situations where the hidden state
contains multiple variables (e.g. when combining multiple layers in an
RNN where each layers requires initializing).
.. code:: python
def init_rnn_state(batch_size, num_hiddens, ctx):
return (nd.zeros(shape=(batch_size, num_hiddens), ctx=ctx), )
The following ``rnn`` function defines how to compute the hidden state
and output in a time step. The activation function here uses the tanh
function. As described in :numref:`chapter_mlp`, the mean value of the
:math:`\tanh` function values is 0 when the elements are evenly
distributed over the real numbers.
.. code:: python
def rnn(inputs, state, params):
# inputs shape: (num_steps, batch_size, vocab_size)
W_xh, W_hh, b_h, W_hq, b_q = params
H, = state
outputs = []
for X in inputs:
H = nd.tanh(nd.dot(X, W_xh) + nd.dot(H, W_hh) + b_h)
Y = nd.dot(H, W_hq) + b_q
outputs.append(Y)
return nd.concat(*outputs, dim=0), (H,)
Now we have all functions defined, next we create a class to wrap these
functions and store parameters.
.. code:: python
# Save to the d2l package.
class RNNModelScratch(object):
"""A RNN Model based on scratch implementations"""
def __init__(self, vocab_size, num_hiddens, ctx,
get_params, init_state, forward):
self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
self.params = get_params(vocab_size, num_hiddens, ctx)
self.init_state, self.forward_fn = init_state, forward
def __call__(self, X, state):
X = nd.one_hot(X.T, self.vocab_size)
return self.forward_fn(X, state, self.params)
def begin_state(self, batch_size, ctx):
return self.init_state(batch_size, self.num_hiddens, ctx)
Let’s do a sanity check whether inputs and outputs have the correct
dimensions, e.g. to ensure that the dimensionality of the hidden state
hasn’t changed.
.. code:: python
vocab_size, num_hiddens, ctx = len(vocab), 512, d2l.try_gpu()
model = RNNModelScratch(len(vocab), num_hiddens, ctx, get_params,
init_rnn_state, rnn)
state = model.begin_state(X.shape[0], ctx)
Y, new_state = model(X.as_in_context(ctx), state)
Y.shape, len(new_state), new_state[0].shape
.. parsed-literal::
:class: output
((10, 28), 1, (2, 512))
We can see that the output shape is (number steps :math:`\times` batch
size, vocabulary size), while the state shape remains the same, i.e.
(batch size, number of hidden units).
Prediction
----------
We first explain the predicting function so we can regularly check the
prediction during training. This function predicts the next
``num_predicts`` characters based on the ``prefix`` (a string containing
several characters). For the beginning of the sequence, we only update
the hidden state. After that we begin generating new characters and
emitting them.
.. code:: python
# Save to the d2l package.
def predict_ch8(prefix, num_predicts, model, vocab, ctx):
state = model.begin_state(batch_size=1, ctx=ctx)
outputs = [vocab[prefix[0]]]
get_input = lambda: nd.array([outputs[-1]], ctx=ctx).reshape((1, 1))
for y in prefix[1:]: # Warmup state with prefix
_, state = model(get_input(), state)
outputs.append(vocab[y])
for _ in range(num_predicts): # Predict num_predicts steps
Y, state = model(get_input(), state)
outputs.append(int(Y.argmax(axis=1).reshape(1).asscalar()))
return ''.join([vocab.idx_to_token[i] for i in outputs])
We test the ``predict_rnn`` function first. Given that we didn’t train
the network it will generate nonsensical predictions. We initialize it
with the sequence ``traveller`` and have it generate 10 additional
characters.
.. code:: python
predict_ch8('time traveller ', 10, model, vocab, ctx)
.. parsed-literal::
:class: output
'time traveller emmmmmmmm'
Gradient Clipping
-----------------
For a sequence of length :math:`T`, we compute the gradients over these
:math:`T` time steps in an iteration, which results in a chain of
matrix-products with length :math:`O(T)` during backpropagating. As
mentioned in :numref:`chapter_numerical_stability`, it might result in
numerical instability, e.g. the gradients may either explode or vanish,
when :math:`T` is large. Therefore RNN models often need extra help to
stabilize the training.
Recall that when solving an optimization problem, we take update steps
for the weights :math:`\mathbf{w}` in the general direction of the
negative gradient :math:`\mathbf{g}_t` on a minibatch, say
:math:`\mathbf{w} - \eta \cdot \mathbf{g}_t`. Let’s further assume that
the objective is well behaved, i.e. it is Lipschitz continuous with
constant :math:`L`, i.e.
.. math:: |l(\mathbf{w}) - l(\mathbf{w}')| \leq L \|\mathbf{w} - \mathbf{w}'\|.
In this case we can safely assume that if we update the weight vector by
:math:`\eta \cdot \mathbf{g}_t` we will not observe a change by more
than :math:`L \eta \|\mathbf{g}_t\|`. This is both a curse and a
blessing. A curse since it limits the speed with which we can make
progress, a blessing since it limits the extent to which things can go
wrong if we move in the wrong direction.
Sometimes the gradients can be quite large and the optimization
algorithm may fail to converge. We could address this by reducing the
learning rate :math:`\eta` or by some other higher order trick. But what
if we only rarely get large gradients? In this case such an approach may
appear entirely unwarranted. One alternative is to clip the gradients by
projecting them back to a ball of a given radius, say :math:`\theta` via
.. math:: \mathbf{g} \leftarrow \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right) \mathbf{g}.
By doing so we know that the gradient norm never exceeds :math:`\theta`
and that the updated gradient is entirely aligned with the original
direction :math:`\mathbf{g}`. It also has the desirable side-effect of
limiting the influence any given minibatch (and within it any given
sample) can exert on the weight vectors. This bestows a certain degree
of robustness to the model. Gradient clipping provides a quick fix to
the gradient exploding. While it doesn’t entire solve the problem, it is
one of the many techniques to alleviate it.
Below we define a function to clip the gradients of a model that is
either a ``RNNModelScratch`` instance or a Gluon model. Also note that
we compute the gradient norm over all parameters.
.. code:: python
# Save to the d2l package.
def grad_clipping(model, theta):
if isinstance(model, gluon.Block):
params = [p.data() for p in model.collect_params().values()]
else:
params = model.params
norm = math.sqrt(sum((p.grad ** 2).sum().asscalar() for p in params))
if norm > theta:
for param in params:
param.grad[:] *= theta / norm
Training
--------
Similar to :numref:`chapter_linear_scratch`, let’s first define the
function to train the model on one data epoch. It differs to the models
training from previous chapters in three places:
1. Different sampling methods for sequential data (independent sampling
and sequential partitioning) will result in differences in the
initialization of hidden states.
2. We clip the gradient before updating the model parameters. This
ensures that the model doesn’t diverge even when gradients blow up at
some point during the training process (effectively it reduces the
stepsize automatically).
3. We use perplexity to evaluate the model. This ensures that different
tests are comparable.
When the consecutive sampling is used, we initialize the hidden state at
the beginning of each epoch. Since the :math:`i`-th example in the next
mini-batch is adjacent to the current :math:`i`-th example, so we next
mini-batch can use the current hidden state directly, we only detach the
gradient so that we only compute the gradients within a mini-batch. When
using the random sampling, we need to re-initialize the hidden state for
each iteration since each example is sampled with a random position.
Same to the ``train_epoch_ch3`` function
(:numref:`chapter_linear_scratch`), we use generalized ``updater``,
which could be a Gluon trainer or a scratched implementation.
.. code:: python
# Save to the d2l package.
def train_epoch_ch8(model, train_iter, loss, updater, ctx, use_random_iter):
state, timer = None, d2l.Timer()
metric = d2l.Accumulator(2) # loss_sum, num_examples
for X, Y in train_iter:
if state is None or use_random_iter:
# Initialize state when either it's the first iteration or
# using random sampling.
state = model.begin_state(batch_size=X.shape[0], ctx=ctx)
else:
for s in state: s.detach()
y = Y.T.reshape((-1,))
X, y = X.as_in_context(ctx), y.as_in_context(ctx)
with autograd.record():
py, state = model(X, state)
l = loss(py, y).mean()
l.backward()
grad_clipping(model, 1)
updater(batch_size=1) # Since used mean already.
metric.add(l.asscalar() * y.size, y.size)
return math.exp(metric[0]/metric[1]), metric[1]/timer.stop()
The training function again supports either we implement the model from
scratch or using Gluon.
.. code:: python
# Save to the d2l package.
def train_ch8(model, train_iter, vocab, lr, num_epochs, ctx,
use_random_iter=False):
# Initialize
loss = gluon.loss.SoftmaxCrossEntropyLoss()
animator = d2l.Animator(xlabel='epoch', ylabel='perplexity',
legend=['train'], xlim=[1, num_epochs])
if isinstance(model, gluon.Block):
model.initialize(ctx=ctx, force_reinit=True, init=init.Normal(0.01))
trainer = gluon.Trainer(model.collect_params(), 'sgd', {'learning_rate': lr})
updater = lambda batch_size : trainer.step(batch_size)
else:
updater = lambda batch_size : d2l.sgd(model.params, lr, batch_size)
predict = lambda prefix: predict_ch8(prefix, 50, model, vocab, ctx)
# Train and check the progress.
for epoch in range(num_epochs):
ppl, speed = train_epoch_ch8(
model, train_iter, loss, updater, ctx, use_random_iter)
if epoch % 10 == 0:
print(predict('time traveller'))
animator.add(epoch+1, [ppl])
print('Perplexity %.1f, %d tokens/sec on %s' % (ppl, speed, ctx))
print(predict('time traveller'))
print(predict('traveller'))
Finally we can train a model. Since we only use 10,000 tokens in the
dataset, so here we need more data epochs to converge.
.. code:: python
num_epochs, lr = 500, 1
train_ch8(model, train_iter, vocab, lr, num_epochs, ctx)
.. parsed-literal::
:class: output
Perplexity 1.0, 45996 tokens/sec on gpu(0)
time traveller it s against reason said filby what reason said
traveller it s against reason said filby what reason said
.. figure:: output_rnn-scratch_2cce6c_27_1.svg
Then let’s check the results to use a random sampling iterator.
.. code:: python
train_ch8(model, train_iter, vocab, lr, num_epochs, ctx, use_random_iter=True)
.. parsed-literal::
:class: output
Perplexity 1.3, 45498 tokens/sec on gpu(0)
time traveller it s against reason said filby what reason said
traveller smiled round at us then still smiling faintly and
.. figure:: output_rnn-scratch_2cce6c_29_1.svg
In the following we will see how to improve significantly on the current
model and how to make it faster and easier to implement.
Summary
-------
- Sequence models need state initialization for training.
- Between sequential models you need to ensure to detach the gradient,
to ensure that the automatic differentiation does not propagate
effects beyond the current sample.
- A simple RNN language model consists of an encoder, an RNN model and
a decoder.
- Gradient clipping prevents gradient explosion (but it cannot fix
vanishing gradients).
- Perplexity calibrates model performance across variable sequence
length. It is the exponentiated average of the cross-entropy loss.
- Sequential partitioning typically leads to better models.
Exercises
---------
1. Show that one-hot encoding is equivalent to picking a different
embedding for each object.
2. Adjust the hyperparameters to improve the perplexity.
- How low can you go? Adjust embeddings, hidden units, learning
rate, etc.
- How well will it work on other books by H. G. Wells, e.g. `The War
of the Worlds `__.
3. Modify the predict function such as to use sampling rather than
picking the most likely next character.
- What happens?
- Bias the model towards more likely outputs, e.g. by sampling from
:math:`q(w_t|w_{t-1}, \ldots w_1) \propto p^\alpha(w_t|w_{t-1}, \ldots w_1)`
for :math:`\alpha > 1`.
4. Run the code in this section without clipping the gradient. What
happens?
5. Change adjacent sampling so that it does not separate hidden states
from the computational graph. Does the running time change? How about
the accuracy?
6. Replace the activation function used in this section with ReLU and
repeat the experiments in this section.
7. Prove that the perplexity is the inverse of the harmonic mean of the
conditional word probabilities.
Scan the QR Code to `Discuss `__
-----------------------------------------------------------------
|image0|
.. |image0| image:: ../img/qr_rnn-scratch.svg