[MUSIC] In this video, we'll discuss how to combat exploding and vanishing gradients in practice. Let's start with the exploding gradient problem. In the previous video, we learned that this problem occurs when the gradient norm becomes very large, possibly even NaN, which makes training of the recurrent neural network unstable. The resulting instability is actually very easy to detect. On the slide, you can see the learning curve of some neural network. The training iterations are on the x axis, and the loss on the training data is on the y axis. This particular neural network doesn't suffer from the exploding gradient problem. Therefore, the loss decreases with the number of iterations, and the training is stable. But sometimes you can see spikes in the learning curve, like this one; this is a result of the exploding gradient problem. The gradient explodes, and you make a long step in the parameter space. As a result, you may end up with a model with quite random weights and a high training loss. In the worst case, the gradient may even become not a number, and you may end up with NaNs in the weights of the neural network. The most common way to combat this problem is gradient clipping. This technique is very simple, but still very effective. If the network suffers from the exploding gradient problem, we compute the norm of the gradient of the loss with respect to all the parameters of the network, and if this norm exceeds some threshold, we simply rescale the gradient so that its norm equals the threshold. By doing this, we don't change the direction of the gradient, we only change its length. Actually, we can clip not the norm of the whole gradient vector, but just the norm of the part which causes the problem. Do you remember which part it is? Yeah, it is the Jacobian matrix of the hidden units at one timestep with respect to the hidden units at the previous timestep. So it is enough to clip just the value of the Jacobian at each timestep. Okay, and how do we choose the threshold for gradient clipping?
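The clipping step described above can be sketched in a few lines. This is a minimal NumPy illustration; the threshold value and the example gradient are made up, not from the lecture:

```python
import numpy as np

def clip_gradient(grad, threshold):
    """Rescale the gradient if its norm exceeds the threshold.

    The direction of the gradient is preserved; only its length changes.
    """
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])         # gradient with norm 5
clipped = clip_gradient(g, 1.0)  # rescaled so its norm equals the threshold
```

A gradient whose norm is already below the threshold passes through unchanged, which is why clipping only affects the rare, unusually large steps.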
We can choose it manually: we start with a large threshold and decrease it until the network no longer suffers from the exploding gradient problem. Alternatively, we can look at the norm of the gradient over a sufficient number of training iterations, and choose the threshold in such a way that we clip only the unusually large values. There is another interesting technique which can help us overcome the exploding gradient problem. It is not designed specifically for this purpose, but it can still be helpful. As you remember, when we do backpropagation through time, we need to make a forward pass through the entire sequence to compute the loss, and then a backward pass through the entire sequence to compute the gradient. If our training sequences are quite long, this is a very computationally expensive procedure, and additionally, we may have the exploding gradient problem. So let's run the forward and backward passes through chunks of the sequence instead of the whole sequence. In this case, we first make forward and backward passes through the first chunk and store the last hidden state. Then we go to the next chunk and start the forward pass from the hidden state we stored. We make forward and backward passes through the second chunk, store the last hidden state, and move on to the next chunk, and so on, until we reach the end of the sequence. So we carry hidden states forward in time forever, but only backpropagate for some smaller number of steps. This algorithm is called truncated backpropagation through time, and it is much faster than the usual backpropagation through time. It also doesn't suffer from the exploding gradient problem that much, since we don't take into account the contributions to the gradient from faraway steps. But of course, these advantages do not come without a price. Dependencies that are longer than the chunk size don't affect the training, so it's much more difficult to learn long-range dependencies with this algorithm.
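The chunked loop above can be sketched as follows. This is a toy NumPy tanh recurrence with made-up shapes, and the backward pass is only indicated by a comment; the point is the structure: the hidden state crosses chunk boundaries, while gradients would not:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inp = 4, 3
W = rng.normal(scale=0.1, size=(hidden, hidden))  # recurrent weights
U = rng.normal(scale=0.1, size=(hidden, inp))     # input weights
seq = rng.normal(size=(20, inp))                  # toy sequence of 20 timesteps
chunk_size = 5

h = np.zeros(hidden)        # hidden state carried forward across chunks
num_backward_passes = 0
for start in range(0, len(seq), chunk_size):
    chunk = seq[start:start + chunk_size]
    # Forward pass through this chunk only, starting from the stored state.
    for x in chunk:
        h = np.tanh(W @ h + U @ x)
    # A backward pass would run here, but only through this chunk:
    # gradients never flow across the chunk boundary, which both
    # speeds up training and limits how far the gradient can grow.
    num_backward_passes += 1
```

With a sequence of 20 steps and a chunk size of 5, we get four short backward passes instead of one pass through all 20 steps.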
Now let's speak about the vanishing gradient problem. From the previous video, we know that this problem occurs when the contributions to the gradient from faraway steps become small, which makes the learning of long-range dependencies very difficult. This problem is more complicated than the exploding gradient problem: it is difficult to detect, and there is no single simple way to overcome it. Let's start with detection. In the case of the exploding gradient problem, we had a clear indication that it occurred: we saw spikes in the learning curve and in the curve of the gradient norm. But in the case of the vanishing gradient problem, the learning curve and the norm of the gradient look quite okay. I mean, from the learning curve, you may only see that the loss of the network is not that good, and it's not clear whether this is due to the vanishing gradient problem or because the task itself is difficult. And if you look, for example, at the gradient of the most recent loss term with respect to faraway hidden units, you may see that the norm of this gradient is small. But this could be because there are no long-range dependencies in the data. So we can be sure that there is a vanishing gradient problem in the network only when we overcome it and see that the network works better. There are a lot of different techniques to deal with vanishing gradients. The most common approach is to use specially designed recurrent architectures, such as the long short-term memory, or LSTM, and the gated recurrent unit, or GRU. These architectures are very important, so we'll speak about them in a separate video later this week; for now, we'll briefly discuss some other ideas. As you already know, the Jacobian matrix, which may cause the problem of vanishing and exploding gradients, depends on the choice of the activation function and the values of the recurrent weights W.
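To see how this dependence plays out, here is a toy NumPy simulation of the backward pass through a tanh recurrence: the gradient is repeatedly multiplied by the Jacobian, and with small recurrent weights its norm shrinks toward zero. The weight scale, the number of steps, and the random pre-activations are all illustrative assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W = rng.normal(scale=0.1, size=(n, n))  # small recurrent weights (hypothetical)
grad = np.ones(n)                        # gradient arriving at the last timestep
norms = []
for t in range(30):
    # One backward step: multiply by the Jacobian of the tanh recurrence,
    # W^T diag(1 - h^2), using random hidden activations for illustration.
    h = np.tanh(rng.normal(size=n))
    grad = W.T @ (grad * (1 - h ** 2))
    norms.append(np.linalg.norm(grad))
# norms[-1] is many orders of magnitude below norms[0]: the contribution
# from 30 steps back has effectively vanished.
```

The same loop with large recurrent weights would show the opposite behavior, the exploding gradient.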
So if we want to overcome the vanishing gradient problem, we should try to use the rectified linear unit activation function, which is much more resistant to this problem. And what about the recurrent weight matrix W? From linear algebra, you may remember that an orthogonal matrix is a square matrix such that its transpose is equal to its inverse. Orthogonal matrices have many interesting properties, but the most important one for us is that all the eigenvalues of an orthogonal matrix have absolute value 1. This means that no matter how many times we perform repeated matrix multiplication, the resulting matrix doesn't explode or vanish. Therefore, if we initialize the recurrent weight matrix W with an orthogonal matrix, the second part of the Jacobian doesn't cause the vanishing gradient problem, at least in the first iterations of training, and the network has a chance to find long-range dependencies in the data. There are some approaches that utilize the properties of orthogonal matrices not just for a proper initialization, but also for the parameterization of the weights during the whole training process, but these methods are out of the scope of this course. The last idea that we'll discuss in this video is the idea of using skip connections. In a recurrent neural network, we can't carry the contributions to the gradient through a lot of timesteps, because at each step we need to multiply them by the Jacobian matrix, and as a result, they vanish. Let's add shortcuts between hidden states that are separated by more than one timestep. These shortcuts are usual connections with their own parameter matrices. By using them, we create much shorter paths between faraway timesteps in the network. So when we backpropagate the gradients along these short paths, they vanish more slowly, and we can learn longer dependencies with such a network. The concept of skip connections is not exclusive to recurrent neural networks.
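Going back to the orthogonal initialization mentioned above, one common NumPy recipe builds the matrix from a QR decomposition of a random Gaussian matrix. The matrix size and the sign fix are illustrative choices, not something prescribed by the lecture:

```python
import numpy as np

def orthogonal(n, rng):
    """Draw a random n x n orthogonal matrix via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    # Flipping column signs by the signs of diag(r) keeps q orthogonal
    # and makes the resulting distribution uniform over orthogonal matrices.
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
W = orthogonal(64, rng)
# Every eigenvalue of W has absolute value 1, so repeated multiplication
# neither explodes nor vanishes:
P = np.linalg.matrix_power(W, 100)
```

Even after 100 multiplications, P is still orthogonal (up to floating-point error), which is exactly the property that keeps the Jacobian product well-behaved early in training.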
You already saw a similar idea in one of the architectures in the computer vision part of this course. Which one? Yeah, it was the residual network, which contains identity shortcut connections in each layer. We can, of course, use such shortcuts in recurrent neural networks as well, just as identity shortcuts are used in deep architectures for computer vision tasks. Okay, let's summarize what we've learned in this video. The exploding gradient problem is very easy to detect, whereas it's not clear how to detect the vanishing gradient problem. Gradient clipping is a simple method to combat exploding gradients, and truncated backpropagation through time can also help with this, in addition to accelerating training. To overcome the vanishing gradient problem, we can use several methods, including a careful choice of the activation function, proper initialization of the recurrent weights, and modification of the network with additional skip connections. In the next video, we will discuss more advanced architectures, LSTM and GRU, which are the most popular recurrent architectures nowadays. [MUSIC]