In this video, I'm going to talk about the exploding and vanishing gradients problem, which is what makes it difficult to train recurrent neural networks. For many years, researchers in neural networks thought they would never be able to train these networks to model dependencies over long time periods. But at the end of this video, I'll describe four different ways in which that can now be done.

To understand why it's so difficult to train recurrent neural networks, we have to understand a very important difference between the forward and backward passes in a recurrent neural net. In the forward pass, we use squashing functions, like the logistic, to prevent the activity vectors from exploding. So, if you look at the picture on the right, each neuron is using a logistic unit, shown by that blue curve, and it can't output any value greater than one or less than zero. So that stops explosions.

The backward pass, however, is completely linear. Most people find this very surprising. If you double the error derivatives at the final layer of this net, all the error derivatives will double when you backpropagate. So, if you look at the red dots that I put on the blue curves, we'll suppose those are the activity levels of the neurons on the forward pass. And so, when you backpropagate, you're using the gradients of the blue curves at those red dots. The red lines are meant to show the tangents to the blue curves at the red dots. And, once you've finished the forward pass, the slope of that tangent is fixed. You now backpropagate, and the backpropagation is like going forwards through a linear system in which the slope of the non-linearity has been fixed. Of course, each time you backpropagate, the slopes will be different, because they were determined by the forward pass. But during the backpropagation, it's a linear system, and so it suffers from a problem of linear systems, which is that when you iterate them, they tend to either explode or die.

So when we backpropagate through many layers, if the weights are small, the gradients will shrink and become exponentially small. And that means that when you backpropagate through time, the gradients at steps many time steps earlier than where the targets arrive will be tiny. Similarly, if the weights are big, the gradients will explode. And that means that when you backpropagate through time, the gradients will get huge and wipe out all your knowledge.

In a feed-forward neural net, unless it's very deep, these problems aren't nearly as bad, because we typically only have a few hidden layers. But as soon as we have a recurrent neural network trained on a long sequence, for example 100 time steps, then if the gradients are growing as we backpropagate, we'll get whatever that growth rate is to the power of 100, and if they're dying, we'll get whatever that decay rate is to the power of 100. So they'll either explode or vanish. We might be able to initialize the weights very carefully to avoid this, and more recent work shows that careful initialization of the weights does indeed make things work much better. But even with good initial weights, it's hard to detect the dependency of the current output on an input from many time steps ago. So it's hard to make the output depend on things that happened a long time ago. RNNs have difficulty dealing with long-range dependencies.
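To see that either-explode-or-die behaviour numerically, here's a minimal sketch (not from the lecture) of backpropagation through time in a toy recurrent net. It uses tanh as the squashing function, and the hidden size, weight scales, and sequence length are arbitrary choices. The point is only that the backward pass multiplies the gradient, once per time step, by a Jacobian whose slopes were fixed on the forward pass, so its norm shrinks or grows roughly exponentially with the number of steps.

```python
# Sketch: gradient norms in backpropagation through time for a toy tanh RNN.
# The sizes and weight scales below are illustrative, not from the lecture.
import numpy as np

rng = np.random.default_rng(0)
hidden_size, T = 50, 100

for weight_scale in (0.5, 1.0, 1.5):            # small, borderline, large recurrent weights
    W = rng.standard_normal((hidden_size, hidden_size)) * weight_scale / np.sqrt(hidden_size)
    h = rng.standard_normal(hidden_size)

    # Forward pass: the squashing non-linearity keeps the activities bounded.
    states = []
    for t in range(T):
        h = np.tanh(W @ h)
        states.append(h)

    # Backward pass: purely linear. Each step multiplies the gradient by
    # W^T diag(1 - h_t^2), whose slopes were fixed on the forward pass.
    grad = np.ones(hidden_size)                  # pretend dE/dh_T is all ones
    for h_t in reversed(states):
        grad = W.T @ (grad * (1.0 - h_t ** 2))

    print(f"weight scale {weight_scale}: |dE/dh_0| ~ {np.linalg.norm(grad):.3e}")
```

With the small weight scale the printed gradient norm collapses towards zero after 100 steps, and with the large one it typically blows up; a feed-forward net with only a few hidden layers multiplies by only a few such Jacobians, which is why it doesn't suffer nearly as badly.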
Here's an example of exploding and vanishing gradients for a system that's trying to learn attractors. Suppose we try to train a recurrent neural network so that, whatever state we start in, it ends up in one of these two attractor states. So we're going to learn a blue basin of attraction and a pink basin of attraction. If we start anywhere within the blue basin of attraction, we will end up at the same point. What that means is that small differences in our initial state make no difference to where we end up. So the derivative of the final state with respect to changes in the initial state is zero. That's vanishing gradients: when we backpropagate through the dynamics of this system, we will discover there's no gradient back to where we started, and the same holds for the pink basin of attraction. If, however, we start very close to the boundary between the two attractors, then a tiny difference in where we start, putting us on the other side of the watershed, makes a huge difference to where we end up. That's the exploding gradient problem. And so whenever you're trying to use a recurrent neural network to learn attractors like this, you're bound to get vanishing or exploding gradients.

It turns out there are at least four effective ways to learn a recurrent neural network. The first is a method called Long Short-Term Memory, and I'll talk about that more in this lecture. The idea is that we actually change the architecture of the neural network to make it good at remembering things. The second method is to use a much better optimizer that can deal with very small gradients. I'll talk about that in the next lecture. The real problem in optimization is to detect small gradients that have an even smaller curvature; Hessian-free optimization is tailored to that problem and is good at doing it. The third method really kind of evades the problem. What we do is carefully initialize the input-to-hidden weights, very carefully initialize the hidden-to-hidden weights, and also the feedback weights from the outputs to the hidden units. The idea of this careful initialization is to make sure that the hidden state has a huge reservoir of weakly coupled oscillators. So if you hit it with an input sequence, it will reverberate for a long time, and those reverberations are remembering what happened in the input sequence. You then try to couple those reverberations to the output you want, and so the only thing that learns in an Echo State Network is the connections between the hidden units and the outputs. If the output units are linear, that's very easy to train. So this hasn't really learned the recurrent weights: it uses fixed random recurrent weights, but carefully chosen ones, and then just learns the hidden-to-output connections (there's a small sketch of this below). And the final method is to use momentum, but to use momentum with the kind of initialization that's used for Echo State Networks, and that makes them work even better. So it was very clever to find out how to initialize these recurrent networks so they'll have interesting dynamics, but they work even better if you can now modify those dynamics slightly in the direction that helps with the task at hand.
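Here's a minimal sketch of the Echo State idea, assuming a toy task (predicting the next value of a sine wave) and illustrative sizes and scales that are not from the lecture. The input and recurrent weights are fixed random matrices, with the recurrent matrix rescaled to a spectral radius just below one so the reservoir reverberates without exploding, and only the linear hidden-to-output readout is learned, here in closed form by ridge regression.

```python
# Sketch: an Echo State Network with a fixed random reservoir and a learned
# linear readout. Task, sizes, and scales are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_hidden, spectral_radius = 200, 0.9

# Fixed random weights; the recurrent matrix is rescaled so its largest
# eigenvalue magnitude is just below 1.
W_in = rng.uniform(-0.5, 0.5, size=n_hidden)
W = rng.standard_normal((n_hidden, n_hidden))
W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))

# Drive the reservoir with the input sequence and collect its states.
u = np.sin(0.2 * np.arange(1001))               # input signal
target = u[1:]                                  # predict the next sample
h = np.zeros(n_hidden)
states = []
for t in range(1000):
    h = np.tanh(W @ h + W_in * u[t])
    states.append(h.copy())
X = np.array(states)                            # shape (1000, n_hidden)

# Linear readout trained in closed form (ridge regression): the only
# learned parameters in the whole network.
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_hidden), X.T @ target)

pred = X @ W_out
print("train MSE:", np.mean((pred - target) ** 2))
```

Because the readout is linear, training reduces to solving one linear system; nothing is backpropagated through time at all, which is exactly how this approach evades the exploding and vanishing gradient problem.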