In this video, we will learn how the LSTM and GRU architectures work and understand why they're built the way they are. Up to this point, we have been talking mostly about the simple recurrent neural network. In such a network, the dependence between the hidden units at neighboring time steps is given by a very simple formula: we just compute a non-linear function of a linear combination of the inputs. But, as you may remember from the beginning of the week, we can actually use something more sophisticated to compute the next hidden units from the previous ones, for example some MLP with more than one hidden layer as the primitive function. That is what we'll do now: we'll construct a more effective primitive function to compute the hidden units. You might also think about this as the construction of a new type of recurrent layer.

Let's start from the function we use in the simple recurrent neural network and construct the function for the LSTM step by step. On the slide, you can see the diagram of a simple recurrent layer. Here, the non-linearity and the summation are shown in circles, multiplications by the weight matrices are shown on the corresponding edges, and bias vectors are dropped from the picture for simplicity. When we do backpropagation through such a layer, gradients need to go through the non-linearity and through the multiplication by the recurrent weight matrix W. Both of these things can cause the vanishing gradient problem. The main idea here is to create a short way for the gradients without any non-linearities or multiplications. The authors of the LSTM proposed to do this by adding a new, separate way through the recurrent layer. So the LSTM layer has its own internal memory C, which other layers of the network don't have access to. In the LSTM layer, at each time step, we compute not only the vector of hidden units H but also a vector of memory cells C of the same dimension, and as a result we have two ways through such a layer: one between hidden units H_{t-1} and H_t, and the second one between memory cells C_{t-1} and C_t.

Now, let's understand what is going on inside the LSTM layer. In the simple recurrent neural network, we compute the hidden units as some non-linear function of a linear combination of the inputs. Here, we do the same, but we do not return this value as the output of the layer; we want to use the internal memory too. To work with the memory, the LSTM needs at least two controllers: an input gate and an output gate. These gates are vectors of the same dimension as the hidden units. To compute them, we use a similar formula: a non-linearity over a linear combination. We can even rewrite the formulas for the information vector g and the gates in one vector formula, in which the weight matrices V and W and the bias vector b are concatenations of the corresponding parameters. Suppose there are n inputs x and m hidden units h in the network; then what size do the matrices V and W and the vector b have? Yep, the matrix V is 3m by n, the matrix W is 3m by m, and the vector b contains 3m elements. For the gates, we use only the sigmoid non-linearity. This is important because we want the elements of the gates to take values from zero to one. In this case, the value one can be interpreted as an open gate and the value zero as a closed gate. If we multiply some information by a gate vector, we either get the same information, or zero, or something in between. As you've probably already guessed, the input gate controls what to store in the memory.
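To make the shapes concrete, here is a minimal numpy sketch of the computation described so far: the tanh-based information vector g plus the sigmoid input and output gates, with the parameters for all three stacked into a single matrix V of size 3m by n, a matrix W of size 3m by m, and a bias vector b with 3m elements. The variable names and random toy values are illustrative, not taken from the slide.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    n, m = 4, 8                      # n inputs x, m hidden units h
    rng = np.random.default_rng(0)

    V = rng.normal(size=(3 * m, n))  # multiplies the input x_t       -> 3m x n
    W = rng.normal(size=(3 * m, m))  # multiplies the previous h_{t-1} -> 3m x m
    b = np.zeros(3 * m)              # 3m bias elements

    x_t, h_prev = rng.normal(size=n), rng.normal(size=m)

    z = V @ x_t + W @ h_prev + b     # one linear combination gives g, i, o at once
    g = np.tanh(z[:m])               # information vector: what we could write
    i = sigmoid(z[m:2 * m])          # input gate: what to store in the memory
    o = sigmoid(z[2 * m:])           # output gate: what to read from the memory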
The information vector g is multiplied by the input gate and then added to the memory; the multiplication here is element-wise. The output gate controls what to read from the memory and return to the outer world: memory cells C are multiplied by the output gate and then returned as the new hidden units. Now we see that the LSTM has a pretty nice interpretation with internal memory and a suite of controllers, but how does this help with the vanishing gradient problem? As I already mentioned, now there is not only one way through the recurrent layer, and the important thing here is that we have at least one short way for the information and for the gradients: between memory cells C_{t-1} and C_t. There is no non-linearity or multiplication on this way, so if we calculate the Jacobian of C_t with respect to C_{t-1}, we see that it is equal to one. So there is no vanishing problem anymore.

Is this a perfect architecture? Unfortunately, it's not. We can only write something new into the memory at each time step, and we can't erase anything. What if we want to work with really long sequences? Memory cells C have finite capacity, so the values will become a mess after a lot of time steps. We need to be able to erase information from the memory sometimes. One more gate, which is called the forget gate, will help us with this. We compute it in the same manner as the previous two gates and apply it to the incoming memory cells before doing anything else with them. If the forget gate is closed, we erase the information from the memory. This version of the LSTM with three gates is the most standard nowadays.

The forget gate is very important in tasks with long sequences, but because of it, we now have a multiplication on the short way through the LSTM layer. If we compute the Jacobian of C_t with respect to C_{t-1}, it is now equal to f_t, not one, and the forget gate f_t takes values from zero to one. So it is usually less than one and may cause the vanishing gradient problem. To deal with this, proper initialization can be used: if the bias of the forget gate is initialized with high positive numbers, for example five, then the forget gate at the first iterations of training is almost equal to one. At the beginning, the LSTM doesn't forget and can find long-range dependencies in the data; later, it learns to forget if it's necessary.

To have some intuition about how an LSTM may behave in practice, let's look at the different extreme regimes in which it can work. On the slide, you can see a different picture of the LSTM layer. Here, the internal memory C is pictured inside the layer as a yellow circle and the gates are represented with green circles. The regime depends on the state of the gates: they are either open or closed. If only the input and forget gates are open, then the LSTM reads all the information from the inputs and stores it. In the opposite situation, when only the forget and output gates are open, the LSTM carries the information through time and releases it to the next layer. If both the input and output gates are closed, the LSTM either erases all the information or stores it, but in both cases it is not connected to the outer world, so it doesn't read or write anything. Now, I have a question for you: which combination of gate values makes the LSTM very similar to the simple recurrent neural network? Yeah, it is when the input and output gates are open and the forget gate is closed. In this case, the LSTM reads everything into the memory and returns the whole memory as the output, so the memory and the hidden units here are essentially the same entities.
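Putting the three gates together, here is a sketch of one full LSTM step in the formulation used in this video, where the hidden units are the output gate times the memory cells directly (many libraries also apply a tanh to C_t before the output gate; I follow the description above). The forget-gate bias set to 5 illustrates the initialization trick just mentioned; the function name, gate ordering, and toy values are my own assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, V, W, b):
        """One LSTM step; V is 4m x n, W is 4m x m, b has 4m elements."""
        m = h_prev.shape[0]
        z = V @ x_t + W @ h_prev + b
        g = np.tanh(z[:m])            # information vector
        i = sigmoid(z[m:2 * m])       # input gate: what to write
        o = sigmoid(z[2 * m:3 * m])   # output gate: what to read
        f = sigmoid(z[3 * m:])        # forget gate: what to keep in memory
        c_t = f * c_prev + i * g      # the short way: element-wise, no weight matrix
        h_t = o * c_t                 # hidden units are read from the memory
        return h_t, c_t

    n, m = 4, 8
    rng = np.random.default_rng(0)
    V = rng.normal(scale=0.1, size=(4 * m, n))
    W = rng.normal(scale=0.1, size=(4 * m, m))
    b = np.zeros(4 * m)
    b[3 * m:] = 5.0                   # forget-gate bias starts high: don't forget at first

    h, c = np.zeros(m), np.zeros(m)
    for x_t in rng.normal(size=(10, n)):   # run over a toy sequence of length 10
        h, c = lstm_step(x_t, h, c, V, W, b)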
Of course, all of these cases are just extreme cases; in practice, the model's behavior is somewhere in between. Because of all these different regimes, the LSTM can work with information more accurately. For example, when the simple recurrent neural network reads something from the data, it outputs this information at each time step and gradually forgets it over time. In the same situation, the LSTM can carry the information through time much longer and output it only at the particular time steps where it is needed. The LSTM has a lot of advantages compared with the simple recurrent neural network, but at the same time it has four times more parameters, because each gate and the information vector g has its own set of parameters V, W, and b. This makes the LSTM less efficient in terms of memory and time and also makes the GRU architecture more appealing. The GRU architecture is the LSTM's strongest competitor: it has only three times more parameters than the simple recurrent neural network, and in terms of quality it works pretty much the same as the LSTM.

The GRU layer doesn't contain an additional internal memory, but it does contain two gates, which are called the reset and update gates. These gates are computed in the same manner as the ones from the LSTM, so they are equal to the sigmoid function of a linear combination of the inputs; as a result, they take values from zero to one. On the slide, the computation of the gates is pictured as a separate step for simplicity of the picture. The reset gate controls which part of the hidden units from the previous time step we use as an input to the information vector g; it acts quite similarly to the input gate in the LSTM. The update gate controls the balance between storing the previous values of the hidden units and writing new information into the hidden units, so it works as a combination of the input and forget gates from the LSTM.

The situation with the vanishing gradient problem in the GRU is very similar to the one in the LSTM. Here, we have a short way through the layer with only one multiplication on it, by the update gate. This short way is actually an identity skip connection from H_{t-1} to H_t, which is additionally controlled by the update gate. Which initialization trick should be used to make the GRU stable to vanishing gradients? We should initialize the bias vector of the update gate with some high positive numbers. Then, at the beginning of training, the gradients go through this multiplication very easily and the network is able to find long-range dependencies in the data. But you shouldn't use too high numbers here, since if the update gate is always open, then the GRU layer doesn't pay much attention to the inputs x.

We have discussed two architectures, LSTM and GRU. How do we understand which one of them we should use in a particular task? There is no obvious answer that one of them is better than the other, but there is a rule of thumb: first, train the LSTM, since it has more parameters and can be a little bit more flexible; then train the GRU, and if it works the same or the quality difference is negligible, use the GRU; otherwise, return to the LSTM. Also, you can use a multi-layer recurrent neural network, so you can stack several recurrent layers as shown in the picture. In this case, for the last layer it's better to use the LSTM, since the GRU doesn't have a proper analogue of the output gate and cannot work with the output as accurately as the LSTM. For all the other layers, you can use either LSTM or GRU; it doesn't matter that much.
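For comparison, here is a GRU step written in the same style: the reset gate masks the previous hidden units before they enter the information vector g, and the update gate mixes the old hidden state with the new candidate. This is a sketch under the convention used in this video, where an update gate close to one means "keep the previous hidden units" (some write-ups swap the two terms); the names and toy values are illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(x_t, h_prev, V, W, b):
        """One GRU step; V is 3m x n, W is 3m x m, b has 3m elements."""
        m = h_prev.shape[0]
        zg = V[:2 * m] @ x_t + W[:2 * m] @ h_prev + b[:2 * m]
        u = sigmoid(zg[:m])           # update gate: keep old state vs. write new
        r = sigmoid(zg[m:])           # reset gate: which parts of h_prev feed g
        g = np.tanh(V[2 * m:] @ x_t + W[2 * m:] @ (r * h_prev) + b[2 * m:])
        return u * h_prev + (1.0 - u) * g   # identity skip connection, gated by u

    n, m = 4, 8
    rng = np.random.default_rng(0)
    V = rng.normal(scale=0.1, size=(3 * m, n))
    W = rng.normal(scale=0.1, size=(3 * m, m))
    b = np.zeros(3 * m)
    b[:m] = 5.0                       # update-gate bias starts high, but not too high

    h = np.zeros(m)
    for x_t in rng.normal(size=(10, n)):   # run over a toy sequence of length 10
        h = gru_step(x_t, h, V, W, b)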
Let's summarize what we have learnt in this video. We discussed two gated recurrent architectures: LSTM and GRU. These models are more resistant to the vanishing gradient problem because there is an additional short way for the gradients through them. In the next video, we will discuss how to use recurrent neural networks to solve different practical tasks.