In this video, we will learn how LSTM and GRU architectures work and understand why they are built the way they are.

Up to this point, we have been talking mostly about the simple recurrent neural network. In such a network, the dependence between the hidden units at neighboring time steps is given by a very simple formula: we just compute a non-linear function of a linear combination of the inputs. But, as you may remember from the beginning of the week, we can actually use something more sophisticated to compute the next hidden units from the previous ones, for example an MLP with more than one hidden layer, where a layer is a primitive function. That is what we'll do now: we'll construct a more effective primitive function for computing the hidden units. You might also think about this as the construction of a new type of recurrent layer.

Let's start from the function used in the simple recurrent neural network and construct the function for the LSTM step by step. On the slide, you can see the diagram of a simple recurrent layer. Here, the non-linearity and the summation are shown in circles, multiplications by the weight matrices are shown on the corresponding edges, and bias vectors are dropped from the picture for simplicity. When we do backpropagation through such a layer, gradients need to go through the non-linearity and through the multiplication by the recurrent weight matrix W. Both of these things can cause the vanishing gradient problem. The main idea here is to create a short way for the gradients, without any non-linearities or multiplications on it. The authors of the LSTM propose to do this by adding a new, separate way through the recurrent layer. So the LSTM layer has its own internal memory C, which other layers of the network don't have access to. In the LSTM layer, at each time step, we compute not only the vector of hidden units h but also a vector of memory cells c of the same dimension. As a result, we have two ways through such a layer: one between the hidden units h_{t-1} and h_t, and the second one between the memory cells c_{t-1} and c_t.
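Before going inside the LSTM, here is a minimal NumPy sketch of the simple recurrent layer we are starting from, as a reference point; the function name and the toy sizes are only illustrative, not from the lecture slides.

```python
import numpy as np

def simple_rnn_step(x_t, h_prev, V, W, b):
    """One step of a simple recurrent layer: h_t = tanh(V @ x_t + W @ h_prev + b).
    Backprop through this step goes through both the tanh and the repeated
    multiplication by W, which is what can make gradients vanish."""
    return np.tanh(V @ x_t + W @ h_prev + b)

# toy usage: n = 4 inputs, m = 3 hidden units
n, m = 4, 3
rng = np.random.default_rng(0)
V, W, b = rng.normal(size=(m, n)), rng.normal(size=(m, m)), np.zeros(m)
h = np.zeros(m)
for x in rng.normal(size=(5, n)):   # a sequence of 5 input vectors
    h = simple_rnn_step(x, h, V, W, b)
print(h.shape)  # (3,)
```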
Now, let's understand what is going on inside the LSTM layer. In the simple recurrent neural network, we compute the hidden units as some non-linear function of a linear combination of the inputs. Here, we do the same, but we do not return this value as the output of the layer: we want to use the internal memory too. To work with the memory, the LSTM needs at least two controllers: an input gate and an output gate. These gates are vectors of the same dimension as the hidden units. To compute them, we use a similar formula: a non-linearity over a linear combination. We can even rewrite the formulas for the information vector g and the gates as one vector formula, in which the weight matrices V and W and the bias vector b are concatenations of the corresponding parameters.

Suppose there are n inputs x and m hidden units h in the network. Then what sizes do the matrices V and W and the vector b have? Yep: the matrix V is 3m by n, the matrix W is 3m by m, and the vector b contains 3m elements. For the gates, we use only the sigmoid non-linearity. This is important because we want the elements of the gates to take values from zero to one. In this case, the value one can be interpreted as an open gate and the value zero as a closed gate. If we multiply some information by a gate vector, we either get the same information, or zero, or something in between.

As you've probably already guessed, the input gate controls what to store in the memory: the information vector g is multiplied by the input gate and then added to the memory. Multiplication here is element-wise. The output gate controls what to read from the memory and return to the outer world: the memory cells c are multiplied by the output gate and then returned as the new hidden units.
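As a rough illustration of this two-gate version (before the forget gate that is introduced next), here is a NumPy sketch of one step; the stacked matrix shapes match the 3m-by-n and 3m-by-m sizes mentioned above, and all names and toy sizes are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_no_forget(x_t, h_prev, c_prev, V, W, b, m):
    """One step of the two-gate LSTM described so far.
    V has shape (3m, n), W has shape (3m, m), b has 3m elements:
    one block of rows for g, one for the input gate i, one for the output gate o."""
    z = V @ x_t + W @ h_prev + b      # all linear combinations at once
    g = np.tanh(z[:m])                # information vector
    i = sigmoid(z[m:2*m])             # input gate: what to write
    o = sigmoid(z[2*m:])              # output gate: what to read
    c_t = c_prev + i * g              # add gated information to the memory
    h_t = o * c_t                     # read gated memory as the new hidden units
    return h_t, c_t

# toy usage with n = 4 inputs and m = 3 hidden units
n, m = 4, 3
rng = np.random.default_rng(0)
V, W, b = rng.normal(size=(3*m, n)), rng.normal(size=(3*m, m)), np.zeros(3*m)
h, c = np.zeros(m), np.zeros(m)
h, c = lstm_step_no_forget(rng.normal(size=n), h, c, V, W, b, m)
```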
Now we see that the LSTM has a pretty nice interpretation, with an internal memory and controllers for it, but how does this help with the vanishing gradient problem? As I already mentioned, there is now more than one way through the recurrent layer, and the important thing is that we have at least one short way for the information and for the gradients, between the memory cells c_{t-1} and c_t. There is no non-linearity or multiplication on this way, so if we calculate the Jacobian of c_t with respect to c_{t-1}, we see that it is equal to the identity. So there is no vanishing gradient problem anymore.

Is this an ideal architecture? Unfortunately, it's not. We can only write something new into the memory at each time step; we can't erase anything. What if we want to work with really long sequences? The memory cells c have finite capacity, so their values will become a mess after a lot of time steps. We need to be able to erase information from the memory sometimes. One more gate, called the forget gate, will help us with this. We compute it in the same manner as the previous two gates and apply it to the incoming memory cells before doing anything else with them. If the forget gate is closed, we erase the information from the memory. This version of the LSTM with three gates is the most standard one nowadays.

The forget gate is very important in tasks with long sequences, but because of it, we now have a multiplication on the short way through the LSTM layer. If we compute the Jacobian of c_t with respect to c_{t-1}, it is now equal to f_t, not one, and the forget gate f_t takes values from zero to one. So it is usually less than one and may cause the vanishing gradient problem. To deal with this, a proper initialization can be used. If the bias of the forget gate is initialized with high positive numbers, for example five, then the forget gate during the first iterations of training is almost equal to one. At the beginning, the LSTM doesn't forget anything and therefore can find long-range dependencies in the data. Later, it learns to forget when it is necessary.
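The following NumPy sketch shows the three-gate version together with the forget-gate bias trick. It follows the lecture's formulation (the output gate is applied directly to c_t, without the extra tanh on the memory that some common implementations add), and the concrete numbers are only illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, V, W, b, m):
    """Three-gate LSTM step. With the forget gate, V is (4m, n), W is (4m, m),
    and b has 4m elements."""
    z = V @ x_t + W @ h_prev + b
    g = np.tanh(z[:m])          # information vector
    i = sigmoid(z[m:2*m])       # input gate
    o = sigmoid(z[2*m:3*m])     # output gate
    f = sigmoid(z[3*m:])        # forget gate
    c_t = f * c_prev + i * g    # the Jacobian of c_t w.r.t. c_{t-1} is now diag(f), not the identity
    h_t = o * c_t
    return h_t, c_t

# initialization trick: a large positive forget-gate bias keeps f close to one early on
n, m = 4, 3
rng = np.random.default_rng(0)
V, W = rng.normal(size=(4*m, n)) * 0.1, rng.normal(size=(4*m, m)) * 0.1
b = np.zeros(4*m)
b[3*m:] = 5.0      # sigmoid(5) is about 0.993, so the memory is preserved at first
h, c = np.zeros(m), np.zeros(m)
h, c = lstm_step(rng.normal(size=n), h, c, V, W, b, m)
```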
To get some intuition about how the LSTM may behave in practice, let's look at the different extreme regimes in which it can work. On the slide, you can see a slightly different picture of the LSTM layer. Here, the internal memory C is pictured inside the layer as a yellow circle and the gates are represented with green circles. The regime depends on the state of the gates: each one is either open or closed. If only the input and forget gates are open, then the LSTM reads all the information from the inputs and stores it. In the opposite situation, when only the forget and output gates are open, the LSTM carries the information through time and releases it to the next layer. If both the input and output gates are closed, the LSTM either erases all the information or stores it, but in both cases nothing is passed to the outer world, so it doesn't read or write anything.

Now, I have a question for you: which combination of gate values makes the LSTM very similar to the simple recurrent neural network? Yeah, it is when the input and output gates are open and the forget gate is closed. In this case, the LSTM reads everything into the memory and returns the whole memory as the output, so the memory and the hidden units are essentially the same entities. Of course, all of these cases are just extreme cases; in practice, the model's behavior is somewhere in between. Because of all these different regimes, the LSTM can work with the information more accurately. For example, when the simple recurrent neural network reads something from the data, it outputs this information at each time step and gradually forgets it over time. In the same situation, the LSTM can carry the information for a much longer time and output it only at particular time steps.

The LSTM has a lot of advantages compared with the simple recurrent neural network but, at the same time, it has four times more parameters, because each gate and the information vector g has its own set of parameters V, W, and b. This makes the LSTM less efficient in terms of memory and time, and it also makes the GRU architecture more attractive. The GRU architecture is the LSTM's strongest rival. It has only three times more parameters compared to the simple recurrent neural network, and in terms of quality it works pretty much the same as the LSTM.
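As a quick back-of-the-envelope check of the "four times" and "three times" claims, here is a small, hypothetical helper that counts parameters per recurrent layer; it assumes one V, W, b block per gate plus one for the information vector g, which matches the sizes discussed above.

```python
def rnn_param_counts(n, m):
    """Rough parameter counts for one recurrent layer with n inputs and m hidden units:
    each block contributes an m-by-n matrix V, an m-by-m matrix W, and an m-vector b."""
    per_block = m * n + m * m + m
    return {
        "simple RNN": 1 * per_block,   # just the information vector
        "GRU":        3 * per_block,   # g + reset gate + update gate
        "LSTM":       4 * per_block,   # g + input, output, and forget gates
    }

print(rnn_param_counts(n=100, m=128))
```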
The GRU layer doesn't contain an additional internal memory, but it does contain two gates, which are called the reset and update gates. These gates are computed in the same manner as the ones in the LSTM, so they are equal to the sigmoid function of a linear combination of the inputs. As a result, they can take values from zero to one. On the slide, the diagram for computing the gates is pictured as a separate step for simplicity of the picture.

The reset gate controls which part of the hidden units from the previous time step we use as an input to the information vector g. It acts quite similarly to the input gate in the LSTM. The update gate controls the balance between keeping the previous values of the hidden units and writing the new information into the hidden units, so it works as a combination of the input and forget gates from the LSTM. The situation with the vanishing gradient problem in the GRU is very similar to the one in the LSTM. Here, we have a short way through the layer with only one multiplication, by the update gate, on it. This short way is actually an identity skip connection from h_{t-1} to h_t which is additionally controlled by the update gate. Which initialization trick should be used to make the GRU stable against vanishing gradients? We should initialize the bias vector of the update gate with some high positive numbers. Then, at the beginning of training, the gradients go through this multiplication easily and the network is capable of finding long-range dependencies in the data. But you shouldn't use numbers that are too high, since if the update gate is fully open, the GRU layer doesn't pay much attention to the inputs x.
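Here is a minimal NumPy sketch of one GRU step in the convention described here, where an open update gate keeps the previous hidden state through the identity skip connection (standard libraries may define the update gate with the opposite sign); all names, sizes, and initial values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Vg, Wg, bg, Vr, Wr, br, Vu, Wu, bu):
    """One GRU step: an open update gate u keeps the previous hidden state,
    a closed one writes the new information vector g instead."""
    r = sigmoid(Vr @ x_t + Wr @ h_prev + br)        # reset gate: which parts of h_prev feed g
    u = sigmoid(Vu @ x_t + Wu @ h_prev + bu)        # update gate: keep old vs. write new
    g = np.tanh(Vg @ x_t + Wg @ (r * h_prev) + bg)  # candidate information vector
    return u * h_prev + (1.0 - u) * g

n, m = 4, 3
rng = np.random.default_rng(0)
params = [rng.normal(size=s) * 0.1 for s in [(m, n), (m, m), m] * 3]
# initialization trick: a positive update-gate bias keeps the identity path open early on
params[-1] = np.full(m, 2.0)
h = np.zeros(m)
h = gru_step(rng.normal(size=n), h, *params)
```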
We have discussed two architectures, LSTM and GRU. How do we decide which of them to use in a particular task? There is no obvious answer here that one of them is better than the other, but there is a rule of thumb. First, train the LSTM, since it has more parameters and can be a little bit more flexible; then train the GRU, and if it works the same or the quality difference is negligible, use the GRU. Otherwise, return to the LSTM. Also, you can use a multi-layer recurrent neural network, so you can stack several recurrent layers as shown in the picture. In this case, for the last layer it's better to use the LSTM, since the GRU doesn't have a proper analogue of the output gate and cannot work with the output as accurately as the LSTM. For all the other layers, you can use either LSTM or GRU; it doesn't matter that much.

Let's summarize what we have learnt in this video. We discussed two gated recurrent architectures: LSTM and GRU. These models are more resistant to the vanishing gradient problem because there is an additional short way for the gradients through them. In the next video, we will discuss how to use recurrent neural networks to solve different practical tasks.