In this video, we will speak about the simple Recurrent Neural Network and how to use backpropagation to train it.

In the previous video, we spoke about the recurrent architecture. In this architecture, we work with sequences element by element. Here, the x's are the elements of the input sequence, and at each time step we have some output, or a prediction, ŷ. We can use this architecture for a lot of different tasks. For example, if we want to use it for language modeling, then the x's should be word embeddings, and the output of our model at each time step is a probability distribution over the vocabulary, which says which word is more probable to be the next one in our sequence. Of course, we can generate a word from this probability distribution and give it to our model as the input at the next time step, so we can generate text. We can also use a recurrent architecture to solve a POS tagging task. In this task, we want to assign each word in the text a part-of-speech tag. Here we have the same inputs, word embeddings, but the output is different now: at each time step, the output is a probability distribution over the set of all possible POS tags, which says which tag is more probable for the current word.

Now we see that we can use this architecture for different tasks, and here, in the middle of this architecture, we see this box MLP, which can be quite different inside. Let's use the simplest MLP that exists: an MLP with one layer of hidden neurons. Then we have a vector of hidden neurons in the hidden layer of our network, and we can use these hidden neurons both to calculate the output of our network, the predictions ŷ, and to calculate the hidden neurons at the next time step. On the slide, you can see the formulas for the forward pass in our network, and they are pretty similar to the ones you already saw in the course: again, you have some nonlinear functions f applied to linear combinations of the inputs. But here you need to remember that all the parameters are shared between time steps, so we use the same weight matrices and the same bias vectors at each time step. This type of visualization of a recurrent neural network is usually called the unfolded version.
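The formulas from the slide are not reproduced in this transcript, so here is a minimal NumPy sketch of the forward pass under one common choice of notation (V for input-to-hidden weights, W for recurrent weights, U for hidden-to-output weights, b and c for bias vectors; these names are assumptions, and the slide may use different symbols):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs, V, W, U, b, c):
    """Forward pass of a simple (Elman) RNN.

    xs: input vectors (e.g. word embeddings), one per time step.
    V, W, U, b, c: parameters, shared across all time steps.
    """
    h = np.zeros(W.shape[0])             # initial hidden state h_0
    hs, ys = [], []
    for x in xs:
        h = np.tanh(V @ x + W @ h + b)   # hidden state: current input plus previous hidden state
        y = softmax(U @ h + c)           # e.g. distribution over the vocabulary or the POS tags
        hs.append(h)
        ys.append(y)
    return hs, ys
```

The same V, W, U, b, and c are used at every iteration of the loop, which is exactly the parameter sharing mentioned above.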
There is also another version of the notation for a recurrent neural network, a more compact one. Actually, this compact visualization is more general because it doesn't depend on the length of the input sequence, while the unfolded version actually depends on it. But in the compact visualization we have this recurrent connection, which forms a loop in our scheme, and it's actually not clear how to train a network with a loop. That's why we will use the unfolded visualization when we speak about training.

Now, let's speak about training a Recurrent Neural Network. Here we see a Recurrent Neural Network in the unfolded form. To train it, we actually need some data. Let's say that we have a training dataset, so for some training sequence we know all the true labels y, and we make some predictions ŷ. To understand how wrong we are, we use some loss function L, but we use this loss function separately at each time step, and then we sum up all of these losses to get the total loss.

Now we have our usual neural network, we have data, we have a loss; what should we do? We need to backpropagate. As usual in backpropagation, we need to make a forward pass and a backward pass. In the forward pass, we need to calculate all the elements in our neural network, so we calculate the hidden units, then we make our predictions, and we calculate the loss. In the backward pass, we need to calculate the gradient of our loss function with respect to all the parameters, so with respect to the weight matrices and the bias vectors. In a usual feed-forward neural network, we backpropagate through the layers, so in the vertical direction, but here we have sequences. That's why we need to backpropagate not only through layers but also through time, in the horizontal direction. That's why the algorithm for training recurrent neural networks is usually called backpropagation through time. Additionally, we need to remember that all the parameters are shared between time steps. Therefore, if we want to calculate the gradient of our loss with respect to some parameter, we actually need to sum up the gradients from all the time steps.
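As a small sketch of the "separate loss at each time step, then sum" idea, assuming cross-entropy losses on top of the rnn_forward sketch above (representing the true labels as class indices is also an assumption):

```python
import numpy as np

def total_loss(ys_hat, ys_true):
    """Total loss of the unfolded RNN: the sum of per-time-step losses.

    ys_hat:  predicted probability distributions, one per time step.
    ys_true: indices of the correct class (e.g. the next word or the POS tag).
    """
    loss = 0.0
    for y_hat, y in zip(ys_hat, ys_true):
        loss += -np.log(y_hat[y] + 1e-12)   # cross-entropy at this time step
    return loss
```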
Let's, for example, calculate the gradient of the loss function with respect to the weight matrix U. As I already said, we need to sum up the gradients from all the time steps in our sequence, and to calculate the gradient of the loss at a specific time step t, we can simply use the chain rule. As a result, we have this product: the first factor we can calculate if we know which loss function we used, and the second factor we can also calculate very easily, because our prediction depends on the matrix U in only one place.

Now let's consider a more difficult example. Let's calculate the gradient of our loss function with respect to the recurrent weight matrix W. In the first step, we do the same thing as before: we simply sum up the gradients from all the time steps in our sequence. We would like to do the same thing as before in the next step too, but if we look at the formula for the hidden units, we see that there is not only one dependence between our weight matrix W and the hidden units at time step t, because the hidden units at the previous time step also depend on W. So here we need to go deeper and use not the simple chain rule but the formula for the total derivative. After applying this formula, we end up with the following equation for the gradient of our loss function with respect to W. In the last part of this equation, you can see that first we take into account the dependence between the hidden units at time step t and the weight matrix W. Then we take one step back and take into account the dependence between the previous hidden units and the weight matrix W, and then we take one more step back, and so on, until we reach the beginning of our sequence. As a result, the last part of this equation is the sum of the contributions from all the previous time steps to the gradient at time step t. And to calculate the contribution from time step k to the gradient at time step t, we actually need to go from the hidden units at time step t to the hidden units at time step k.
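The equation itself is not shown in the transcript; in the notation assumed above, the total-derivative expansion described here looks roughly like this (the exact form on the slide may differ):

```latex
\frac{\partial L}{\partial W} = \sum_{t} \frac{\partial L_t}{\partial W},
\qquad
\frac{\partial L_t}{\partial W}
= \sum_{k=1}^{t}
  \frac{\partial L_t}{\partial \hat{y}_t}\,
  \frac{\partial \hat{y}_t}{\partial h_t}
  \left( \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right)
  \frac{\partial h_k}{\partial W}
```

Here the last factor is the immediate derivative of h_k with respect to W, treating h_{k-1} as a constant.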
In each element of the sum, we have a product of Jacobian matrices: the gradients of the hidden units at one time step with respect to the hidden units at the previous time step.

Now we have calculated the gradient of our loss function with respect to two weight matrices, U and W. When we calculated the gradient with respect to U, we needed to go backwards only through the layers, so in the vertical direction. When we calculated the gradient with respect to W, we needed to go backwards both in the vertical direction and in the horizontal direction, so backwards through layers and through time. Now we have the last weight matrix, V, and here is a question for you: do we really need to go backwards in time to calculate the gradient of our loss function with respect to the weight matrix V? Yes, here we have the same situation as with the recurrent weight matrix W, because the dependence between the hidden units and the weight matrix V is not only in one place: the hidden units from all the previous time steps also depend on V, so we need to go backwards in time to calculate this gradient.

Let's summarize what you've learned in this video. We now know what a simple Recurrent Neural Network is and how to train it using backpropagation through time. This backpropagation through time algorithm is actually just simple backpropagation, but with a fancy name. In the next video, we will discuss the difficulties that arise during the training of Recurrent Neural Networks.
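To make the difference between going backwards only through layers (for U) and backwards through layers and time (for W and V) concrete, here is a rough NumPy sketch of backpropagation through time for the toy RNN sketched earlier, again assuming softmax outputs trained with cross-entropy:

```python
import numpy as np

def rnn_backward(xs, hs, ys_hat, ys_true, V, W, U):
    """Backpropagation through time for the simple RNN sketched above.

    Every gradient is summed over time steps because the parameters are shared.
    dU needs no recursion through time, while dW and dV also receive a term
    carried backwards through the hidden states.
    """
    dV, dW, dU = np.zeros_like(V), np.zeros_like(W), np.zeros_like(U)
    db, dc = np.zeros(W.shape[0]), np.zeros(U.shape[0])
    dh_next = np.zeros(W.shape[0])             # gradient arriving from time step t+1

    for t in reversed(range(len(xs))):
        do = ys_hat[t].copy()
        do[ys_true[t]] -= 1.0                  # softmax + cross-entropy gradient w.r.t. the logits
        dU += np.outer(do, hs[t])              # vertical direction only
        dc += do
        dh = U.T @ do + dh_next                # gradient from the output AND from the future
        da = dh * (1.0 - hs[t] ** 2)           # backprop through tanh
        h_prev = hs[t - 1] if t > 0 else np.zeros_like(hs[t])
        dW += np.outer(da, h_prev)             # backwards through layers and time
        dV += np.outer(da, xs[t])              # also affected by the carried gradient
        db += da
        dh_next = W.T @ da                     # carry the gradient back to time step t-1
    return dV, dW, dU, db, dc
```

Note how dU is accumulated directly at each time step, while the updates for dW and dV include dh_next, the gradient carried backwards through the hidden states; that carried term is the horizontal direction discussed in the video.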