In the last video, you saw how the attention model allows a neural network to pay attention to only part of an input sentence while it's generating a translation, much like a human translator might. Let's now formalize that intuition into the exact details of how you would implement an attention model.

Same as in the previous video, let's assume you have an input sentence and you use a bidirectional RNN, bidirectional GRU, or bidirectional LSTM to compute features on every word. In practice, GRUs and LSTMs are often used for this, with LSTMs perhaps being more common. For the forward recurrence, you have a forward-recurrence activation for the first time step, a backward-recurrence activation for the first time step, a forward activation for the second time step, a backward activation, and so on, all the way up to the forward and backward activations for the fifth time step. We had a vector of all zeros here at the start of the forward recurrence; technically, we can also have a backward activation at position six that is a vector of all zeros.

To simplify the notation going forward, even though at every time step you have features computed from both the forward recurrence and the backward recurrence of the bidirectional RNN, I'm just going to use a<t'> to represent both of these concatenated together. So a<t'> is going to be the feature vector for time step t'. Although, to be consistent with the notation we're using for the second network, I'm going to call this t'; that is, I'm going to use t' to index into the words in the French sentence.

Next, we have a forward-only, single-direction RNN with state s to generate the translation. At the first time step, it should generate y<1>, and it will have as input some context c. If you want to index it with time, you could write c<1>, but sometimes I just write c without the superscript. This context will depend on the attention parameters alpha<1,1>, alpha<1,2>, and so on, which tell us how much attention to pay. So these alpha parameters tell us how much the context depends on the features, or activations, we're getting from the different time steps.
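To make the shapes concrete, here is a minimal numpy sketch of building the concatenated features a<t'>. A plain tanh recurrence stands in for the GRU/LSTM cells mentioned above, and all sizes and weights are made-up placeholders rather than anything from the course's actual implementation:

```python
import numpy as np

# Sketch of the bidirectional encoder features a<t'>, assuming Tx = 5 input words.
Tx, n_x, n_a = 5, 10, 4            # input length, word-vector size, hidden size
x = np.random.randn(Tx, n_x)       # embedded input sentence x<1..Tx> (placeholder)

Wf = np.random.randn(n_a, n_a + n_x) * 0.1   # forward-recurrence weights (placeholder)
Wb = np.random.randn(n_a, n_a + n_x) * 0.1   # backward-recurrence weights (placeholder)

a_fwd = np.zeros((Tx + 1, n_a))    # a_fwd[0] is the all-zeros initial state
for t in range(1, Tx + 1):
    a_fwd[t] = np.tanh(Wf @ np.concatenate([a_fwd[t - 1], x[t - 1]]))

a_bwd = np.zeros((Tx + 2, n_a))    # a_bwd[Tx + 1] is the all-zeros "position six" state
for t in range(Tx, 0, -1):
    a_bwd[t] = np.tanh(Wb @ np.concatenate([a_bwd[t + 1], x[t - 1]]))

# a<t'> = forward and backward activations concatenated, one per input word
a = np.stack([np.concatenate([a_fwd[t], a_bwd[t]]) for t in range(1, Tx + 1)])
print(a.shape)                     # (Tx, 2 * n_a)
```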
The way we define the context is as a weighted sum of the features from the different time steps, weighted by these attention weights. More formally, the attention weights satisfy two properties: they are all non-negative, and they sum to one. We'll see later how to make sure this is true. The context at time one (I'll often drop that superscript) is the sum, over all values of t', of these attention weights times the activations: c<1> = sum over t' of alpha<1,t'> * a<t'>. This term is the attention weight, and this term is the feature vector from the encoder.

So alpha<t,t'> is the amount of attention that y<t> should pay to a<t'>. In other words, when you're generating the t-th output word, it tells you how much you should be paying attention to the t'-th input word.

That's one step of generating the output. At the next time step, you generate the second output, and there is again a weighted sum, only now with a new set of attention weights defining a new weighted sum that produces a new context. This context is also an input, and it allows you to generate the second word. This weighted sum becomes the context of the second time step: c<2> = sum over t' of alpha<2,t'> * a<t'>.

Using these context vectors c<1>, c<2>, and so on, this network up here looks like a pretty standard RNN sequence with the context vectors as inputs, and we can just generate the translation one word at a time. We have also defined how to compute the context vectors in terms of these attention weights and the features of the input sentence. So the only remaining thing to do is to define how to actually compute these attention weights. Let's do that on the next slide.
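As a small illustration of the weighted sum just described, here is a minimal numpy sketch that combines the encoder features into a context vector, given a set of attention weights. The weights below are hand-picked example numbers, just to show the non-negativity and sum-to-one properties; in the real model they come from the softmax described next:

```python
import numpy as np

a = np.random.randn(5, 8)                         # Tx = 5 encoder feature vectors a<t'> (placeholder)
alpha_1 = np.array([0.6, 0.2, 0.1, 0.05, 0.05])   # example weights: non-negative, sum to one

assert np.all(alpha_1 >= 0) and np.isclose(alpha_1.sum(), 1.0)

# c<1> = sum over t' of alpha<1, t'> * a<t'>
c_1 = (alpha_1[:, None] * a).sum(axis=0)          # shape (8,)
```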
Just to recap, alpha<t,t'> is the amount of attention you should pay to a<t'> when you're trying to generate the t-th word in the output translation. Let me just write down the formula, and then we'll talk about how it works. To compute alpha<t,t'>, you first compute terms e<t,t'> and then use essentially a softmax to make sure the weights sum to one when you sum over t':

alpha<t,t'> = exp(e<t,t'>) / (sum over t' from 1 to Tx of exp(e<t,t'>))

So for every fixed value of t, these values sum to one if you're summing over t', and using this softmax ensures that they properly sum to one.

Now, how do we compute these factors e? One way to do so is to use a small neural network, as follows. s<t-1> is the neural network state from the previous time step. So here is the network we have: if you're trying to generate y<t>, then s<t-1> is the hidden state from the previous step, which feeds into s<t>, and that's one input to a very small neural network, usually with one hidden layer, because you need to compute these a lot. The other input is a<t'>, the features from time step t'.

The intuition is, if you want to decide how much attention to pay to the activation at t', the things it seems like it should depend on most are your own hidden-state activation from the previous time step (you don't have the current state activation yet, because the context feeds into it, so you haven't computed that) and, for each of the positions, each of the words, their features. So it seems pretty natural that alpha<t,t'> and e<t,t'> should depend on these two quantities, s<t-1> and a<t'>. But we don't know what the function is, so one thing you could do is just train a very small neural network to learn whatever this function should be, and trust backpropagation and gradient descent to learn the right function.

It turns out that if you implement this whole model and train it with gradient descent, the whole thing actually works. This little neural network does a pretty decent job of telling you how much attention y<t> should pay to a<t'>, and this formula makes sure the attention weights sum to one. Then, as you chug along generating one word at a time, the neural network actually pays attention to the right parts of the input sentence, and it learns all of this automatically using gradient descent.
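Here is a minimal numpy sketch of one attention step along these lines: a one-hidden-layer network scores e<t,t'> from s<t-1> and each a<t'>, a softmax over t' turns the scores into alpha<t,t'>, and the weighted sum gives the context. The layer sizes and random weights are placeholder assumptions, not values a trained model would have; in practice these would be learned by gradient descent along with the rest of the network:

```python
import numpy as np

Tx, n_s, n_af, n_hidden = 5, 6, 8, 10
s_prev = np.random.randn(n_s)          # decoder state s<t-1> (placeholder)
a = np.random.randn(Tx, n_af)          # encoder features a<1..Tx> (placeholder)

W1 = np.random.randn(n_hidden, n_s + n_af) * 0.1   # small network: one hidden layer
w2 = np.random.randn(n_hidden) * 0.1               # output layer giving the scalar e<t, t'>

# e<t, t'> for every input position t'
e = np.array([w2 @ np.tanh(W1 @ np.concatenate([s_prev, a[tp]]))
              for tp in range(Tx)])

# softmax over t' guarantees alpha<t, t'> >= 0 and that they sum to one
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

c_t = alpha @ a                        # context c<t>, shape (n_af,)
```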
Now, one downside to this algorithm is that it does take quadratic time, or quadratic cost, to run. If you have Tx words in the input and Ty words in the output, then the total number of these attention parameters is going to be Tx times Ty, so this algorithm runs with quadratic cost. Although in machine translation applications, where neither the input nor the output sentence is usually that long, maybe quadratic cost is actually acceptable. There is also some research work on trying to reduce this cost.

So far, I've been describing the attention idea in the context of machine translation. Without going too much into detail, this idea has been applied to other problems as well, such as image captioning. In the image captioning problem, the task is to look at a picture and write a caption for that picture. In the paper cited at the bottom, by Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio, they showed that you could have a very similar architecture that looks at a picture and pays attention to only parts of the picture at a time while it's writing a caption for the picture. If you're interested, I encourage you to take a look at that paper as well.

You get to play with all this and more in the programming exercise. Whereas machine translation is a very complicated problem, in the programming exercise you get to implement and play with the attention idea yourself for the date normalization problem. The problem is to take as input a date like this (this is actually the date of the Apollo Moon landing) and normalize it into a standard format, or to take a date like this and have a neural network, a sequence-to-sequence model, normalize it to this format. (This, by the way, is the birthday of William Shakespeare; at least, it's believed to be.) What you'll see in the programming exercise is that you can train a neural network to take as input dates in any of these formats and have it use an attention model to generate a normalized format for these dates.

One other thing that's sometimes fun to do is to look at visualizations of the attention weights. Here's a machine translation example, where the magnitudes of the different attention weights are plotted in different colors; a minimal sketch of how you might make such a plot follows.
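For instance, here is a minimal matplotlib sketch of that kind of visualization. The sentence pair and the attention matrix below are made up purely for illustration; a real plot would use the alphas produced by a trained model:

```python
import numpy as np
import matplotlib.pyplot as plt

input_words = ["jane", "visite", "l'afrique", "en", "septembre"]   # example French input
output_words = ["jane", "visits", "africa", "in", "september"]     # example English output

# alpha[t, t'] = attention paid to input word t' while generating output word t
alpha = np.random.rand(len(output_words), len(input_words))        # placeholder weights
alpha /= alpha.sum(axis=1, keepdims=True)                          # each row sums to one

plt.imshow(alpha, cmap="viridis")
plt.xticks(range(len(input_words)), input_words, rotation=45)
plt.yticks(range(len(output_words)), output_words)
plt.xlabel("input (French) words")
plt.ylabel("output (English) words")
plt.colorbar(label="attention weight")
plt.tight_layout()
plt.show()
```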
I don't want to spend too much time on this, but you'll find that for corresponding input and output words, the attention weights tend to be high, suggesting that when the network is generating a specific output word, it is usually paying attention to the correct words in the input. And all of this, including learning where to pay attention and when, was learned using backpropagation with an attention model.

So that's it for the attention model, really one of the most powerful ideas in deep learning. I hope you enjoy implementing and playing with these ideas yourself later in this week's programming exercises.