Hey. The attention mechanism is a super powerful technique in neural networks. So let us cover it first with some pictures and then with some formulas.

Just to recap, we have some encoder that has h states and a decoder that has some s states. Now, let us imagine that we want to produce the next decoder state, so we want to compute s_j. How can we do this? In the previous video, we just used the v vector, which held the information about the whole encoded input sentence. Instead of that, we can do something better: we can look into all the states of the encoder with some weights. These alphas denote weights that tell us whether it is important to look there or here.

How can we compute these alphas? Well, we want them to be probabilities, and we also want them to capture some similarity between our current moment in the decoder and the different moments in the encoder. This way, we will look into the more similar places, and they will give us the most important information to go on with our decoding.

If we describe the same thing with formulas, we will say that, instead of just one v vector that we had before, we now have v_j, which is different for different positions of the decoder. This v_j vector is computed as a weighted average of the encoder states, and the weights are computed with a softmax, because they need to be probabilities. This softmax is applied to similarities of encoder and decoder states.

Now, do you have any ideas how to compute those similarities? I have a few. Papers have actually tried lots and lots of different options, and there are just three options for you to try to memorize. Maybe the easiest option is at the bottom: let us just take the dot product of the encoder and decoder states. It will give us some understanding of their similarity. Another way is to say that maybe we need some weights there, some matrix that we need to learn, and it can help us capture the similarity better. This is called multiplicative attention.
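To make these formulas concrete, here is a minimal NumPy sketch of the computation just described: a softmax over similarity scores gives the alphas, and the context vector v_j is their weighted average of the encoder states. The names (encoder_states, decoder_state, W) are my own illustrative choices, not notation from the lecture slides, and the optional matrix W stands in for the learned matrix of multiplicative attention.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(encoder_states, decoder_state, W=None):
    """Compute the alphas and the context vector v_j for one decoder position.

    encoder_states: (I, d_h) array, one row per encoder state h_i.
    decoder_state:  (d_s,) array, the current decoder state s_j.
    W:              optional (d_h, d_s) matrix for multiplicative attention;
                    if None, plain dot-product scoring is used (needs d_h == d_s).
    """
    if W is None:
        scores = encoder_states @ decoder_state        # sim(h_i, s_j) = h_i . s_j
    else:
        scores = encoder_states @ (W @ decoder_state)  # sim(h_i, s_j) = h_i^T W s_j
    alphas = softmax(scores)                           # probabilities over encoder positions
    context = alphas @ encoder_states                  # v_j = sum_i alpha_ij * h_i
    return alphas, context

# Toy usage: 5 encoder states of size 4, one decoder state of size 4.
h = np.random.randn(5, 4)
s = np.random.randn(4)
alphas, v_j = attention_context(h, s)
print(alphas.sum())  # ~1.0, since the weights form a probability distribution
```

This covers the first two scoring options; the third one follows next.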
And maybe we just do not want to bother at all about how to compute the similarity ourselves. We just want to say, "Well, a neural network is something intelligent. Please do it for us." Then we simply take one layer of a neural network and say that it needs to predict these similarities. You can see there that you have h and s multiplied by some matrices and summed; that is why it is called additive attention. And then you have some non-linearity applied to it. These are three options, and you can also have many more.

Now, let us put all the things together, just to understand again how attention works. You have your conditional language modeling task: you try to predict the y sequence given the x sequence. Now you encode your x sequence into some v_j vector, which is different for every position. This v_j vector is used in the decoder: it is concatenated with the current input of the decoder. This way, the decoder is aware of all the information that it needs: the previous state, the current input, and now this specific context vector, computed especially for this current state.

Now, let us see where attention works. Neural machine translation had lots of problems with long sentences. You can see that the BLEU score for long sentences is really lower, though it is fine for short ones. Neural machine translation with attention can solve this problem, and it performs really nicely even for long sentences. Well, this is really intuitive, because attention helps to focus on different parts of the sentence when you do your predictions. And for long sentences this is really important, because otherwise you have to encode the whole sentence into just one vector, and that is obviously not enough.

Now, to better understand those alpha_ij weights that we have learned with the attention, let us try to visualize them. These weights can be visualized with an I-by-J matrix that shows, for every place in the decoder, the most promising place in the encoder. With the light dots here, you can see those words that are aligned. You see this is a very close analogy to the word alignments that we have covered before: we just learn that these words are somehow similar and relevant, and we should look into them to translate them into the other language.
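Pulling these pieces together, here is a rough sketch of one decoder step with additive attention, where the context vector is concatenated with the current decoder input, as described above. The parameter names (W_a, U_a, v_a, W_rnn, U_rnn) and the plain tanh recurrence are placeholder assumptions on my part, not the exact architecture from the lecture.

```python
import numpy as np

def additive_scores(H, s_prev, W_a, U_a, v_a):
    """Additive attention: score_i = v_a^T tanh(W_a s_prev + U_a h_i) for every h_i."""
    return np.tanh(H @ U_a.T + s_prev @ W_a.T) @ v_a       # shape (I,)

def decoder_step(H, s_prev, x_emb, params):
    """One decoder step with attention.

    H:      (I, d_h) encoder states.
    s_prev: (d_s,)   previous decoder state.
    x_emb:  (d_x,)   embedding of the current decoder input token.
    params: W_a (k, d_s), U_a (k, d_h), v_a (k,),
            W_rnn (d_s, d_x + d_h), U_rnn (d_s, d_s).
    """
    W_a, U_a, v_a, W_rnn, U_rnn = params
    scores = additive_scores(H, s_prev, W_a, U_a, v_a)
    alphas = np.exp(scores - scores.max()); alphas /= alphas.sum()  # softmax
    context = alphas @ H                                    # v_j for this position
    rnn_in = np.concatenate([x_emb, context])               # context concatenated with the input
    s_new = np.tanh(W_rnn @ rnn_in + U_rnn @ s_prev)        # plain tanh cell as a stand-in RNN
    return s_new, alphas                                    # keep alphas for the alignment plot
```

Stacking the returned alphas from every decoder step, row by row, gives exactly the matrix of weights that the alignment pictures above visualize.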
This is also a good place to note that we can take some techniques from the traditional methods, from word alignments, and incorporate them into neural machine translation. For example, priors for word alignments can really help here for neural machine translation.

Now, do you think that this attention technique is really similar to how humans translate real sentences? I mean, humans also look into some places and then translate those places; they have some attention. Do you see any differences? Well, actually, there is one important difference here. Humans save time with attention, because they look only at those places that are relevant. On the contrary, here we waste time, because to guess what the most relevant place is, we first need to check all the places and compute similarities for all the encoder states, and only then say, "Okay, this piece of the encoder is the most meaningful."

Now, the last story for this video is how to make this attention save time rather than waste time. It is called local attention, and the idea is rather simple. We say: let us first try to predict what the best place to look at is, and then, after that, we will look only into some window around this place, and we will not compute similarities for the whole sequence.

Now, first, how can you predict the best place? One easy way would be to say, "You know what? Those matrices should be strictly diagonal, and the place for position j should be j." Well, for some languages this might be really bad, because they have a different word order, and then you can try to predict the place instead. How do you do this? You have a sigmoid of something complicated. The sigmoid gives you a value between zero and one, and then you scale it by the length of the input sentence, I. So you see that this will indeed be something between zero and I, which means that you will get some position in the input sentence. Now, what is inside that sigmoid? Well, you see the current decoder state s_j, and you just apply some transformations to it, as usual in neural networks.
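Here is a minimal sketch of this predictive local attention. The parameters W_p and v_p, the fixed window size, and the reuse of plain dot-product scoring inside the window are all assumptions made for illustration, not the exact setup from the lecture.

```python
import numpy as np

def predict_center(s_j, W_p, v_p, max_pos):
    """Predicted center: max_pos * sigmoid(v_p^T tanh(W_p s_j)), a value in (0, max_pos)."""
    return max_pos * (1.0 / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ s_j)))))

def local_attention(H, s_j, W_p, v_p, window=2):
    """Score only the encoder states inside a window around the predicted center.

    H: (I, d_h) encoder states; s_j: (d_s,) current decoder state (assumes d_h == d_s
    for the dot-product scoring used here); W_p: (k, d_s); v_p: (k,).
    """
    I = H.shape[0]
    center = int(round(predict_center(s_j, W_p, v_p, I - 1)))
    lo, hi = max(0, center - window), min(I, center + window + 1)
    scores = H[lo:hi] @ s_j                                  # similarities inside the window only
    alphas = np.exp(scores - scores.max()); alphas /= alphas.sum()
    context = alphas @ H[lo:hi]
    return context, alphas, (lo, hi)
```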
Anyway, once you have this a_j position, you can say that you need to look only into this window and compute the similarities for the attention alphas as usual. Or you can also try to use some Gaussian to say that the words in the middle of the window are even more important: you can just multiply a Gaussian prior by those alpha weights that we were computing before.

Now, I want to show you a comparison of the different methods. You can see here that we have global attention and local attention, and for local attention we have monotonic predictions and the predictive approach. The last one performs the best. Do you remember what is inside the brackets here? These are the different ways to compute similarities for the attention weights. You remember the dot product and multiplicative attention? You could also have location-based attention, which is even simpler: it says that we should just take s_j and use it to compute those weights.

This is all for this presentation, and I am looking forward to seeing you in the next one.