Hey. The attention mechanism is a super powerful technique in neural networks. So let us cover it first with some pictures and then with some formulas.

Just to recap, we have some encoder that has h states and a decoder that has some s states. Now, let us imagine that we want to produce the next decoder state, so we want to compute s_j. How can we do this? In the previous video, we just used the v vector, which held the information about the whole encoded input sentence. Instead of that, we can do something better: we can look into all the states of the encoder with some weights. These alphas denote weights that tell us whether it is important to look there or here.

How can we compute these alphas? Well, we want them to be probabilities, and we also want them to capture some similarity between our current moment in the decoder and the different moments in the encoder. This way, we will look into the more similar places, and they will give us the most important information to go on with our decoding.

If we describe the same thing with formulas, we will say that, instead of just one v vector that we had before, we now have v_j, which is different for different positions of the decoder. This v_j vector is computed as a weighted average of the encoder states, and the weights are computed with a softmax, because they need to be probabilities. This softmax is applied to similarities of encoder and decoder states.

Now, do you have any ideas how to compute those similarities? I have a few. Papers have actually tried lots and lots of different options, and there are just three options for you to try to memorize. Maybe the easiest option is at the bottom: let us just take the dot product of the encoder and decoder states. It will give us some understanding of their similarity. Another way is to say that maybe we need some weights there, some matrix that we need to learn, and it can help us capture the similarity better. This is called multiplicative attention.
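To make these formulas concrete, here is a minimal NumPy sketch of the computation just described: a softmax over similarity scores gives the alphas, and the context vector v_j is their weighted average of the encoder states. The names (encoder_states, decoder_state, W) are my own illustrative choices, not notation from the lecture slides, and the optional matrix W stands in for the learned matrix of multiplicative attention.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(encoder_states, decoder_state, W=None):
    """Compute the alphas and the context vector v_j for one decoder position.

    encoder_states: (I, d_h) array, one row per encoder state h_i.
    decoder_state:  (d_s,) array, the current decoder state s_j.
    W:              optional (d_h, d_s) matrix for multiplicative attention;
                    if None, plain dot-product scoring is used (needs d_h == d_s).
    """
    if W is None:
        scores = encoder_states @ decoder_state        # sim(h_i, s_j) = h_i . s_j
    else:
        scores = encoder_states @ (W @ decoder_state)  # sim(h_i, s_j) = h_i^T W s_j
    alphas = softmax(scores)                           # probabilities over encoder positions
    context = alphas @ encoder_states                  # v_j = sum_i alpha_ij * h_i
    return alphas, context

# Toy usage: 5 encoder states of size 4, one decoder state of size 4.
h = np.random.randn(5, 4)
s = np.random.randn(4)
alphas, v_j = attention_context(h, s)
print(alphas.sum())  # ~1.0, since the weights form a probability distribution
```

This covers the first two scoring options; the third one follows next.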
And maybe we just do not want to bother at all about how to compute the similarity ourselves. We just want to say, "Well, a neural network is something intelligent. Please do it for us." Then we simply take one layer of a neural network and say that it needs to predict these similarities. You can see there that you have h and s multiplied by some matrices and summed; that is why it is called additive attention. And then you have some non-linearity applied to it. These are three options, and you can also have many more.

Now, let us put all the things together, just to understand again how attention works. You have your conditional language modeling task: you try to predict the y sequence given the x sequence. Now you encode your x sequence into some v_j vector, which is different for every position. This v_j vector is used in the decoder: it is concatenated with the current input of the decoder. This way, the decoder is aware of all the information that it needs: the previous state, the current input, and now this specific context vector, computed especially for this current state.

Now, let us see where attention works. Neural machine translation had lots of problems with long sentences. You can see that the BLEU score for long sentences is really lower, though it is fine for short ones. Neural machine translation with attention can solve this problem, and it performs really nicely even for long sentences. Well, this is really intuitive, because attention helps to focus on different parts of the sentence when you do your predictions. And for long sentences this is really important, because otherwise you have to encode the whole sentence into just one vector, and that is obviously not enough.

Now, to better understand those alpha_ij weights that we have learned with the attention, let us try to visualize them. These weights can be visualized with an I-by-J matrix that shows, for every place in the decoder, the most promising place in the encoder. With the light dots here, you can see those words that are aligned. You see this is a very close analogy to the word alignments that we have covered before: we just learn that these words are somehow similar and relevant, and we should look into them to translate them into the other language.
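Pulling these pieces together, here is a rough sketch of one decoder step with additive attention, where the context vector is concatenated with the current decoder input, as described above. The parameter names (W_a, U_a, v_a, W_rnn, U_rnn) and the plain tanh recurrence are placeholder assumptions on my part, not the exact architecture from the lecture.

```python
import numpy as np

def additive_scores(H, s_prev, W_a, U_a, v_a):
    """Additive attention: score_i = v_a^T tanh(W_a s_prev + U_a h_i) for every h_i."""
    return np.tanh(H @ U_a.T + s_prev @ W_a.T) @ v_a       # shape (I,)

def decoder_step(H, s_prev, x_emb, params):
    """One decoder step with attention.

    H:      (I, d_h) encoder states.
    s_prev: (d_s,)   previous decoder state.
    x_emb:  (d_x,)   embedding of the current decoder input token.
    params: W_a (k, d_s), U_a (k, d_h), v_a (k,),
            W_rnn (d_s, d_x + d_h), U_rnn (d_s, d_s).
    """
    W_a, U_a, v_a, W_rnn, U_rnn = params
    scores = additive_scores(H, s_prev, W_a, U_a, v_a)
    alphas = np.exp(scores - scores.max()); alphas /= alphas.sum()  # softmax
    context = alphas @ H                                    # v_j for this position
    rnn_in = np.concatenate([x_emb, context])               # context concatenated with the input
    s_new = np.tanh(W_rnn @ rnn_in + U_rnn @ s_prev)        # plain tanh cell as a stand-in RNN
    return s_new, alphas                                    # keep alphas for the alignment plot
```

Stacking the returned alphas from every decoder step, row by row, gives exactly the matrix of weights that the alignment pictures above visualize.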
This is also a good place to note that we can take some techniques from the traditional methods, from word alignments, and incorporate them into neural machine translation. For example, priors for word alignments can really help here for neural machine translation.

Now, do you think that this attention technique is really similar to how humans translate real sentences? I mean, humans also look into some places and then translate those places; they have some attention. Do you see any differences? Well, actually, there is one important difference here. Humans save time with attention, because they look only at those places that are relevant. On the contrary, here we waste time, because to guess what the most relevant place is, we first need to check all the places and compute similarities for all the encoder states, and only then say, "Okay, this piece of the encoder is the most meaningful."

Now, the last story for this video is how to make this attention save time rather than waste time. It is called local attention, and the idea is rather simple. We say: let us first try to predict what the best place to look at is, and then, after that, we will look only into some window around this place, and we will not compute similarities for the whole sequence.

Now, first, how can you predict the best place? One easy way would be to say, "You know what? Those matrices should be strictly diagonal, and the place for position j should be j." Well, for some languages this might be really bad, because they have a different word order, and then you can try to predict the place instead. How do you do this? You have a sigmoid of something complicated. The sigmoid gives you a value between zero and one, and then you scale it by the length of the input sentence, I. So you see that this will indeed be something between zero and I, which means that you will get some position in the input sentence. Now, what is inside that sigmoid? Well, you see the current decoder state s_j, and you just apply some transformations to it, as usual in neural networks.
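Here is a minimal sketch of this predictive local attention. The parameters W_p and v_p, the fixed window size, and the reuse of plain dot-product scoring inside the window are all assumptions made for illustration, not the exact setup from the lecture.

```python
import numpy as np

def predict_center(s_j, W_p, v_p, max_pos):
    """Predicted center: max_pos * sigmoid(v_p^T tanh(W_p s_j)), a value in (0, max_pos)."""
    return max_pos * (1.0 / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ s_j)))))

def local_attention(H, s_j, W_p, v_p, window=2):
    """Score only the encoder states inside a window around the predicted center.

    H: (I, d_h) encoder states; s_j: (d_s,) current decoder state (assumes d_h == d_s
    for the dot-product scoring used here); W_p: (k, d_s); v_p: (k,).
    """
    I = H.shape[0]
    center = int(round(predict_center(s_j, W_p, v_p, I - 1)))
    lo, hi = max(0, center - window), min(I, center + window + 1)
    scores = H[lo:hi] @ s_j                                  # similarities inside the window only
    alphas = np.exp(scores - scores.max()); alphas /= alphas.sum()
    context = alphas @ H[lo:hi]
    return context, alphas, (lo, hi)
```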
Anyway, once you have this a_j position, you can say that you need to look only into this window and compute the similarities for the attention alphas as usual. Or you can also try to use some Gaussian to say that the words in the middle of the window are even more important: you can just multiply a Gaussian prior by those alpha weights that we were computing before.

Now, I want to show you a comparison of the different methods. You can see here that we have global attention and local attention, and for local attention we have monotonic predictions and the predictive approach. The last one performs the best. Do you remember what is inside the brackets here? These are the different ways to compute similarities for the attention weights. You remember the dot product and multiplicative attention? You could also have location-based attention, which is even simpler: it says that we should just take s_j and use it to compute those weights.

This is all for this presentation, and I am looking forward to seeing you in the next one.