For most of this week, you've been using an encoder-decoder architecture for machine translation, where one RNN reads in a sentence and then a different one outputs a sentence. There's a modification to this called the attention model that makes all of this work much better. The attention idea has been one of the most influential ideas in deep learning. Let's take a look at how it works.

Take a very long French sentence like this. What we are asking this green encoder network to do is to read in the whole sentence, memorize it, and store it in the activations here. Then the purple network, the decoder network, generates the English translation: "Jane went to Africa last September and enjoyed the culture and met many wonderful people; she came back raving about how wonderful her trip was, and is tempting me to go too."

Now, the way a human translator would translate this sentence is not to first read the whole French sentence, memorize the whole thing, and then regurgitate an English sentence from scratch. Instead, what the human translator would do is read the first part of it, maybe generate part of the translation, look at the second part, generate a few more words, look at a few more words, generate a few more words, and so on. You work part by part through the sentence, because it's just really difficult to memorize a whole long sentence like that.

What you see with the encoder-decoder architecture above is that it works quite well for short sentences, so we might achieve a relatively high BLEU score, but for very long sentences, maybe longer than 30 or 40 words, the performance comes down. The BLEU score might look like this as the sentence length varies: very short sentences are hard to translate, hard to get all the words right, and on long sentences it doesn't do well, because it's just difficult to get a neural network to memorize a super long sentence.
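To make that memorization bottleneck concrete, here is a minimal sketch, not anything from the lecture itself, of a plain encoder in NumPy with made-up toy weights and dimensions: it has to squeeze an entire 40-word sentence into one fixed-size activation vector before the decoder ever sees it, which is exactly what gets hard as sentences grow long.

```python
import numpy as np

# Toy sketch (illustrative weights and sizes only): a vanilla RNN encoder that
# must compress the whole source sentence into a single fixed-size vector.
np.random.seed(0)
hidden_dim, embed_dim = 8, 6
Wx = np.random.randn(hidden_dim, embed_dim) * 0.1   # input-to-hidden weights
Wh = np.random.randn(hidden_dim, hidden_dim) * 0.1  # hidden-to-hidden weights
b = np.zeros(hidden_dim)

def encode(word_embeddings):
    """Run a simple RNN over the source sentence; keep only the final state."""
    h = np.zeros(hidden_dim)
    for x in word_embeddings:
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h  # the one vector the decoder has to work from

sentence = [np.random.randn(embed_dim) for _ in range(40)]  # a 40-word sentence
context = encode(sentence)
print(context.shape)  # (8,) -- everything about 40 words packed into 8 numbers
```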
In this and the next video, you'll see the attention model, which translates maybe a bit more like humans might, looking at part of the sentence at a time. With an attention model, a machine translation system's performance can look like this, because by working on one part of the sentence at a time, you don't see this huge dip, which is really measuring the ability of a neural network to memorize a long sentence, which maybe isn't what we most badly need a neural network to do. In this video, I want to just give you some intuition about how attention works, and then we'll flesh out the details in the next video.

The attention model is due to Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, and even though it was originally developed for machine translation, it spread to many other application areas as well. This is really a very influential, I think very seminal, paper in the deep learning literature.

Let's illustrate this with a short sentence. Even though these ideas were maybe developed more for long sentences, it'll be easier to illustrate them with a simpler example. We have our usual sentence, "Jane visite l'Afrique en septembre." Let's say that we use an RNN, and in this case I'm going to use a bidirectional RNN, in order to compute some set of features for each of the input words. To be clear, this is a bidirectional RNN with outputs Y<1>, Y<2>, and so on up to Y<5>, but we're not doing a word-for-word translation, so let me get rid of the Y's on top. Using a bidirectional RNN, what we've done is, for each of the words, really for each of the five positions in the sentence, compute a very rich set of features about the word and maybe the surrounding words at every position.

Now, let's go ahead and generate the English translation. We're going to use another RNN to generate the English translation.
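As a rough sketch of what those per-position features might be, the toy NumPy code below (dimensions and weights are invented for illustration, not taken from the lecture) runs one RNN pass left-to-right and one right-to-left over the five-word sentence and concatenates the forward and backward activations at each position.

```python
import numpy as np

# Hedged sketch of a bidirectional encoder: the feature for position t' is the
# concatenation of the forward and backward RNN activations at t'.
np.random.seed(1)
hidden_dim, embed_dim = 4, 6
Wf_x = np.random.randn(hidden_dim, embed_dim) * 0.1   # forward-pass weights
Wf_h = np.random.randn(hidden_dim, hidden_dim) * 0.1
Wb_x = np.random.randn(hidden_dim, embed_dim) * 0.1   # backward-pass weights
Wb_h = np.random.randn(hidden_dim, hidden_dim) * 0.1

def rnn_pass(xs, Wx, Wh):
    h, states = np.zeros(hidden_dim), []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return states

def bidirectional_features(xs):
    forward = rnn_pass(xs, Wf_x, Wf_h)                 # activations left to right
    backward = rnn_pass(xs[::-1], Wb_x, Wb_h)[::-1]    # activations right to left
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]

# "Jane visite l'Afrique en septembre" -> 5 positions, one rich feature each
xs = [np.random.randn(embed_dim) for _ in range(5)]
features = bidirectional_features(xs)
print(len(features), features[0].shape)  # 5 (8,)
```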
Here's my RNN node as usual, and instead of using A to denote the activation, in order to avoid confusion with the activations down here, I'm just going to use a different notation: I'm going to use S to denote the hidden state in this RNN up here, so instead of writing A<1> I'm going to write S<1>. We hope that the first word this model generates will be "Jane," as it goes on to generate "Jane visits Africa in September."

Now, the question is, when you're trying to generate this first word, this output, what part of the input French sentence should you be looking at? It seems like you should be looking primarily at this first word, maybe a few other words close by, but you don't need to be looking way at the end of the sentence. What the attention model computes is a set of attention weights, and we're going to use Alpha<1,1> to denote, when you're generating the first word, how much you should be paying attention to this first piece of information here. Then we'll also come up with a second attention weight, Alpha<1,2>, which tells us, when we're trying to compute the first word, "Jane," how much attention we're paying to this second word from the input, and so on with Alpha<1,3> and so forth. Together, these tell us exactly what the context, which we'll denote C, is that we should be paying attention to, and that is input to this RNN unit to then try to generate the first word. That's one step of the RNN; we will flesh out all these details in the next video.

For the second step of this RNN, we're going to have a new hidden state, S<2>, and a new set of attention weights. We're going to have Alpha<2,1> to tell us, when we generate the second word (I guess this will be "visits," that being the ground truth label), how much we should be paying attention to the first word in the French input, and also Alpha<2,2> and so on: how much should we be paying attention to the word "visite," how much should we be paying attention to the word "l'Afrique," and so on.
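As a small worked example of that first step, here is a sketch in NumPy. The attention weights below are hand-set purely for illustration (the actual formula for computing them is deferred to the next video): they are non-negative, sum to 1, and blend the encoder features into one context vector C<1> used when generating "Jane."

```python
import numpy as np

# Sketch of step 1: attention weights Alpha<1,1>..Alpha<1,5> (hand-set here,
# not learned) decide how much each French position contributes to the
# context C<1> that feeds the decoder when it generates the first English word.
np.random.seed(2)
feature_dim, Tx = 8, 5
a = [np.random.randn(feature_dim) for _ in range(Tx)]   # encoder features a<1>..a<5>

# When generating "Jane", most weight plausibly sits on the first word "Jane".
scores = np.array([3.0, 1.0, 0.2, 0.1, 0.1])            # unnormalized relevance
alpha_1 = np.exp(scores) / np.exp(scores).sum()         # softmax -> sums to 1

c_1 = sum(alpha_1[t] * a[t] for t in range(Tx))         # context for step 1
print(alpha_1.round(3), c_1.shape)                      # weights and an 8-dim context
```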
Of course, the first word we generated, "Jane," is also an input to this step, and then we have some context that we're paying attention to at the second step; that's also an input, and together they generate the second word. That leads us to the third step, S<3>, where this is an input and we have some new context C that depends on the various Alpha<3,t'> for the different time steps, which tells us how much we should be paying attention to the different words of the input French sentence, and so on.

There are some things I haven't specified yet that we'll go into in further detail in the next video, such as how exactly this context is defined. The goal of the context for the third word is really that it should capture that maybe we should be looking around this part of the sentence. The formula you use to do that I'll defer to the next video, as well as how you compute these attention weights. You'll see in the next video that Alpha<3,t'>, which is, when you're trying to generate the third word (I guess this would be "Africa," to get the right output), the amount that this RNN step should be paying attention to the French word at time t', depends on the activations of the bidirectional RNN at time t', I guess on the forward activations and the backward activations at time t', and it will also depend on the state from the previous step; it will depend on S<2>. These things together will influence how much you pay attention to a specific word in the input French sentence. But we'll flesh out all these details in the next video.

The key intuition to take away is that this way the RNN marches forward, generating one word at a time, until eventually it generates maybe the <EOS> token, and at every step there are these attention weights Alpha<t,t'> that tell it, when you're trying to generate the t-th English word, how much you should be paying attention to the t'-th French word. This allows it, on every time step, to look only maybe within a local window of the French sentence to pay attention to when generating a specific English word.
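Putting the pieces together, the sketch below is a hedged, end-to-end toy of the decoding loop just described. The exact scoring function for the attention weights is left to the next video, so a small made-up linear scorer stands in for it, and every weight and the tiny vocabulary are invented placeholders, so the untrained model's word choices are arbitrary; the point is only the shape of the loop: score, softmax, context, state update, next word, until <EOS>.

```python
import numpy as np

# Toy decoding loop: at each step t, alpha<t,t'> is computed from the previous
# decoder state s<t-1> and the encoder feature a<t'>, the context is the
# weighted sum of features, and the decoder advances one word at a time.
np.random.seed(3)
feature_dim, state_dim, Tx = 8, 6, 5
vocab = ["Jane", "visits", "Africa", "in", "September", "<EOS>"]
a = [np.random.randn(feature_dim) for _ in range(Tx)]          # encoder features

w_score = np.random.randn(state_dim + feature_dim) * 0.1       # stand-in alignment scorer
W_s = np.random.randn(state_dim, state_dim + feature_dim) * 0.1
W_y = np.random.randn(len(vocab), state_dim) * 0.1

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

s = np.zeros(state_dim)
for t in range(1, 10):                                          # cap the loop length
    e = np.array([w_score @ np.concatenate([s, a_tp]) for a_tp in a])
    alpha = softmax(e)                                          # attention weights, sum to 1
    c = sum(alpha[tp] * a[tp] for tp in range(Tx))              # context for this step
    s = np.tanh(W_s @ np.concatenate([s, c]))                   # next decoder state s<t>
    word = vocab[int(np.argmax(W_y @ s))]                       # untrained, so arbitrary
    print(t, word, alpha.round(2))
    if word == "<EOS>":
        break
```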
I hope this video conveyed some intuition about the attention model and that you now have a rough sense of how the algorithm works. Let's go on to the next video to flesh out the details of the attention model.