In this video, you'll learn about the beam search algorithm. In the last video, you saw how for machine translation, given an input French sentence, you don't want to output a random English translation; you want to output the best, most likely English translation. The same is true for speech recognition, where given an input audio clip, you don't want to output a random text transcript of that audio; you want to output the best, maybe the most likely, text transcript. Beam search is the most widely used algorithm to do this, and in this video you'll see how to get beam search to work for yourself.

Let's try beam search using our running example of the French sentence "Jane visite l'Afrique en septembre", hopefully translated into "Jane visits Africa in September". The first thing beam search has to do is pick the first word of the English translation it's going to output. Here I've listed, say, the 10,000 words in the vocabulary, and to simplify the problem a bit, I'm going to ignore capitalization, so I'm listing all the words in lowercase. In the first step of beam search, I use this network fragment, with the encoder in green and the decoder in purple, to evaluate the probability of the first word. That is, what's the probability of the first output y⟨1⟩, given the input French sentence x? Whereas greedy search picks only the single most likely word and moves on, beam search can consider multiple alternatives. The beam search algorithm has a parameter B, called the beam width, and for this example I'm going to set the beam width to three. What this means is that beam search will consider not just one possibility but three at a time. In particular, let's say that, evaluating this probability over different choices of the first word, it finds that "in", "Jane", and "September" are the three most likely possibilities for the first word of the English output.
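To make this first step concrete, here's a minimal Python/NumPy sketch of keeping the top-B first words. The dummy distribution is just a stand-in for the softmax output the real encoder-decoder would produce:

```python
import numpy as np

B = 3  # beam width

# Dummy stand-in for the decoder's first softmax output over the
# 10,000-word vocabulary, i.e. P(y<1> | x); in the real model this
# comes from running x through the encoder and one decoder step.
p_y1 = np.random.default_rng(0).dirichlet(np.ones(10000))

# Keep the B most likely first words (indices into the vocabulary),
# and store their log-probabilities for the steps that follow.
top_b = np.argsort(p_y1)[-B:][::-1]
top_b_logprob = np.log(p_y1[top_b])
```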
Beam search will then store away in computer memory that it wants to try all three of these words. If the beam width parameter were set differently, say the beam width were 10, then we would keep track of not just three but the ten most likely choices for the first word. So, to be clear, in order to perform this first step of beam search, what you do is run the input French sentence through the encoder network, and then the first step of the decoder network is a softmax output over all 10,000 possibilities. You then take those 10,000 possible outputs and keep in memory the top three.

Let's go on to the second step of beam search. Having picked "in", "Jane", and "September" as the three most likely choices for the first word, what beam search does now is, for each of these three choices, consider what the second word should be. After "in", maybe the second word is "a", or maybe it's "aaron"; I'm just listing words from the vocabulary. Somewhere down the list will be "september", somewhere further down there's "visit", and then all the way down to the last word, "zulu". To evaluate the probability of the second word, it uses this new network fragment, again with the encoder in green and the decoder in purple. For the decoder portion, when trying to decide what comes after "in", remember that the decoder first outputs ŷ⟨1⟩. So I'm going to set ŷ⟨1⟩ to the word "in" and feed it back in, because we're conditioning on the first word being "in" while trying to figure out the second word, and the decoder will then output ŷ⟨2⟩. By hardwiring ŷ⟨1⟩, really the input here, to be the first word "in", this network fragment can be used to evaluate the probability of the second word given the input French sentence and given that the first word of the translation is "in".
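A minimal sketch of this conditioning step, with the decoder fragment stubbed out as a hypothetical helper so the snippet runs; in the real network, ŷ⟨1⟩ is fed back as the decoder's next input:

```python
import numpy as np

V = 10000  # vocabulary size

def second_word_probs(x, first_word):
    # Hypothetical stand-in for the decoder fragment with y-hat<1> hardwired
    # to `first_word`: it should return P(y<2> | x, y<1> = first_word) over
    # the whole vocabulary. A seeded dummy distribution is used so this runs.
    rng = np.random.default_rng(abs(hash((x, first_word))) % (2**32))
    return rng.dirichlet(np.ones(V))

# One call per kept first word -- here, conditioning on "in":
p_y2_given_in = second_word_probs("jane visite l'afrique en septembre", "in")
```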
Now, notice that what we really need out of this second step is to find the pair of the first and second words that is most likely. It's not just that the second word is most likely; it's that the pair of the first and second words together is most likely. By the rules of conditional probability, this can be expressed as P(y⟨1⟩ | x) times P(y⟨2⟩ | x, y⟨1⟩), where the second factor is what you get from this network fragment. So if, for each of the three words you've chosen, "in", "Jane", and "September", you save away the probability P(y⟨1⟩ | x), then you can multiply it by these second probabilities to get the probability of the first and second words together.

You've now seen how, if the first word was "in", you can evaluate the probability of the second word. If the first word was "Jane", you do the same thing. The sentence could be "Jane a", "Jane aaron", and so on down to "Jane is", "Jane visits", and so on. You use this network fragment, let me draw this in as well, where here you hardwire ŷ⟨1⟩ to be "Jane". With the first word ŷ⟨1⟩ hardwired as "Jane", this network fragment tells you the probability of the second word given the input x and given that the first word is "Jane". And then, same as above, you can multiply by P(y⟨1⟩ | x) to get the probability of y⟨1⟩ and y⟨2⟩ for each of these 10,000 different possible choices of the second word. And then, finally, you do the same thing for "September", considering all words from "a" down to "zulu", and use this network fragment, drawn in as well, to see which are the most likely options for the second word if the first word was "September".

So, for this second step of beam search, because we're continuing to use a beam width of three and because there are 10,000 words in the vocabulary, you end up considering 3 × 10,000, or 30,000, possibilities: there are 10,000 here, 10,000 here, and 10,000 here, the beam width times the number of words in the vocabulary. What you do is evaluate all 30,000 options according to the probability of the first and second words, and then pick the top three.
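In code, and working in log space to avoid numerical underflow when multiplying many small probabilities, this step might look like the following sketch; the distributions here are dummies standing in for the network's actual outputs:

```python
import numpy as np

B, V = 3, 10000
rng = np.random.default_rng(0)

# Dummy stand-ins for the network's outputs:
# log P(y<1> | x) for the B kept first words, and
# log P(y<2> | x, y<1>) from each of the B decoder fragments.
log_p_y1 = np.log(rng.dirichlet(np.ones(V))[:B])       # shape (B,)
log_p_y2 = np.log(rng.dirichlet(np.ones(V), size=B))   # shape (B, V)

# Joint log-probability of every (first word, second word) pair:
# log P(y<1>, y<2> | x) = log P(y<1> | x) + log P(y<2> | x, y<1>)
joint = log_p_y1[:, None] + log_p_y2                   # (B, V): 30,000 pairs

# Keep the top B of the B*V candidates.
flat_top = np.argsort(joint.ravel())[-B:][::-1]
pairs = [(int(i) // V, int(i) % V) for i in flat_top]  # (beam index, word index)
```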
So we cut these 30,000 possibilities down to three again, down to the beam width. Let's say that of those 30,000 choices, the most likely were "in September", and say "Jane is", and "Jane visits". Sorry, this is a bit messy, but those are the most likely three out of the 30,000 choices, so that's what beam search would memorize and take on to the next step of beam search. Notice one thing: if beam search decides that the most likely choices for the first and second words are "in September", or "Jane is", or "Jane visits", then that means it is now rejecting "September" as a candidate for the first word of the output English translation. So we're now down to two possibilities for the first word, but we still have a beam width of three, keeping track of three choices for pairs of y⟨1⟩, y⟨2⟩.

Before going on to the third step of beam search, I just want to note that because the beam width is equal to three, at every step you instantiate three copies of the network to evaluate these partial sentence fragments and their outputs. And it's because the beam width is equal to three that you have three copies of the network, with different choices for the first word, but these three copies of the network can be used very efficiently to evaluate all 30,000 options for the second word. So you don't instantiate 30,000 copies of the network; you need only three copies to quickly evaluate all 10,000 possible outputs at the softmax output, say for ŷ⟨2⟩.
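At this point, everything beam search carries forward is just three (prefix, log-probability) pairs. A minimal sketch of that state, with made-up numbers purely for illustration:

```python
import math

# Illustrative beam state carried into step 3: three (prefix, log-prob) pairs.
# The words are from the example above; the probabilities are invented.
# Note that "september" no longer survives as a *first* word.
beam = [
    (["in", "september"], math.log(0.05)),
    (["jane", "is"],      math.log(0.04)),
    (["jane", "visits"],  math.log(0.03)),
]
```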
Let's just quickly illustrate one more step of beam search. We said that the most likely choices for the first two words were "in September", "Jane is", and "Jane visits", and for each of these pairs of words, we have saved away in computer memory the probability of y⟨1⟩ and y⟨2⟩ given the input x, the French sentence. So, similar to before, we now want to consider what the third word should be. Is it "in September a"? "In September aaron"? All the way down to "in September zulu". To evaluate possible choices for the third word, you use this network fragment, where you hardwire the first word here to be "in" and the second word to be "September". And so this network fragment allows you to evaluate the probability of the third word given the input French sentence x and given that the first two words of the English output are "in September". You then do the same thing for the second fragment, "Jane is", like so, and the same thing for "Jane visits". Beam search will then once again pick the top three possibilities: maybe it finds that "in September, Jane" is a likely outcome, or "Jane is visiting" is likely, or maybe "Jane visits Africa" is likely for the first three words. Then it keeps going: you go on to the fourth step of beam search, add one more word, and on it goes. The outcome of this process, hopefully, is that, adding one word at a time, beam search will decide that "Jane visits Africa in September", terminated by the end-of-sentence symbol (using that symbol is quite common), is a likely output English sentence. You'll see more details of this yourself in this week's programming exercise, where you get to play with beam search.

So, with a beam width of three, beam search considers three possibilities at a time. Notice that if the beam width were set equal to one, so it considers only one possibility at a time, this essentially becomes the greedy search algorithm, which we discussed in the last video. But by considering multiple possibilities, say three or ten or some other number, at the same time, beam search will usually find a much better output sentence than greedy search.

You've now seen how beam search works, but it turns out there are some additional tips and tricks that help you make beam search work even better. Let's go on to the next video to take a look.
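To recap the whole procedure in one place, here's a compact sketch of the loop. The next_word_log_probs helper is a hypothetical stand-in for one decoder copy (a seeded dummy distribution is used so the sketch runs), the EOS index is assumed, and with B = 1 this reduces to greedy search:

```python
import numpy as np

V, EOS = 10000, 0  # vocabulary size; index of the <EOS> token (assumed)

def next_word_log_probs(x, prefix):
    # Hypothetical stand-in for one decoder "copy": in a real model this
    # would hardwire `prefix` into the decoder inputs and return
    # log P(next word | x, prefix) over the vocabulary.
    rng = np.random.default_rng(abs(hash((x, tuple(prefix)))) % (2**32))
    return np.log(rng.dirichlet(np.ones(V)))

def beam_search(x, B=3, max_len=20):
    beam = [([], 0.0)]          # (prefix, log-prob); start from the empty prefix
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beam:                    # one decoder copy per beam
            logp = next_word_log_probs(x, prefix)  # all V continuations at once
            # The global top B pairs must lie within each beam's own top B,
            # so keeping only B continuations per beam loses nothing.
            for w in np.argsort(logp)[-B:]:
                candidates.append((prefix + [int(w)], lp + logp[w]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for prefix, lp in candidates:              # keep the top B overall
            (completed if prefix[-1] == EOS else beam).append((prefix, lp))
            if len(beam) == B:
                break
        if not beam:                               # all hypotheses hit <EOS>
            break
    return max(completed + beam, key=lambda c: c[1])

best_prefix, best_logprob = beam_search("jane visite l'afrique en septembre")
```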