In the last video, you saw the basic beam search algorithm. In this video, you'll learn some small changes that make it work even better.

Length normalization is a small change to the beam search algorithm that can help you get much better results. Here's what it is. Beam search is maximizing this probability, and this product here is just expressing the observation that P(y<1>, …, y<Ty> | x) can be written as P(y<1> | x) times P(y<2> | x, y<1>) times, dot dot dot, up to P(y<Ty> | x, y<1>, …, y<Ty−1>). Maybe this notation is a bit more intimidating than it needs to be, but these are the probabilities you saw previously.

Now, if you're implementing this, these probabilities are all numbers less than 1; often they're much less than 1. And multiplying a lot of numbers less than 1 results in a tiny, tiny, tiny number, which can cause numerical underflow, meaning that it's too small for the floating-point representation in your computer to store accurately. So in practice, instead of maximizing this product, we take logs. If you insert a log there, then the log of a product becomes a sum of logs, and maximizing this sum of log probabilities should give you the same result in terms of selecting the most likely sentence y. So by taking logs, you end up with a more numerically stable algorithm that is less prone to numerical rounding errors, or really to numerical underflow. And because the log function, the logarithmic function, is a strictly monotonically increasing function, we know that maximizing log P(y | x) should give you the same result as maximizing P(y | x); that is, the same value of y that maximizes one should also maximize the other. So in most implementations, you keep track of the sum of the logs of the probabilities rather than the product of the probabilities.

Now, there's one other change to this objective function that makes the machine translation algorithm work even better.
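To make the underflow point concrete, here is a minimal Python sketch. The per-token probabilities are made-up numbers for illustration (they don't come from any particular model); it just compares multiplying raw probabilities against summing their logs.

```python
import math

# Hypothetical per-token probabilities P(y<t> | x, y<1..t-1>) for one candidate
# translation, as they might come out of the decoder's softmax.
token_probs = [0.2, 0.05, 0.1, 0.3, 0.08]

# Multiplying raw probabilities shrinks fast and underflows for long sentences.
product_score = 1.0
for p in token_probs:
    product_score *= p

# Summing log probabilities is the numerically stable equivalent:
# log(p1 * p2 * ... * pT) = log p1 + log p2 + ... + log pT.
log_score = sum(math.log(p) for p in token_probs)

print(product_score)        # tiny number, about 2.4e-05 here
print(log_score)            # a moderate negative number, about -10.6
print(math.exp(log_score))  # recovers the same product, up to rounding
```

Because log is monotonically increasing, ranking candidates by log_score gives the same winner as ranking by the raw product, without the underflow risk.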
Which is that, if you refer to this original objective up here, if you have a very long sentence, the probability of that sentence is going to be low, because you're multiplying many terms here, lots of numbers less than 1, to estimate the probability of that sentence. And so if you multiply all these numbers that are less than 1 together, you just tend to end up with a smaller probability.

So this objective function has an undesirable effect: it unnaturally tends to prefer very short translations; it tends to prefer very short outputs. That's because the probability of a short sentence is obtained by multiplying fewer of these numbers less than 1, so the product just isn't quite as small. And by the way, the same thing is true for the log version: the log of a probability is always less than or equal to 0, so you're in that part of the log curve, and the more terms you add up, the more negative this sum becomes.

So there's one other change to the algorithm that makes it work better, which is that instead of using this as the objective you're trying to maximize, one thing you could do is normalize it by the number of words in your translation. This takes the average of the log probability of each word, and it significantly reduces the penalty for outputting longer translations. And in practice, as a heuristic, instead of dividing by Ty, the number of words in the output sentence, sometimes you use a softer approach: you divide by Ty to the power of alpha, where maybe alpha is equal to 0.7, so the objective becomes (1 / Ty^alpha) times the sum over t of log P(y<t> | x, y<1>, …, y<t−1>). If alpha were equal to 1, then yes, you're completely normalizing by length. If alpha were equal to 0, then Ty to the 0 would be 1, and you're just not normalizing at all. So this is somewhere in between full normalization and no normalization, and alpha is another hyperparameter that you can tune to try to get the best results.

And I have to admit, using alpha this way is a heuristic, or a hack. There isn't a great theoretical justification for it, but people have found that it works well in practice, so many groups do this. And you can try out different values of alpha and see which one gives you the best result.
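Here is a small sketch of that length-normalization heuristic. The helper name and the probability values are invented for illustration; the point is only to show how normalizing by Ty^alpha changes which candidate wins.

```python
import math

def normalized_log_prob(token_probs, alpha=0.7):
    """Length-normalized log-likelihood score for one beam search candidate.

    token_probs: per-token probabilities P(y<t> | x, y<1..t-1>) (made-up here).
    alpha: 1.0 = fully normalize by length, 0.0 = no normalization.
    """
    log_sum = sum(math.log(p) for p in token_probs)
    ty = len(token_probs)
    return log_sum / (ty ** alpha)

short_cand = [0.2, 0.1]                    # 2-word candidate
long_cand = [0.3, 0.25, 0.2, 0.3, 0.25]    # 5-word candidate

# Unnormalized sums penalize the longer sentence simply for having more factors.
print(sum(math.log(p) for p in short_cand))   # about -3.9
print(sum(math.log(p) for p in long_cand))    # about -6.8

# With normalization, the longer candidate can come out ahead.
print(normalized_log_prob(short_cand))        # about -2.4
print(normalized_log_prob(long_cand))         # about -2.2  (now the better score)
```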
So just to wrap up how you run beam search: as you run beam search, you see a lot of sentences with length 1, a lot of sentences with length 2, a lot of sentences with length 3, and so on. Maybe you run beam search for 30 steps, so you consider output sentences up to length 30, let's say. With a beam width of 3, you will be keeping track of the top three possibilities for each of these possible sentence lengths: 1, 2, 3, and so on, up to 30. Then you look at all of these output sentences and score them with this score. So you take your top sentences and compute this objective function on the sentences you have seen through the beam search process. And then finally, of all the sentences you evaluate this way, you pick the one that achieves the highest value on this normalized log probability objective; sometimes it's called a normalized log-likelihood objective. And that would be the final translation you output.

So that's how you implement beam search, and you get to play with this yourself in this week's programming exercise.

Finally, a few implementational details: how do you choose the beam width B? The larger B is, the more possibilities you're considering, and the better the sentence you'll probably find. But the larger B is, the more computationally expensive your algorithm is, because you're also keeping a lot more possibilities around.

All right, so finally, let's wrap up with some thoughts on how to choose the beam width B. Here are the pros and cons of setting B to be very large versus very small. If the beam width is very large, then you consider a lot of possibilities, and so you tend to get a better result, because you're considering a lot of different options; but it will be slower, and the memory requirements will grow, so it will also be computationally more expensive. Whereas if you use a very small beam width, then you get a worse result, because you're keeping fewer possibilities in mind as the algorithm runs, but you get a result faster and the memory requirements will also be lower. So in the previous video, we used a beam width of three in our running example, so we were keeping three possibilities in mind. In practice, that is on the small side.
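Here is a hypothetical sketch of that final re-ranking step. The candidate words and probabilities are invented (loosely echoing the "Jane visits Africa" running example), and the B = 3 lists per length are just hard-coded; the selection rule, though, is the normalized log-likelihood objective described above.

```python
import math

def normalized_log_prob(token_probs, alpha=0.7):
    """Normalized log-likelihood objective, as in the previous sketch."""
    return sum(math.log(p) for p in token_probs) / (len(token_probs) ** alpha)

# Top-3 candidates kept by beam search for each output length (made-up numbers).
# Each entry is (words, per-token probabilities).
candidates = {
    1: [(["jane"], [0.20]), (["in"], [0.30]), (["she"], [0.10])],
    2: [(["jane", "visits"], [0.20, 0.60]),
        (["jane", "is"], [0.20, 0.30]),
        (["in", "september"], [0.30, 0.35])],
    3: [(["jane", "visits", "africa"], [0.20, 0.60, 0.70]),
        (["jane", "is", "visiting"], [0.20, 0.30, 0.25]),
        (["in", "september", "jane"], [0.30, 0.35, 0.10])],
}

# Final step: score every candidate seen during the search and keep the best one.
best_words, _ = max(
    (cand for cands in candidates.values() for cand in cands),
    key=lambda cand: normalized_log_prob(cand[1]),
)
print(" ".join(best_words))  # here: "jane visits africa"
```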
In production systems, it's not uncommon to see a beam width of maybe around 10, and I think a beam width of 100 would be considered very large for a production system, depending on the application. But for research systems, where people want to squeeze out every last drop of performance in order to publish a paper with the best possible result, it's not uncommon to see beam widths of 1,000 or 3,000; but this is very application-dependent and domain-dependent. So I would say try out a variety of values of B as you work through your application, but when B gets very large, there are often diminishing returns. For many applications, I would expect to see a huge gain as you go from a beam width of 1, which is essentially greedy search, to 3, to maybe 10, but the gains as you go from a beam width of 1,000 to 3,000 might not be as big.

And for those of you who have taken a lot of computer science courses before, if you're familiar with search algorithms like BFS, breadth-first search, or DFS, depth-first search: the way to think about beam search is that, unlike those algorithms, which are exact search algorithms, beam search runs much faster but is not guaranteed to find the exact maximum for this argmax that you would like to find. If you haven't heard of breadth-first search or depth-first search, don't worry about it; it's not important for our purposes. But if you have, this is how beam search relates to those algorithms.

So that's it for beam search, which is a widely used algorithm in many production systems, or in many commercial systems. Now, earlier in this sequence of deep learning courses, we talked a lot about error analysis. It turns out one of the most useful tools I've found is to be able to do error analysis on beam search. You sometimes wonder: should I increase my beam width? Is my beam width working well enough? And there are some simple things you can compute to give you guidance on whether you need to work on improving your search algorithm. Let's talk about that in the next video.