1
00:00:00,210 --> 00:00:05,570
In the third course of this sequence of
five courses, you saw how error analysis

2
00:00:05,570 --> 00:00:11,110
can help you focus your time on doing
the most useful work for your project.

3
00:00:11,110 --> 00:00:14,122
Now, beam search is
an approximate search algorithm,

4
00:00:14,122 --> 00:00:16,500
also called a heuristic search algorithm.

5
00:00:16,500 --> 00:00:20,439
And so it doesn't always output
the most likely sentence.

6
00:00:20,439 --> 00:00:26,140
It's only keeping track of B equals 3 or
10 or 100 top possibilities.

7
00:00:26,140 --> 00:00:29,180
So what if beam search makes a mistake?

8
00:00:29,180 --> 00:00:33,888
In this video, you'll learn how error
analysis interacts with beam search and

9
00:00:33,888 --> 00:00:38,457
how you can figure out whether it is
the beam search algorithm that's causing

10
00:00:38,457 --> 00:00:40,722
problems and worth spending time on.

11
00:00:40,722 --> 00:00:44,350
Or whether it might be your RNN
model that is causing problems and

12
00:00:44,350 --> 00:00:46,290
worth spending time on.

13
00:00:46,290 --> 00:00:50,580
Let's take a look at how to do
error analysis with beam search.

14
00:00:50,580 --> 00:00:56,550
Let's use this example of Jane
visite l'Afrique en septembre.

15
00:00:56,550 --> 00:01:00,110
So let's say that in your
machine translation dev set,

16
00:01:00,110 --> 00:01:04,202
your development set,
the human provided this translation,

17
00:01:04,202 --> 00:01:08,311
Jane visits Africa in September,
and I'm going to call this y*.

18
00:01:08,311 --> 00:01:11,975
So it is a pretty good
translation written by a human.

19
00:01:11,975 --> 00:01:16,367
Then let's say that when you run beam
search on your learned RNN model and

20
00:01:16,367 --> 00:01:20,545
your learned translation model,
it ends up with this translation,

21
00:01:20,545 --> 00:01:24,506
which we will call y-hat,
Jane visited Africa last September,

22
00:01:24,506 --> 00:01:28,262
which is a much worse translation
of the French sentence.

23
00:01:28,262 --> 00:01:32,682
It actually changes the meaning,
so it's not a good translation.

24
00:01:32,682 --> 00:01:35,981
Now, your model has two main components.

25
00:01:35,981 --> 00:01:40,833
There is a neural network model,
the sequence to sequence model.

26
00:01:40,833 --> 00:01:43,626
We shall just call this your RNN model.

27
00:01:43,626 --> 00:01:45,986
It's really an encoder and a decoder.

28
00:01:45,986 --> 00:01:49,268
And you have your beam search algorithm,

29
00:01:49,268 --> 00:01:52,934
which you're running
with some beam width B.

30
00:01:52,934 --> 00:01:56,408
And wouldn't it be nice if you
could attribute this error,

31
00:01:56,408 --> 00:02:00,240
this not very good translation,
to one of these two components?

32
00:02:00,240 --> 00:02:04,812
Was it the RNN or really the neural
network that is more to blame, or

33
00:02:04,812 --> 00:02:08,662
is it the beam search algorithm
that is more to blame?

34
00:02:08,662 --> 00:02:12,246
And what you saw in the third
course of the sequence is that

35
00:02:12,246 --> 00:02:17,170
it's always tempting to collect more
training data, since that never hurts.

36
00:02:17,170 --> 00:02:21,665
So in a similar way, it's always tempting
to increase the beam width, since that never

37
00:02:21,665 --> 00:02:23,700
hurts, or pretty much never hurts.

38
00:02:23,700 --> 00:02:28,890
But just as getting more
training data by itself might not

39
00:02:28,890 --> 00:02:31,470
get you to the level of
performance you want,

40
00:02:31,470 --> 00:02:32,490
in the same way,

41
00:02:32,490 --> 00:02:37,540
increasing the beam width by itself might
not get you to where you want to go.

42
00:02:38,570 --> 00:02:40,580
But how do you decide whether or

43
00:02:40,580 --> 00:02:44,325
not improving the search algorithm
is a good use of your time?

44
00:02:44,325 --> 00:02:46,917
So let's see how you can break
the problem down and

45
00:02:46,917 --> 00:02:50,012
figure out what's actually
a good use of your time.

46
00:02:50,012 --> 00:02:52,339
Now, the RNN, the neural network,

47
00:02:52,339 --> 00:02:56,263
what was called RNN really means
the encoder and the decoder.

48
00:02:56,263 --> 00:03:02,248
It computes P(y given x).

49
00:03:02,248 --> 00:03:07,268
So for example, for
a sentence, Jane visits Africa

50
00:03:07,268 --> 00:03:11,850
in September,
you plug in Jane visits Africa.

51
00:03:11,850 --> 00:03:15,741
Again, I'm ignoring upper versus
lowercase now, right, and so on.

52
00:03:15,741 --> 00:03:18,810
And this computes P(y given x).

53
00:03:18,810 --> 00:03:22,965
So it turns out that
the most useful thing for

54
00:03:22,965 --> 00:03:28,804
you to do at this point is to
use this model to compute

55
00:03:28,804 --> 00:03:36,840
P(y* given x) as well as to compute
P(y-hat given x) using your RNN model.

56
00:03:36,840 --> 00:03:39,570
And then to see which
of these two is bigger.

57
00:03:39,570 --> 00:03:43,120
So it's possible that the left side
is bigger than the right hand side.

58
00:03:43,120 --> 00:03:47,744
It's also possible that P(y*) is less
than P(y-hat) actually, or less than or

59
00:03:47,744 --> 00:03:48,776
equal to, right?

60
00:03:48,776 --> 00:03:53,481
Depending on which of these two cases
holds true, you'd be able to more

61
00:03:53,481 --> 00:03:58,511
clearly ascribe this particular error,
this particular bad translation

62
00:03:58,511 --> 00:04:04,010
to either the RNN or the beam search
algorithm being at greater fault.

63
00:04:04,010 --> 00:04:07,252
So let's work out the logic behind this.

64
00:04:07,252 --> 00:04:09,550
Here are the two sentences
from the previous slide.

65
00:04:09,550 --> 00:04:14,460
And remember,
we're going to compute P(y* given x) and

66
00:04:14,460 --> 00:04:19,870
P(y-hat given x) and
see which of these two is bigger.

67
00:04:19,870 --> 00:04:21,196
So there are going to be two cases.

68
00:04:21,196 --> 00:04:26,184
In case 1,
P(y* given x) as output by the RNN

69
00:04:26,184 --> 00:04:31,360
model is greater than P(y-hat given x).

70
00:04:31,360 --> 00:04:32,140
What does this mean?

71
00:04:32,140 --> 00:04:37,778
Well, the beam search
algorithm chose y-hat, right?

72
00:04:37,778 --> 00:04:44,194
The way you got y-hat was you had
an RNN that was computing P(y given x).

73
00:04:44,194 --> 00:04:50,390
And beam search's job was to try to find
a value of y that gives that arg max.

74
00:04:51,500 --> 00:04:57,389
But in this case,
y* actually attains a higher value for

75
00:04:57,389 --> 00:05:00,720
P(y given x) than y-hat.

76
00:05:00,720 --> 00:05:05,340
So what this allows you to conclude is
beam search is failing to actually give

77
00:05:05,340 --> 00:05:10,980
you the value of y that maximizes
P(y given x) because the one

78
00:05:10,980 --> 00:05:15,850
job that beam search had was to find
the value of y that makes this really big.

79
00:05:15,850 --> 00:05:19,660
But it chose y-hat,
while y* actually gets a much bigger value.

80
00:05:19,660 --> 00:05:23,601
So in this case, you could conclude
that beam search is at fault.

81
00:05:24,750 --> 00:05:26,219
Now, how about the other case?

82
00:05:26,219 --> 00:05:30,438
In case 2, P(y* given x) is less than or

83
00:05:30,438 --> 00:05:34,319
equal to P(y-hat given x), right?

84
00:05:34,319 --> 00:05:37,410
And then either this or
this has gotta be true.

85
00:05:37,410 --> 00:05:40,530
So either case 1 or
case 2 has to hold true.

86
00:05:40,530 --> 00:05:43,820
What do you conclude under case 2?

87
00:05:43,820 --> 00:05:47,095
Well, in our example,

88
00:05:47,095 --> 00:05:51,490
y* is a better translation than y-hat.

89
00:05:51,490 --> 00:05:57,081
But according to the RNN,
P(y*) is less than P(y-hat),

90
00:05:57,081 --> 00:06:02,670
so it's saying that y* is a less
likely output than y-hat.

91
00:06:02,670 --> 00:06:07,810
So in this case,
it seems that the RNN model is

92
00:06:07,810 --> 00:06:12,440
at fault and it might be worth
spending more time working on the RNN.

93
00:06:13,510 --> 00:06:16,310
There are some subtleties here pertaining to

94
00:06:16,310 --> 00:06:18,970
length normalization
that I'm glossing over.

97
00:06:24,520 --> 00:06:28,330
And if you are using some
sort of length normalization,

98
00:06:28,330 --> 00:06:32,880
instead of evaluating these probabilities,
you should be evaluating the optimization

99
00:06:32,880 --> 00:06:36,620
objective that takes into
account length normalization.
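As a rough illustration of what evaluating a length-normalized objective could look like, here is a minimal Python sketch. It assumes the normalization divides the summed log-probability by the output length raised to a power alpha. That exact form, the function name normalized_score, and the default alpha of 0.7 are assumptions for illustration and are not stated in this video. The input is a hypothetical list of per-word probabilities from the RNN.

```python
import math

def normalized_score(token_probs, alpha=0.7):
    """Length-normalized log-likelihood of one output sentence.

    token_probs: the model's probability for each output word,
                 P(y_t | x, y_1..y_{t-1}), in order (hypothetical input).
    alpha:       normalization strength (assumed value; alpha=0 means no
                 normalization, alpha=1 means a full per-word average).
    """
    total_log_prob = sum(math.log(p) for p in token_probs)
    return total_log_prob / (len(token_probs) ** alpha)
```

With a score like this, you would compare normalized_score for y* against normalized_score for y-hat, rather than comparing the raw probabilities.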

100
00:06:36,620 --> 00:06:41,048
But ignoring that complication for now,
in this case, what this tells you is that

101
00:06:41,048 --> 00:06:46,830
even though y* is a better translation,

102
00:06:46,830 --> 00:06:53,142
the RNN ascribed y* a lower probability
than the inferior translation.

103
00:06:53,142 --> 00:06:57,570
So in this case,
I will say the RNN model is at fault.
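The two cases can be written as a small Python sketch. The function name attribute_error and the returned strings are made up for illustration. The inputs are the two scores your RNN model assigns to y* and y-hat (probabilities or log-probabilities both work, since only the comparison matters).

```python
def attribute_error(score_y_star, score_y_hat):
    """Attribute one bad translation to beam search or to the RNN model.

    score_y_star: model score for the human translation, e.g. P(y* | x)
    score_y_hat:  model score for the beam search output, e.g. P(y-hat | x)
    """
    if score_y_star > score_y_hat:
        # Case 1: the model prefers y*, yet beam search returned y-hat,
        # so the search failed to find the higher-scoring sentence.
        return "beam search"
    # Case 2: the model scores the worse translation at least as high
    # as y*, so the model itself is at fault.
    return "RNN model"
```

Note that this only makes sense for examples where y* really is the better translation, which is the situation described here.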

104
00:06:57,570 --> 00:07:01,460
So the error analysis
process looks as follows.

105
00:07:01,460 --> 00:07:03,572
You go through the development set and

106
00:07:03,572 --> 00:07:07,270
find the mistakes that the algorithm
made in the development set.

107
00:07:08,300 --> 00:07:16,103
And so in this example, let's say that
P(y* given x) was 2 x 10 to the -10,

108
00:07:16,103 --> 00:07:21,610
whereas P(y-hat given x)
was 1 x 10 to the -10.

109
00:07:21,610 --> 00:07:26,746
Using the logic from the previous slide,
in this case, we see that

110
00:07:26,746 --> 00:07:32,725
beam search actually chose y-hat,
which has a lower probability than y*.

111
00:07:32,725 --> 00:07:35,220
So I will say beam search is at fault.

112
00:07:35,220 --> 00:07:36,348
So I'll abbreviate that B.

113
00:07:36,348 --> 00:07:39,440
And then you go through
a second mistake or

114
00:07:39,440 --> 00:07:43,390
second bad output by the algorithm,
look at these probabilities.

115
00:07:43,390 --> 00:07:47,150
And maybe for the second example,
you think the model is at fault.

116
00:07:47,150 --> 00:07:50,045
I'm going to abbreviate
the RNN model with R.

117
00:07:50,045 --> 00:07:52,253
And you go through more examples.

118
00:07:52,253 --> 00:07:57,037
And sometimes the beam search is at fault,
sometimes the model is at fault,

119
00:07:57,037 --> 00:07:57,720
and so on.

120
00:07:58,760 --> 00:08:04,530
And through this process, you can then
carry out error analysis to figure out

121
00:08:04,530 --> 00:08:10,560
what fraction of errors are due to
beam search versus the RNN model.
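Here is a minimal Python sketch of that tally, using B for beam search and R for the RNN model. The function name fault_fractions is hypothetical. The first pair of probabilities is the one from this example, and the other three pairs are made up purely to show the bookkeeping.

```python
from collections import Counter

def fault_fractions(mistakes):
    """Fraction of dev-set mistakes attributed to beam search (B) vs. the RNN (R).

    mistakes: list of (p_y_star, p_y_hat) pairs, one per bad translation,
              where both values are computed by the RNN model.
    """
    labels = ["B" if p_star > p_hat else "R" for p_star, p_hat in mistakes]
    counts = Counter(labels)
    total = len(labels)
    if total == 0:
        return {"B": 0.0, "R": 0.0}
    return {label: counts[label] / total for label in ("B", "R")}

# First pair is from the lecture example; the rest are invented values.
mistakes = [(2e-10, 1e-10), (3e-12, 5e-11), (1e-9, 4e-9), (6e-8, 2e-8)]
print(fault_fractions(mistakes))
```

If B dominates, working on the search is more likely to pay off. If R dominates, the model deserves the attention.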

122
00:08:10,560 --> 00:08:16,660
And with an error analysis process like
this, for every example in your dev set,

123
00:08:16,660 --> 00:08:23,450
where the algorithm gives a much worse
output than the human translation,

124
00:08:23,450 --> 00:08:28,230
you can try to ascribe the error
to either the search algorithm or

125
00:08:28,230 --> 00:08:32,350
to the objective function, that is,
to the RNN model that generates

126
00:08:32,350 --> 00:08:37,240
the objective function that beam
search is supposed to be maximizing.

127
00:08:37,240 --> 00:08:41,340
And through this, you can try to figure
out which of these two components is

128
00:08:41,340 --> 00:08:43,470
responsible for more errors.

129
00:08:43,470 --> 00:08:46,940
And only if you find that beam search
is responsible for a lot of errors,

130
00:08:46,940 --> 00:08:51,300
then maybe it's worth working hard
to increase the beam width.

131
00:08:51,300 --> 00:08:55,080
Whereas in contrast, if you find
that the RNN model is at fault,

132
00:08:55,080 --> 00:08:59,350
then you could do a deeper layer of
analysis to try to figure out if you want

133
00:08:59,350 --> 00:09:02,571
to add regularization, or
get more training data, or

134
00:09:02,571 --> 00:09:06,164
try a different network architecture,
or something else.

135
00:09:06,164 --> 00:09:10,597
And so a lot of the techniques that
you saw in the third course in

136
00:09:10,597 --> 00:09:13,505
the sequence will be applicable there.

137
00:09:13,505 --> 00:09:17,740
So that's it for
error analysis using beam search.

138
00:09:17,740 --> 00:09:22,170
I found this particular error analysis
process very useful whenever you have

139
00:09:22,170 --> 00:09:25,870
an approximate optimization algorithm,
such as beam search

140
00:09:25,870 --> 00:09:29,690
that is working to optimize some sort of
objective, some sort of cost function

141
00:09:29,690 --> 00:09:33,120
that is output by a learning algorithm,
such as a sequence-to-sequence model or

142
00:09:33,120 --> 00:09:37,110
a sequence-to-sequence RNN that we've
been discussing in these lectures.

143
00:09:37,110 --> 00:09:41,840
So with that, I hope that you'll be more
efficient at making these types of models

144
00:09:41,840 --> 00:09:43,170
work well for your applications.