You've seen how having a dev set and an evaluation metric is like placing a target somewhere for your team to aim at. But sometimes, partway through a project, you might realize you put your target in the wrong place. In that case you should move your target. Let's take a look at an example.

Let's say you build a cat classifier to try to find lots of pictures of cats to show to your cat-loving users, and the metric you decided to use is classification error. Algorithms A and B have, respectively, 3 percent error and 5 percent error, so it seems like Algorithm A is doing better. But when you try out these algorithms, you find that Algorithm A, for some reason, is letting through a lot of pornographic images. So if you ship Algorithm A, your users would see more cat images, because it has only 3 percent error at identifying cats, but it would also show them some pornographic images, which is totally unacceptable both for your company and for your users. In contrast, Algorithm B has 5 percent error, so it classifies fewer images correctly, but it doesn't let through pornographic images. So from your company's point of view, as well as from a user-acceptance point of view, Algorithm B is actually the much better algorithm, because it's not letting through any pornographic images.

What has happened in this example is that Algorithm A is doing better on the evaluation metric, getting 3 percent error, but it is actually the worse algorithm. The evaluation metric plus the dev set prefer Algorithm A, because they're saying, look, Algorithm A has lower error on the metric you're using. But you and your users prefer Algorithm B, because it's not letting through pornographic images. So when this happens, when your evaluation metric is no longer correctly rank-ordering your preferences between algorithms, in this case mispredicting that Algorithm A is the better algorithm, that's a sign that you should change your evaluation metric, or perhaps your development set or test set.
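As a minimal sketch of the decision the metric is making here (hypothetical function name; NumPy assumed), picking between two classifiers by dev-set classification error looks like this:

```python
import numpy as np

def classification_error(y_true, y_pred):
    """Fraction of dev-set examples a classifier mislabels."""
    return float(np.mean(np.asarray(y_pred) != np.asarray(y_true)))

# The metric alone prefers whichever algorithm has lower dev-set error,
# e.g. A at 3% over B at 5%, regardless of *which* images it gets wrong.
```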
In this case, the misclassification error metric that you're using can be written as follows:

Error = \frac{1}{m_{dev}} \sum_{i=1}^{m_{dev}} \mathbb{1}\{\hat{y}^{(i)} \neq y^{(i)}\}

where m_{dev} is the number of examples in your development set, \hat{y}^{(i)} denotes the predicted value for example i, and \mathbb{1}\{\cdot\} is the indicator function, which counts up the number of examples on which the condition inside it is true. So this formula just counts up the number of misclassified examples.

The problem with this evaluation metric is that it treats pornographic and non-pornographic images equally, but you really want your classifier to not mislabel pornographic images. If it misrecognizes a pornographic image as a cat image and therefore shows it to an unsuspecting user, that user will be very unhappy at unexpectedly seeing porn. One way to change this evaluation metric is to add a weight term w^{(i)}, where w^{(i)} is equal to 1 if x^{(i)} is non-pornographic, and maybe 10, or maybe even a large number like 100, if x^{(i)} is pornographic:

Error = \frac{1}{\sum_i w^{(i)}} \sum_{i=1}^{m_{dev}} w^{(i)} \, \mathbb{1}\{\hat{y}^{(i)} \neq y^{(i)}\}

This way you're giving a much larger weight to examples that are pornographic, so the error term goes up much more if the algorithm makes a mistake classifying a pornographic image as a cat image. In this example, you're giving a 10-times-bigger weight to classifying pornographic images correctly. If you want the error to stay normalized, the normalization constant technically becomes the sum over i of w^{(i)}, so the error is still between zero and one.

The details of this weighting aren't important, but note that to implement it you actually need to go through your dev and test sets and label the pornographic images, so that you can implement this weighting function. The high-level takeaway is: if you find that your evaluation metric is not giving the correct rank-order preference for what is actually the better algorithm, then it's time to think about defining a new evaluation metric. And this is just one possible way that you could define an evaluation metric.
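Here is a minimal sketch of that weighted error (hypothetical names; NumPy assumed; the porn/non-porn labels would have to come from the extra labeling pass just described):

```python
import numpy as np

def weighted_error(y_true, y_pred, is_porn, porn_weight=10.0):
    """Weighted misclassification error: a mistake on a pornographic
    image counts porn_weight times as much as any other mistake."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # w^(i) = 1 for non-porn examples, porn_weight for porn examples
    w = np.where(np.asarray(is_porn), porn_weight, 1.0)
    mistakes = (y_pred != y_true).astype(float)
    # Normalize by the sum of weights so the error stays in [0, 1]
    return float(np.sum(w * mistakes) / np.sum(w))
```

With porn_weight=10, a classifier that makes its few mistakes on pornographic images can end up with a worse weighted error than one that makes more mistakes, all on harmless images, which is exactly the rank ordering you want here.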
The goal of the evaluation metric is to accurately tell you, given two classifiers, which one is better for your application. For the purpose of this video, don't worry too much about the details of how we defined the new error metric. The point is that if you're not satisfied with your old error metric, don't keep coasting with an error metric you're unsatisfied with; instead, try to define a new one that you think better captures your preferences in terms of what's actually a better algorithm.

One thing you might notice is that so far we've only talked about how to define a metric to evaluate classifiers. That is, we've defined an evaluation metric that helps us better rank-order classifiers when they perform at varying levels in terms of screening out porn. And this is actually an example of orthogonalization, where I think you should take a machine learning problem and break it into distinct steps. One step is to figure out how to define a metric that captures what you want to do, and I would worry separately about how to actually do well on this metric. So think of the machine learning task as two distinct steps. To use the target analogy, the first step is to place the target, that is, define where you want to aim. Then, as a completely separate step, one you can tune separately, figure out how to aim accurately and how to shoot at the target. Defining the metric is step one, and you do something else for step two.

In terms of shooting at the target, maybe your learning algorithm is optimizing some cost function that looks like this, where you are minimizing a sum of losses on your training set:

J = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})

One thing you could do is also modify this cost function to incorporate these weights, and maybe change the normalization constant as well, so it becomes 1 over the sum of w^{(i)}:

J = \frac{1}{\sum_i w^{(i)}} \sum_{i=1}^{m} w^{(i)} \, \mathcal{L}(\hat{y}^{(i)}, y^{(i)})

Again, the details of how you define J aren't important, but the point is, with the philosophy of orthogonalization, think of placing the target as one step, and aiming and shooting at the target as a distinct step that you do separately.
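To make that second step concrete, here's a hedged sketch (hypothetical function name; plain NumPy; binary cross-entropy chosen as an example loss) of a per-example-weighted cost of the kind just described:

```python
import numpy as np

def weighted_cost(y_true, y_prob, weights, eps=1e-12):
    """Weighted cost J = (1 / sum_i w_i) * sum_i w_i * L(y_hat_i, y_i),
    using binary cross-entropy as the per-example loss L."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 to keep the logs finite
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    w = np.asarray(weights, dtype=float)
    # Standard per-example log loss
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    # Weighted average, normalized by the sum of weights instead of m
    return float(np.sum(w * losses) / np.sum(w))
```

The design choice mirrors the metric change: mistakes on heavily weighted (e.g., pornographic) examples dominate the cost, so training is pushed toward the same preferences the new evaluation metric expresses.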
In other words, I encourage you to think of defining the metric as one step, and only after you define the metric, figure out how to do well on that metric, which might mean changing the cost function J that your neural network is optimizing.

Before going on, let's look at just one more example. Let's say that your two cat classifiers A and B have, respectively, 3 percent error and 5 percent error as evaluated on your dev set, or maybe even on your test set, which consists of images downloaded off the Internet, so high-quality, well-framed images. But maybe when you deploy your algorithm in a product, you find that Algorithm B actually looks like it's performing better, even though A is doing better on your dev set. You find that you've been training on very nice, high-quality images downloaded off the Internet, but when you deploy in the mobile app, users are uploading all sorts of pictures: they're much less well framed, the cat isn't centered, the cats have funny facial expressions, maybe the images are much blurrier. And when you test out your algorithms on that data, you find that Algorithm B is actually doing better.

So this would be another example of your metric and dev/test sets falling down. The problem is that you're evaluating on dev and test sets of very nice, high-resolution, well-framed images, but what your users really care about is how well the algorithm does on the images they are uploading, which are maybe less professional shots, blurrier, and less well framed. So the guideline is: if doing well on your metric and your current dev set's, or dev and test sets', distribution does not correspond to doing well on the application you actually care about, then change your metric and/or your dev/test set. In other words, if you discover that your dev/test set contains these very high-quality images, but evaluating on it is not predictive of how well your app actually performs, because your app needs to deal with lower-quality images, then that's a good time to change your dev/test set, so that your data better reflects the type of data you actually need to do well on.
But the overall guideline is: if your current metric and the data you are evaluating on don't correspond to doing well on what you actually care about, then change your metric and/or your dev/test set to better capture what you need your algorithm to actually do well on. Having an evaluation metric and a dev set allows you to make much quicker decisions about whether Algorithm A or Algorithm B is better, and it really speeds up how quickly you and your team can iterate.

So my recommendation is, even if you can't define the perfect evaluation metric and dev set, just set something up quickly and use that to drive the speed of your team's iteration. If, later down the line, you find out that it wasn't a good one and you have a better idea, change it at that time; it's perfectly okay. But what I recommend against, for most teams, is running for too long without any evaluation metric and dev set, because that can slow down the efficiency with which your team can iterate and improve your algorithm.

So that's it on when to change your evaluation metric and/or dev and test sets. I hope that these guidelines help you set up your whole team to have a well-defined target that you can iterate efficiently towards, improving performance.