In the last video, we talked about precision and recall as an evaluation metric for classification problems with skewed classes. For many applications, we'll want to somehow control the trade-off between precision and recall. Let me tell you how to do that, and also show you some even more effective ways to use precision and recall as an evaluation metric for learning algorithms.

As a reminder, here are the definitions of precision and recall from the previous video.

Let's continue our cancer classification example, where y equals one if the patient has cancer and y equals zero otherwise. And let's say we've trained a logistic regression classifier, which outputs probabilities between zero and one. So, as usual, we're going to predict y equals one if h(x) is greater than or equal to 0.5, and predict zero if the hypothesis outputs a value less than 0.5. This classifier will give us some value for precision and some value for recall.

But now, suppose we want to predict that a patient has cancer only if we're very confident that they really do. Because if you go to a patient and tell them that they have cancer, it's going to give them a huge shock, since this is seriously bad news, and they may end up going through a pretty painful treatment process. So maybe we want to tell someone that we think they have cancer only if we're very confident.

One way to do this would be to modify the algorithm so that, instead of setting the threshold at 0.5, we might instead say that we'll predict y equals one only if h(x) is greater than or equal to 0.7.
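As a minimal sketch of this idea (my own illustration, not from the lecture, using made-up toy data), here is how thresholding the predicted probabilities and computing precision and recall might look in Python:

```python
import numpy as np

def precision_recall(y_true, probs, threshold):
    """Compute precision and recall for a given decision threshold.

    y_true: array of 0/1 labels; probs: predicted probabilities h(x).
    """
    preds = (probs >= threshold).astype(int)   # predict y = 1 only when h(x) >= threshold
    tp = np.sum((preds == 1) & (y_true == 1))  # true positives
    fp = np.sum((preds == 1) & (y_true == 0))  # false positives
    fn = np.sum((preds == 0) & (y_true == 1))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Toy data (illustrative only): raising the threshold trades recall for precision.
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0])
probs = np.array([0.9, 0.6, 0.2, 0.65, 0.55, 0.4, 0.1, 0.3])
print(precision_recall(y_true, probs, 0.5))  # precision 0.75, recall 0.75
print(precision_recall(y_true, probs, 0.7))  # precision 1.00, recall 0.25
```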
With a threshold of 0.7, we tell someone they have cancer only if we think there's a greater than or equal to 70% chance that they have cancer. And if you do this, then you're predicting that someone has cancer only when you're more confident, and so you end up with a classifier that has higher precision, because all the patients that you go to and say, we think you have cancer, all of those patients can now be pretty confident that they actually have cancer. And so a higher fraction of the patients that you predict to have cancer will actually turn out to have cancer, because in making those predictions we are pretty confident.

But in contrast, this classifier will have lower recall, because now we are going to predict y equals one on a smaller number of patients. We could even take this further: instead of setting the threshold at 0.7, we could set it at 0.9, and predict y equals one only if we are more than 90% certain that the patient has cancer. Then a large fraction of those patients will turn out to have cancer, so this is a high-precision classifier, but it will have lower recall, because we will fail to detect some of the patients who actually do have cancer.

Now consider a different example. Suppose we want to avoid missing too many actual cases of cancer; that is, we want to avoid false negatives. In particular, if a patient actually has cancer, but we fail to tell them that they have cancer, that can be really bad.
Because if we tell a patient that they don't have cancer, then they are not going to go for treatment, and if it turns out that they do have cancer and we failed to tell them so, they may not get treated at all. That would be a really bad outcome, because the patient dies, since we told them they don't have cancer and they failed to get treated, when it turns out that they actually did have it. So when in doubt, we want to predict y equals one. When in doubt, we want to predict that they have cancer, so that at least they look further into it and can get treated, in case they do turn out to have cancer.

In this case, rather than setting a higher probability threshold, we might instead take this value and set it to a lower value, maybe 0.3. By doing so, we're saying that if we think there's more than a 30% chance that they have cancer, we'd better be conservative and tell them that they may have cancer, so they can seek treatment if necessary.

In this case, what we would have is a higher-recall classifier, because we're going to correctly flag a higher fraction of all the patients that actually do have cancer, but we're going to end up with lower precision, because a higher fraction of the patients that we said have cancer will turn out not to have cancer after all.

And by the way, just as an aside: when I've talked about this with students before, it's pretty amazing; some of my students ask how I can tell the story both ways.
That is, why might we want higher precision, or higher recall? The story really does seem to work both ways. But I hope the general point is clear, and the more general principle is this: depending on whether you want higher precision and lower recall, or higher recall and lower precision, you can end up predicting y equals one when h(x) is greater than some threshold.

And so, in general, for most classifiers there is going to be a trade-off between precision and recall. As you vary the value of this threshold that I've drawn here, you can actually plot a curve that trades off precision and recall. A value up here corresponds to a very high value of the threshold, maybe a threshold of 0.99, that is, we predict y equals one only when we are at least 99 percent confident, so that will be a high-precision, relatively low-recall classifier. Whereas a point down here corresponds to a value of the threshold that's much lower, maybe 0.01, meaning: when in doubt at all, predict y equals one. If you do that, you end up with a much lower-precision, higher-recall classifier.

As you vary the threshold, if you want, you can actually trace out a curve for your classifier to see the range of different values you can get for precision and recall. And by the way, the precision-recall curve can look like many different shapes: sometimes it will look like this, sometimes like that. There are many possible shapes for the precision-recall curve, depending on the details of the classifier.
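As a minimal sketch of tracing out such a curve (again my own illustration, reusing the hypothetical precision_recall helper and the toy y_true and probs arrays from the earlier sketch), you could sweep the threshold and record the resulting precision-recall pairs:

```python
import numpy as np

# Sweep a range of thresholds and record (precision, recall) at each one,
# reusing the precision_recall helper and toy data sketched earlier.
thresholds = np.linspace(0.01, 0.99, 99)
curve = [precision_recall(y_true, probs, t) for t in thresholds]

# Print every tenth point; plotting these pairs would give the precision-recall curve.
for t, (p, r) in zip(thresholds[::10], curve[::10]):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```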
So this raises another interesting question: is there a way to choose this threshold automatically? Or, more generally, if we have a few different algorithms, or a few different ideas for algorithms, how do we compare different precision-recall numbers?

Concretely, suppose we have three different learning algorithms. Or actually, maybe these are three different learning algorithms, or maybe they're the same algorithm but with different values for the threshold. How do we decide which of these is best?

One of the things we talked about earlier is the importance of a single real-number evaluation metric: the idea of having one number that just tells you how well your classifier is doing. But by switching to the precision-recall metric, we've actually lost that; we now have two real numbers. So we often end up facing situations like this: if we're trying to compare algorithm 1 to algorithm 2, we end up asking ourselves, is a precision of 0.5 and a recall of 0.4 better or worse than a precision of 0.7 and a recall of 0.1? And if every time you try out a new algorithm you end up having to sit around and think, well, maybe 0.5 and 0.4 is better than 0.7 and 0.1, maybe not, I don't know, then that really slows down your decision-making process for which changes are useful to incorporate into your algorithm.

Whereas, in contrast, if we had a single real-number evaluation metric, a number that just tells us whether algorithm 1 or algorithm 2 is better, that helps us much more quickly decide which algorithm to go with, and helps us much more quickly evaluate different changes that we may be contemplating for an algorithm. So, how can we get a single real-number evaluation metric?
One natural thing that you might try is to look at the average of precision and recall. So, using P and R to denote precision and recall, what you could do is just compute the average, (P + R) / 2, and see which classifier has the highest average value.

But this turns out not to be such a good solution, because, similar to the example we had earlier, if we have a classifier that predicts y equals one all the time, then we can get a very high recall but end up with a very low value of precision. Conversely, if we have a classifier that predicts y equals zero almost all the time, that is, it predicts y equals one very sparingly (this corresponds to setting a very high threshold, using the notation of the previous slide), then we can end up with very high precision but very low recall.

So neither of the two extremes, a very high threshold or a very low threshold, gives a particularly good classifier, and we recognize that by seeing that we end up with either a very low precision or a very low recall. But if you just take the average of precision and recall, then in this example the average is actually highest for algorithm 3, even though you can get that sort of performance simply by predicting y equals one all the time. And that is just not a very good classifier; predicting y equals one all the time is not useful if all it does is print out y equals one. So algorithm 1 or algorithm 2 would be more useful than algorithm 3, but in this example algorithm 3 has a higher average of precision and recall than algorithms 1 and 2. So we usually think of this average of precision and recall as not a particularly good way to evaluate our learning algorithm.
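To make this concrete, here is a small sketch with illustrative numbers: the first two rows use the precision-recall pairs mentioned above, and the third row is a hypothetical stand-in for a classifier that predicts y equals one all the time on a skewed dataset.

```python
# (precision, recall) for three hypothetical classifiers.
algorithms = {
    "algorithm 1": (0.5, 0.4),
    "algorithm 2": (0.7, 0.1),
    "algorithm 3": (0.02, 1.0),  # e.g. "predict y = 1 always" on skewed data
}

for name, (p, r) in algorithms.items():
    print(f"{name}: average = {(p + r) / 2:.3f}")
# The degenerate algorithm 3 gets the highest average (0.51),
# which is why the plain average is a poor evaluation metric.
```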
In contrast, there is a different way of combining precision and recall. It is called the F score, and it uses this formula: F = 2 * (P * R) / (P + R). So, in this example, here are the F scores, and from these F scores we would say that algorithm 1 has the highest F score, algorithm 2 the second highest, and algorithm 3 the lowest. So if we go by the F score, we would probably pick algorithm 1 over the others.

The F score, which is also called the F1 score, is usually written "F1 score" as I have here, but often people will just say F score. Its definition is a little bit like taking the average of precision and recall, but it gives the lower of precision and recall, whichever it is, a higher weight. You can see in the numerator that the F score takes the product of precision and recall, and so if either precision is 0 or recall is 0, the F score will be equal to 0. In that sense it combines precision and recall, but for the F score to be large, both precision and recall have to be pretty large.

I should say that there are many different possible formulas for combining precision and recall. This F score formula is really just one out of a much larger number of possibilities, but historically or traditionally, this is what people in machine learning use. And the term F score doesn't really mean anything, so don't worry about why it's called the F score or the F1 score. But it usually gives you the effect that you want, because if either precision is 0 or recall is 0, it gives you a very low F score.
And so, to have a high F score, you need both precision and recall to be reasonably close to one. Concretely, if P equals zero or R equals zero, then the F score equals zero. Whereas for a perfect F score, with precision equal to one and recall equal to one, you get 2 * (1 * 1) / (1 + 1), which is equal to one. So the F score will be equal to one if you have perfect precision and perfect recall. And for intermediate values between 0 and 1, this usually gives a reasonable rank ordering of different classifiers.

So in this video, we talked about the notion of trading off between precision and recall, and how we can vary the threshold that we use to decide whether to predict y equals one or y equals zero: the threshold that says whether we need to be at least seventy percent confident, or ninety percent confident, or whatever, before we predict y equals one. By varying the threshold, you can control the trade-off between precision and recall. We also talked about the F score, which takes precision and recall and gives you a single real-number evaluation metric. And of course, if your goal is to automatically set that threshold for deciding between y equals one and y equals zero, one pretty reasonable way to do that would be to try a range of different values of the threshold, evaluate these different thresholds on, say, your cross-validation set, and then pick whatever value of the threshold gives you the highest F score on your cross-validation set. That would be a pretty reasonable way to automatically choose the threshold for your classifier as well.
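Here is a minimal sketch of that last idea (my own illustration, assuming you have labels y_cv and predicted probabilities probs_cv for a cross-validation set, and reusing the hypothetical precision_recall helper from the earlier sketch):

```python
import numpy as np

def f1(p, r):
    """F score: 2PR / (P + R), defined as 0 when both precision and recall are 0."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def pick_threshold(y_cv, probs_cv, candidates=np.linspace(0.01, 0.99, 99)):
    """Return the candidate threshold with the highest F score on the CV set."""
    scores = []
    for t in candidates:
        p, r = precision_recall(y_cv, probs_cv, t)  # helper sketched earlier
        scores.append(f1(p, r))
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Usage (with the toy arrays from the first sketch standing in for a CV set):
best_threshold, best_f1 = pick_threshold(y_true, probs)
print(f"best threshold = {best_threshold:.2f}, F score = {best_f1:.2f}")
```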