In the previous video, I talked about error analysis and the importance of error metrics, that is, of having a single real-number evaluation metric for your learning algorithm that tells you how well it's doing. In the context of evaluation and of error metrics, there is one important case where it's particularly tricky to come up with an appropriate error metric, or evaluation metric, for your learning algorithm. That is the case of what's called skewed classes. Let me tell you what that means.

Consider the problem of cancer classification, where we have features of medical patients and we want to decide whether or not they have cancer. This is like the malignant versus benign tumor classification example that we had earlier. Let's say y equals 1 if the patient has cancer and y equals 0 if they do not. We have trained a logistic regression classifier, and let's say we test it on a test set and find that we get 1 percent error, so we're making correct diagnoses 99 percent of the time. That seems like a really impressive result, right? We're correct 99 percent of the time.

But now let's say we find out that only 0.5 percent of the patients in our training and test sets actually have cancer, so only half a percent of the patients that come through our screening process have cancer. In this case, the 1 percent error no longer looks so impressive. In particular, here's a piece of code, actually a piece of non-learning code, that takes the input features x and ignores them. It just sets y equals 0 and always predicts that nobody has cancer, and this "algorithm" would actually get 0.5 percent error.
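As a concrete illustration, here is a minimal sketch in Python of the kind of non-learning baseline described above; the function name and the 0.5 percent figure are just the ones used in this example, not part of any library.

```python
def predict_cancer(x):
    """A 'classifier' that ignores its input features entirely."""
    return 0  # always predict y = 0: "no cancer"

# If only 0.5% of patients in the test set actually have cancer,
# this non-learning code is wrong on exactly those patients,
# so its test error is 0.5% -- lower than the 1% error of the
# trained classifier, even though it has learned nothing.
```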
So this is even better than the 1 percent error that we were getting just now, and it is a non-learning algorithm that is just predicting y equals 0 all the time. This setting, where the ratio of positive to negative examples is very close to one of two extremes (in this case, the number of positive examples is much, much smaller than the number of negative examples, because y equals 1 so rarely), is what we call the case of skewed classes. We just have a lot more examples from one class than from the other class, and by just predicting y equals 0 all the time, or maybe predicting y equals 1 all the time, an algorithm can do pretty well.

So the problem with using classification error or classification accuracy as our evaluation metric is the following. Let's say you have one learning algorithm that's getting 99.2% accuracy, so that's 0.8% error. Let's say you make a change to your algorithm and you are now getting 99.5% accuracy, that is, 0.5% error. Is this an improvement to the algorithm or not? One of the nice things about having a single real-number evaluation metric is that it helps us quickly decide whether we just made a useful change to the algorithm or not. By going from 99.2% accuracy to 99.5% accuracy, did we just do something useful, or did we just replace our code with something that predicts y equals zero more often?
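To make that arithmetic concrete, here is a quick sketch with made-up numbers (a hypothetical test set of 1,000 patients, 5 of whom have cancer) showing how far the always-negative predictor gets on accuracy alone.

```python
# Hypothetical test set: 1,000 patients, 0.5% of whom (5) actually have cancer.
n_total, n_positive = 1000, 5

# An "always predict y = 0" baseline is wrong only on the 5 positive patients.
accuracy = (n_total - n_positive) / n_total
error = 1 - accuracy

print(f"accuracy = {accuracy:.1%}, error = {error:.1%}")
# accuracy = 99.5%, error = 0.5% -- the same 99.5% as the "improved" algorithm,
# achieved without learning anything at all.
```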
So if you have very skewed classes, it becomes much harder to use just classification accuracy, because you can get very high classification accuracies, or very low errors, and it's not always clear whether doing so is really improving the quality of your classifier. Predicting y equals 0 all the time doesn't seem like a particularly good classifier, but just predicting y equals 0 more often can bring your error down to, you know, maybe as low as 0.5%. When we're faced with such skewed classes, therefore, we would want to come up with a different error metric, or a different evaluation metric. One such evaluation metric is what's called precision and recall. Let me explain what that is.

Let's say we are evaluating a classifier on the test set. For each example in the test set, the actual class of that example is going to be either one or zero, since this is a binary classification problem. And what our learning algorithm will do is predict some value for the class; it will predict a value for each example in the test set, and the predicted value will also be either one or zero.

So let me draw a two-by-two table as follows, where each of the entries depends on what the actual class was and what the predicted class was. If we have an example where the actual class is one and the predicted class is one, then that's called a true positive, meaning our algorithm predicted that it's positive and in reality the example is positive.
If our learning algorithm predicted that something is negative, class zero, and the actual class is also class zero, then that's what's called a true negative: we predicted zero and it actually is zero. To fill in the other two boxes: if our learning algorithm predicts that the class is one but the actual class is zero, then that's called a false positive. That means our algorithm said the patient has cancer when in reality the patient does not. And finally, the last box, where we predicted zero but the actual class was one, is called a false negative, because our algorithm predicted zero but the actual class was one. And so we have this little two-by-two table based on what the actual class was and what the predicted class was.

So here's a different way of evaluating the performance of our algorithm. We're going to compute two numbers. The first is called precision, and what that says is: of all the patients where we've predicted that they have cancer, what fraction of them actually have cancer? So let me write this down: the precision of a classifier is the number of true positives divided by the number of examples that we predicted as positive. So of all the patients we went to and told, "We think you have cancer," what fraction of them actually have cancer? That's called precision. Another way to write this is true positives in the numerator, and in the denominator the number of predicted positives, which is the sum of the entries in the first row of the table. So it would be true positives divided by true positives plus false positives; on the slide I'm abbreviating positive as "pos".
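Written out as a formula, using TP and FP for the counts of true and false positives from the table just described:

$$\text{Precision} \;=\; \frac{\text{true positives}}{\text{predicted positives}} \;=\; \frac{TP}{TP + FP}$$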
So that's called precision, and as you can tell, high precision would be good. It means that of all the patients we went to and said, "You know, we're very sorry, we think you have cancer," most of them were predictions we actually got right, and they do have cancer.

The second number we're going to compute is called recall, and what recall says is: of all the patients in, let's say, the test set or the cross-validation set, of all the patients in the data set that actually have cancer, what fraction of them did we correctly detect as having cancer? So of all the patients that have cancer, how many of them did we actually go to and correctly tell that we think they need treatment? So, writing this down, recall is defined as the number of true positives, meaning the number of people that have cancer and that we correctly predicted have cancer, divided by the number of actual positives, that is, the number of all the people that do have cancer. Of all the people that do have cancer, what fraction did we correctly flag and, you know, send to treatment?

To rewrite this in a different form, the denominator, the number of actual positives, is the sum of the entries in the first column of the table. And so, writing things out differently, this is the number of true positives divided by the number of true positives plus the number of false negatives. And once again, having a high recall would be a good thing. So computing precision and recall will usually give us a better sense of how well our classifier is doing.
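Here is a minimal sketch in Python of both metrics computed directly from lists of actual and predicted labels; the function name and the example labels are just illustrative, not from any particular library.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels, with y = 1 the rare class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0  # of predicted positives, how many are real
    recall    = tp / (tp + fn) if (tp + fn) > 0 else 0.0  # of actual positives, how many we caught
    return precision, recall

# A small made-up example: 3 actual positives, 2 of them caught, plus 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(precision_recall(y_true, y_pred))   # (0.666..., 0.666...)
```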
In particular, if we have a learning algorithm that predicts y equals zero all the time, if it predicts that no one has cancer, then this classifier will have a recall equal to zero, because there won't be any true positives. So that's a quick way for us to recognize that a classifier that predicts y equals 0 all the time just isn't a very good classifier. And more generally, even for settings where we have very skewed classes, it's not possible for an algorithm to sort of "cheat" and somehow get a very high precision and a very high recall by doing something simple like predicting y equals 0 all the time or predicting y equals 1 all the time. So we can be much more confident that a classifier with high precision and high recall actually is a good classifier, and this gives us a more useful evaluation metric, a more direct way to tell whether our algorithm is doing well.

One final note on the definition of precision and recall: usually we use the convention that y equals 1 denotes the presence of the rarer class. So if we are trying to detect a rare condition such as cancer (hopefully that is a rare condition), precision and recall are defined with y equals 1, rather than y equals 0, as the presence of the rare class that we're trying to detect. And by using precision and recall, we find that even if we have very skewed classes, it's not possible for an algorithm to, you know, "cheat" by predicting y equals 1 all the time, or predicting y equals 0 all the time, and get high precision and recall.
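As a quick check of that claim, here is a tiny standalone sketch, again with made-up numbers (5 positives out of 1,000), showing that the always-negative predictor gets a recall of zero; its precision is undefined here, since it never predicts a positive at all.

```python
# Made-up test set: 1,000 patients, 5 of whom actually have cancer.
y_true = [1] * 5 + [0] * 995
y_pred = [0] * 1000                 # "cheating" baseline: predict y = 0 for everyone

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)   # 0
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)   # 5

recall = tp / (tp + fn)
print(recall)    # 0.0 -- high accuracy, but recall immediately exposes the cheat
```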
And in particular, if a classifier is getting high precision and high recall, then we can be quite confident that the algorithm is doing well, even if we have very skewed classes. So for the problem of skewed classes, precision and recall give us more direct insight into how the learning algorithm is doing, and this is often a much better way to evaluate our learning algorithms than looking at classification error or classification accuracy when the classes are very skewed.