In the previous video, I talked about error analysis and the importance of error metrics, that is, of having a single real-number evaluation metric for your learning algorithm that tells you how well it's doing. In the context of evaluation and of error metrics, there is one important case where it's particularly tricky to come up with an appropriate error metric, or evaluation metric, for your learning algorithm. That is the case of what's called skewed classes. Let me tell you what that means.

Consider the problem of cancer classification, where we have features of medical patients and we want to decide whether or not they have cancer. This is like the malignant versus benign tumor classification example that we had earlier. Let's say y equals 1 if the patient has cancer and y equals 0 if they do not. We have trained a logistic regression classifier, and let's say we test it on a test set and find that we get 1 percent error, so we're making correct diagnoses 99 percent of the time. That seems like a really impressive result, right? We're correct 99 percent of the time.

But now let's say we find out that only 0.5 percent of the patients in our training and test sets actually have cancer, so only half a percent of the patients that come through our screening process have cancer. In this case, the 1 percent error no longer looks so impressive. In particular, here's a piece of code, actually a piece of non-learning code, that takes the input features x and ignores them. It just sets y equals 0 and always predicts that nobody has cancer, and this "algorithm" would actually get 0.5 percent error.
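As a concrete illustration, here is a minimal sketch in Python of the kind of non-learning baseline described above; the function name and the 0.5 percent figure are just the ones used in this example, not part of any library.

```python
def predict_cancer(x):
    """A 'classifier' that ignores its input features entirely."""
    return 0  # always predict y = 0: "no cancer"

# If only 0.5% of patients in the test set actually have cancer,
# this non-learning code is wrong on exactly those patients,
# so its test error is 0.5% -- lower than the 1% error of the
# trained classifier, even though it has learned nothing.
```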
So this is even better than the 1 percent error that we were getting just now, and it is a non-learning algorithm that is just predicting y equals 0 all the time. This setting, where the ratio of positive to negative examples is very close to one of two extremes (in this case, the number of positive examples is much, much smaller than the number of negative examples, because y equals 1 so rarely), is what we call the case of skewed classes. We just have a lot more examples from one class than from the other class, and by just predicting y equals 0 all the time, or maybe predicting y equals 1 all the time, an algorithm can do pretty well.

So the problem with using classification error or classification accuracy as our evaluation metric is the following. Let's say you have one learning algorithm that's getting 99.2% accuracy, so that's 0.8% error. Let's say you make a change to your algorithm and you are now getting 99.5% accuracy, that is, 0.5% error. Is this an improvement to the algorithm or not? One of the nice things about having a single real-number evaluation metric is that it helps us quickly decide whether we just made a useful change to the algorithm or not. By going from 99.2% accuracy to 99.5% accuracy, did we just do something useful, or did we just replace our code with something that predicts y equals zero more often?
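To make that arithmetic concrete, here is a quick sketch with made-up numbers (a hypothetical test set of 1,000 patients, 5 of whom have cancer) showing how far the always-negative predictor gets on accuracy alone.

```python
# Hypothetical test set: 1,000 patients, 0.5% of whom (5) actually have cancer.
n_total, n_positive = 1000, 5

# An "always predict y = 0" baseline is wrong only on the 5 positive patients.
accuracy = (n_total - n_positive) / n_total
error = 1 - accuracy

print(f"accuracy = {accuracy:.1%}, error = {error:.1%}")
# accuracy = 99.5%, error = 0.5% -- the same 99.5% as the "improved" algorithm,
# achieved without learning anything at all.
```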
So if you have very skewed classes, it becomes much harder to use just classification accuracy, because you can get very high classification accuracies, or very low errors, and it's not always clear whether doing so is really improving the quality of your classifier. Predicting y equals 0 all the time doesn't seem like a particularly good classifier, but just predicting y equals 0 more often can bring your error down to, you know, maybe as low as 0.5%. When we're faced with such skewed classes, therefore, we would want to come up with a different error metric, or a different evaluation metric. One such evaluation metric is what's called precision and recall. Let me explain what that is.

Let's say we are evaluating a classifier on the test set. For each example in the test set, the actual class of that example is going to be either one or zero, since this is a binary classification problem. And what our learning algorithm will do is predict some value for the class; it will predict a value for each example in the test set, and the predicted value will also be either one or zero.

So let me draw a two-by-two table as follows, where each of the entries depends on what the actual class was and what the predicted class was. If we have an example where the actual class is one and the predicted class is one, then that's called a true positive, meaning our algorithm predicted that it's positive and in reality the example is positive.
If our learning algorithm predicted that something is negative, class zero, and the actual class is also class zero, then that's what's called a true negative: we predicted zero and it actually is zero. To fill in the other two boxes: if our learning algorithm predicts that the class is one but the actual class is zero, then that's called a false positive. That means our algorithm said the patient has cancer when in reality the patient does not. And finally, the last box, where we predicted zero but the actual class was one, is called a false negative, because our algorithm predicted zero but the actual class was one. And so we have this little two-by-two table based on what the actual class was and what the predicted class was.

So here's a different way of evaluating the performance of our algorithm. We're going to compute two numbers. The first is called precision, and what that says is: of all the patients where we've predicted that they have cancer, what fraction of them actually have cancer? So let me write this down: the precision of a classifier is the number of true positives divided by the number of examples that we predicted as positive. So of all the patients we went to and told, "We think you have cancer," what fraction of them actually have cancer? That's called precision. Another way to write this is true positives in the numerator, and in the denominator the number of predicted positives, which is the sum of the entries in the first row of the table. So it would be true positives divided by true positives plus false positives; on the slide I'm abbreviating positive as "pos".
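Written out as a formula, using TP and FP for the counts of true and false positives from the table just described:

$$\text{Precision} \;=\; \frac{\text{true positives}}{\text{predicted positives}} \;=\; \frac{TP}{TP + FP}$$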
So that's called precision, and as you can tell, high precision would be good. It means that of all the patients we went to and said, "You know, we're very sorry, we think you have cancer," most of them were predictions we actually got right, and they do have cancer.

The second number we're going to compute is called recall, and what recall says is: of all the patients in, let's say, the test set or the cross-validation set, of all the patients in the data set that actually have cancer, what fraction of them did we correctly detect as having cancer? So of all the patients that have cancer, how many of them did we actually go to and correctly tell that we think they need treatment? So, writing this down, recall is defined as the number of true positives, meaning the number of people that have cancer and that we correctly predicted have cancer, divided by the number of actual positives, that is, the number of all the people that do have cancer. Of all the people that do have cancer, what fraction did we correctly flag and, you know, send to treatment?

To rewrite this in a different form, the denominator, the number of actual positives, is the sum of the entries in the first column of the table. And so, writing things out differently, this is the number of true positives divided by the number of true positives plus the number of false negatives. And once again, having a high recall would be a good thing. So computing precision and recall will usually give us a better sense of how well our classifier is doing.
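Here is a minimal sketch in Python of both metrics computed directly from lists of actual and predicted labels; the function name and the example labels are just illustrative, not from any particular library.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels, with y = 1 the rare class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0  # of predicted positives, how many are real
    recall    = tp / (tp + fn) if (tp + fn) > 0 else 0.0  # of actual positives, how many we caught
    return precision, recall

# A small made-up example: 3 actual positives, 2 of them caught, plus 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(precision_recall(y_true, y_pred))   # (0.666..., 0.666...)
```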
In particular, if we have a learning algorithm that predicts y equals zero all the time, if it predicts that no one has cancer, then this classifier will have a recall equal to zero, because there won't be any true positives. So that's a quick way for us to recognize that a classifier that predicts y equals 0 all the time just isn't a very good classifier. And more generally, even for settings where we have very skewed classes, it's not possible for an algorithm to sort of "cheat" and somehow get a very high precision and a very high recall by doing something simple like predicting y equals 0 all the time or predicting y equals 1 all the time. So we can be much more confident that a classifier with high precision and high recall actually is a good classifier, and this gives us a more useful evaluation metric, a more direct way to tell whether our algorithm is doing well.

One final note on the definition of precision and recall: usually we use the convention that y equals 1 denotes the presence of the rarer class. So if we are trying to detect a rare condition such as cancer (hopefully that is a rare condition), precision and recall are defined with y equals 1, rather than y equals 0, as the presence of the rare class that we're trying to detect. And by using precision and recall, we find that even if we have very skewed classes, it's not possible for an algorithm to, you know, "cheat" by predicting y equals 1 all the time, or predicting y equals 0 all the time, and get high precision and recall.
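As a quick check of that claim, here is a tiny standalone sketch, again with made-up numbers (5 positives out of 1,000), showing that the always-negative predictor gets a recall of zero; its precision is undefined here, since it never predicts a positive at all.

```python
# Made-up test set: 1,000 patients, 5 of whom actually have cancer.
y_true = [1] * 5 + [0] * 995
y_pred = [0] * 1000                 # "cheating" baseline: predict y = 0 for everyone

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)   # 0
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)   # 5

recall = tp / (tp + fn)
print(recall)    # 0.0 -- high accuracy, but recall immediately exposes the cheat
```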
And in particular, if a classifier is getting high precision and high recall, then we can be quite confident that the algorithm is doing well, even if we have very skewed classes. So for the problem of skewed classes, precision and recall give us more direct insight into how the learning algorithm is doing, and this is often a much better way to evaluate our learning algorithms than looking at classification error or classification accuracy when the classes are very skewed.