Hello, and welcome back. If you're trying to get a learning algorithm to do a task that humans can do, and your learning algorithm is not yet at human-level performance, then manually examining the mistakes your algorithm is making can give you insight into what to do next. This process is called error analysis.

Let's start with an example. Say you're working on your cat classifier, and you've achieved 90% accuracy, or equivalently 10% error, on your dev set, and that this is much worse than you were hoping for. Maybe one of your teammates looks at some of the examples the algorithm is misclassifying and notices that it is miscategorizing some dogs as cats. And if you look at these two dogs, maybe they look a little bit like cats, at least at first glance. So maybe your teammate comes to you with a proposal for how to make the algorithm do better, specifically on dogs. You can imagine a focused effort, maybe to collect more dog pictures, or to design features specific to dogs, to make your cat classifier do better on dogs so it stops misrecognizing them as cats.

So the question is, should you go ahead and start a project focused on the dog problem? It could take several months of work to make your algorithm make fewer mistakes on dog pictures. Is that worth your effort? Well, rather than spending a few months on this, only to risk finding out at the end that it wasn't that helpful, here's an error analysis procedure that can let you very quickly tell whether or not it could be worth the effort.

Here's what I recommend you do. First, get about 100 mislabeled dev set examples, that is, examples your algorithm got wrong on the dev set, and examine them manually. Just count them up one at a time, to see how many of these mislabeled examples are actually pictures of dogs.
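To make that first step concrete, here is a minimal sketch of pulling out roughly 100 misclassified dev set examples for manual review. The arrays preds and y_dev are hypothetical stand-ins for your model's dev set predictions and the true dev labels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in practice these come from model.predict(X_dev)
# and your labeled dev set.
preds = rng.integers(0, 2, size=2000)
y_dev = rng.integers(0, 2, size=2000)

# Indices of dev set examples the classifier got wrong.
error_idx = np.flatnonzero(preds != y_dev)

# Sample about 100 of them (or all, if there are fewer) to examine by hand.
sample = rng.choice(error_idx, size=min(100, len(error_idx)), replace=False)
print(f"{len(error_idx)} dev errors; manually review, e.g., {sample[:5]}")
```

The counting itself stays manual; the code only selects which examples to look at.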
Now, suppose it turns out that 5% of your 100 mislabeled dev set examples are pictures of dogs. That is, if 5 out of 100 of these mislabeled dev set examples are dogs, then of a typical set of 100 examples you're getting wrong, even if you completely solve the dog problem, you only get 5 out of 100 more correct. Or in other words, if only 5% of your errors are dog pictures, then the best you could hope to do, even if you spend a lot of time on the dog problem, is that your error might go down from 10% to 9.5%. That is a 5% relative decrease in error, from 10% down to 9.5%. So you might reasonably decide that this is not the best use of your time. Or maybe it is, but at least this gives you a ceiling, an upper bound, on how much you could improve performance by working on the dog problem. In machine learning, we sometimes call this the ceiling on performance, which just means: in the best case, how much could working on the dog problem help you? (The sketch at the end of this section shows the calculation in code.)

But now suppose something else happens. Suppose you look at your 100 mislabeled dev set examples and find that 50 of them, that is, 50%, are actually dog pictures. Now you could be much more optimistic about spending time on the dog problem. In this case, if you actually solved the dog problem, your error would go down from 10% to potentially 5%, and you might decide that halving your error is worth a substantial effort focused on reducing the problem of misclassified dogs.

I know that in machine learning we sometimes speak disparagingly of hand-engineering things, or of relying too much on human insight. But if you're building applied systems, this simple counting procedure, error analysis, can save you a lot of time in deciding what's the most important, or most promising, direction to focus on. In fact, looking through 100 mislabeled dev set examples is maybe a 5 to 10 minute effort: manually go through them and count up how many are dogs. And depending on the outcome, whether it's more like 5%, or 50%, or something else, those 5 to 10 minutes give you an estimate of how worthwhile this direction is, and can help you make a much better decision about whether to spend the next few months trying to solve the problem of misclassified dogs.

In this slide, we've described using error analysis to evaluate whether or not a single idea, dogs in this case, is worth working on.
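Here is the ceiling calculation from this example as a short sketch; the helper name error_ceiling is just for illustration:

```python
def error_ceiling(current_error, fraction_of_errors):
    """Best-case error if every mistake in one category were eliminated."""
    return current_error * (1 - fraction_of_errors)

# 5% of errors are dogs: solving dogs takes 10% error down to at best 9.5%.
print(f"{error_ceiling(0.10, 0.05):.1%}")  # 9.5%

# 50% of errors are dogs: solving dogs could halve the error, 10% -> 5%.
print(f"{error_ceiling(0.10, 0.50):.1%}")  # 5.0%
```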
Sometimes you can also evaluate multiple ideas in parallel during error analysis. For example, say you have several ideas for improving your cat detector. Maybe you can improve performance on dogs. Or maybe you notice that what are called great cats, such as lions, panthers, cheetahs, and so on, are sometimes being recognized as small cats, or house cats, so you could find a way to work on that. Or maybe you find that some of your images are blurry, and it would be nice if you could design something that works better on blurry images, and maybe you have some ideas for how to do that.

To carry out error analysis to evaluate these three ideas, what I would do is create a table like this. I usually do this in a spreadsheet, but an ordinary text file is also fine. Down the left side go the images you plan to look at manually, so the rows might run from 1 to 100 if you look at 100 pictures. The columns of the spreadsheet correspond to the ideas you're evaluating: the dog problem, the problem of great cats, and blurry images. I usually also leave space in the spreadsheet for comments.

Remember, during error analysis you're looking only at dev set examples that your algorithm misrecognized. So if you find that the first misrecognized image is a picture of a dog, I'd put a check mark in the dog column, and to help myself remember the image, sometimes I'll make a note in the comments; maybe it was a pit bull picture. If the second picture was blurry, make a note of that. If the third one was a lion at the zoo on a rainy day that was misrecognized, then check both the great cat and the blurry columns, and make a note in the comments: rainy day at zoo, and it was the rain that made it blurry, and so on.

Finally, having gone through some set of images, I would count up what percentage of these misrecognized examples fall into each error category: dog, great cat, or blurry. This just means going down each column and counting up what percentage of images have a check mark in that column. So maybe 8% of the images you examine turn out to be dogs, maybe 43% are great cats, and 61% are blurry. (The percentages can add up to more than 100% because a single image, like the blurry lion, can fall into several categories.)
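As a rough sketch, the same tally can be kept in code instead of a spreadsheet. The rows and comments below are made up to mirror the examples above:

```python
from collections import Counter

# One entry per misrecognized dev image: which error categories apply,
# plus a free-form comment, mirroring the spreadsheet's rows.
rows = [
    {"categories": {"dog"},                 "comment": "pit bull"},
    {"categories": {"blurry"},              "comment": ""},
    {"categories": {"great cat", "blurry"}, "comment": "lion, rainy day at zoo"},
    # ... one entry per image examined, up to ~100 ...
]

counts = Counter()
for row in rows:
    counts.update(row["categories"])

# Share of examined errors per category; categories overlap, so the
# percentages can sum to more than 100%.
for category, n in counts.most_common():
    print(f"{category}: {100 * n / len(rows):.0f}%")
```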
As you're partway through this process, sometimes you notice other categories of mistakes. For example, you might find that Instagram-style filters, those fancy image filters, are also messing up your classifier. In that case, it's perfectly fine, partway through the process, to add another column for the multi-colored filters, the Instagram filters and the Snapchat filters, then go back through, count those up as well, and figure out what percentage of errors comes from the new category.

The conclusion of this process gives you an estimate of how worthwhile it might be to work on each of these different categories of errors. For example, clearly in this example a lot of the mistakes were made on blurry images, and quite a lot on great cat images. The outcome of this analysis is not that you must work on blurry images; it doesn't give you a rigid mathematical formula that tells you what to do, but it gives you a sense of the best options to pursue. It also tells you, for example, that no matter how much better you do on dog images or on Instagram-filtered images, you'd improve performance by at most 8%, or 12%, in these examples. Whereas if you can do better on great cat images or blurry images, the potential improvement, the ceiling on how much you could improve performance, is much higher. So depending on how many ideas you have for improving performance on great cats or on blurry images, maybe you could pick one of the two; or if you have enough personnel on your team, maybe you could have two different teams, one working on improving errors on great cats, and another working on improving errors on blurry images.

This quick counting procedure, which you can often do in at most a small number of hours, can really help you make much better prioritization decisions, and understand how promising different approaches are to work on.
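Putting the two previous sketches together, here is a made-up example of adding a filters category partway through the tally and converting each category's share of errors into a ceiling on the overall 10% dev error:

```python
from collections import Counter

# Made-up tally of misrecognized images, as in the earlier sketch.
rows = [
    {"categories": {"dog"}},
    {"categories": {"blurry"}},
    {"categories": {"great cat", "blurry"}},
]

# Partway through, a new "filters" category is noticed: add it to an
# already-examined row and keep tallying new rows with it.
rows[0]["categories"].add("filters")
rows.append({"categories": {"filters"}})

counts = Counter()
for row in rows:
    counts.update(row["categories"])

overall_error = 0.10  # 10% dev error, as in the running example
for category, n in counts.most_common():
    share = n / len(rows)
    print(f"{category}: {share:.0%} of errors, "
          f"ceiling of {overall_error * share:.1%} absolute improvement")
```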
So, to summarize: to carry out error analysis, find a set of examples that your algorithm mislabeled in your dev set, looking at both the false positives and the false negatives, and count up the number of errors that fall into various different categories. During this process, you might be inspired to generate new categories of errors, like we saw: if you're looking through the examples and you say, gee, there are a lot of Instagram filters or Snapchat filters messing up my classifier, you can create new categories as you go. By counting up the fraction of examples that are mislabeled in different ways, you'll often get help prioritizing, or inspiration for new directions to pursue.

Now, as you're doing error analysis, sometimes you'll notice that some of the examples in your dev set carry incorrect labels, that is, the labels themselves are wrong. So what do you do about that? Let's discuss that in the next video.