The term human-level performance is sometimes used casually in research articles. But let me show you how we can define it a bit more precisely, and in particular use the definition of the phrase human-level performance that is most useful for helping you drive progress in your machine learning project.

So remember from our last video that one of the uses of the phrase human-level error is that it gives us a way of estimating Bayes error: what is the best possible error any function could, either now or in the future, ever achieve? So bearing that in mind, let's look at a medical image classification example. Let's say that you want to look at a radiology image like this and make a diagnosis classification decision.

Suppose that a typical, untrained human achieves 3% error on this task. A typical doctor, maybe a typical radiologist, achieves 1% error. An experienced doctor does even better, 0.7% error. And a team of experienced doctors, that is, if you get a team of experienced doctors and have them all look at the image and discuss and debate it, achieves a consensus opinion with 0.5% error. So the question I want to pose to you is: how should you define human-level error? Is human-level error 3%, 1%, 0.7%, or 0.5%? Feel free to pause this video to think about it if you wish. To answer that question, I would urge you to bear in mind that one of the most useful ways to think of human-level error is as a proxy or an estimate for Bayes error.

Here's how I would define human-level error. If you want a proxy or an estimate for Bayes error, then given that a team of experienced doctors discussing and debating can achieve 0.5% error, we know that Bayes error is less than or equal to 0.5%. Because some system, this team of doctors, can achieve 0.5% error, by definition the optimal error has to be 0.5% or lower. We don't know how much better it is; maybe an even larger team of even more experienced doctors could do better still, so maybe it's a little bit better than 0.5%. But we know the optimal error cannot be higher than 0.5%. So what I would do in this setting is use 0.5% as our estimate for Bayes error.
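To make this concrete, here is a minimal sketch in Python, using the hypothetical error rates from this example, of how you might turn the different human performance levels into an estimate of Bayes error by taking the best, that is the lowest, error any human or group of humans achieves:

```python
# Hypothetical error rates from the medical imaging example.
human_errors = {
    "typical human": 0.03,
    "typical doctor": 0.01,
    "experienced doctor": 0.007,
    "team of experienced doctors": 0.005,
}

# Bayes error can be no higher than the best human-level performance,
# so the minimum observed error serves as our proxy for Bayes error.
bayes_error_estimate = min(human_errors.values())
print(f"Bayes error estimate <= {bayes_error_estimate:.1%}")  # prints 0.5%
```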
So I would define human-level performance as 0.5%, at least if you're hoping to use human-level error in the analysis of bias and variance as we saw in the last video.

Now, for the purpose of publishing a research paper, or for the purpose of deploying a system, maybe there's a different definition of human-level error that you can use, which is: so long as you surpass the performance of a typical doctor. That seems like a very useful result if accomplished, and surpassing a single radiologist's, a single doctor's, performance might mean the system is good enough to deploy in some context.

So maybe the takeaway from this is to be clear about what your purpose is in defining the term human-level error. If it is to show that you can surpass a single human, and therefore argue for deploying your system in some context, maybe the typical doctor's performance is the appropriate definition. But if your goal is a proxy for Bayes error, then the team of doctors' 0.5% is the appropriate definition. To see why this matters, let's look at an error analysis example.

Let's say, for a medical image diagnosis example, that your training error is 5% and your dev error is 6%. And from the example on the previous slide, take our human-level performance, which I'm going to think of as a proxy for Bayes error. Depending on whether you defined it as a typical doctor's performance, an experienced doctor's, or a team of doctors', you would have either 1%, 0.7%, or 0.5% for this. And remember our definitions from the previous video: the gap between the Bayes error estimate and the training error is a measure of the avoidable bias, and the gap between training error and dev error is a measure, or an estimate, of how much of a variance problem you have in your learning algorithm.

So in this first example, whichever of these choices you make, the measure of avoidable bias will be something like 4%: somewhere between 4%, if you use 1% as human-level error, and 4.5%, if you use 0.5%, whereas the variance is 1%. So in this example, I would say it doesn't really matter which of the definitions of human-level error you use, whether the typical doctor's error, the single experienced doctor's error, or the team of experienced doctors' error. Whether the avoidable bias is 4% or 4.5%, it is clearly bigger than the variance problem.
And so in this case, you should focus on bias reduction techniques, such as training a bigger network.

Now let's look at a second example. Let's say your training error is 1% and your dev error is 5%. Then again it seems almost academic whether the human-level performance is 1%, 0.7%, or 0.5%, because whichever of these definitions you use, your measure of avoidable bias will be somewhere between 0%, if you use 1%, and 0.5%, if you use 0.5%. That's the gap between human-level performance and your training error, whereas the gap between training and dev error is 4%. So this 4% is going to be much bigger than the avoidable bias either way, and it suggests you should focus on variance reduction techniques such as regularization or getting a bigger training set.

But where it really matters is if your training error is 0.7%, so you're doing really well now, and your dev error is 0.8%. In this case, it really matters that you use 0.5% as your estimate for Bayes error, because then your measure of avoidable bias is 0.2%, which is twice as big as your measure of variance, which is just 0.1%. So this suggests that maybe bias and variance are both problems, but the avoidable bias is a bit bigger of a problem. And in this example, 0.5%, as we discussed on the previous slide, was the best estimate of Bayes error, because a team of human doctors could achieve that performance. If you had used 0.7% as your proxy for Bayes error, you would have estimated the avoidable bias as pretty much 0%, and you might have missed the fact that you actually should try to do better on your training set.

So I hope this also gives a sense of why making progress on a machine learning problem gets harder as you approach human-level performance. In this example, once you've reached 0.7% error, unless you're very careful about estimating Bayes error, you might not know how far away you are from Bayes error, and therefore how much you should be trying to reduce the avoidable bias. In fact, if all you knew was that a single typical doctor achieves 1% error, it might be very difficult to know whether you should be trying to fit your training set even better.
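Here is a minimal Python sketch of this error analysis, using just the hypothetical numbers from these three examples. It computes avoidable bias and variance for each scenario, and then shows how the third example's conclusion depends on which human-level proxy you pick:

```python
def error_analysis(train_err, dev_err, bayes_est):
    """Avoidable bias = training error - Bayes error estimate;
    variance = dev error - training error. All rates are fractions."""
    return train_err - bayes_est, dev_err - train_err

# The three examples, using the team of doctors' 0.5% as the Bayes proxy.
examples = [(0.05, 0.06), (0.01, 0.05), (0.007, 0.008)]
for train_err, dev_err in examples:
    bias, var = error_analysis(train_err, dev_err, bayes_est=0.005)
    focus = "bias reduction" if bias > var else "variance reduction"
    print(f"avoidable bias {bias:.1%}, variance {var:.1%} -> focus on {focus}")

# In the third example, the choice of proxy changes the conclusion:
for label, proxy in [("typical doctor", 0.01),
                     ("experienced doctor", 0.007),
                     ("team of doctors", 0.005)]:
    bias, _ = error_analysis(0.007, 0.008, bayes_est=proxy)
    print(f"proxy {label} ({proxy:.1%}): avoidable bias = {bias:+.1%}")
# Only the 0.5% proxy reveals the remaining 0.2% of avoidable bias; the
# 0.7% proxy reports 0%, and the 1% proxy goes negative (already surpassed).
```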
And this problem arises only when you're doing very well on your problem already, only when you're at 0.7% and 0.8% error, really close to human-level performance. In the two examples on the left, when you were further away from human-level performance, it was easier to target your focus on bias or variance. So this is maybe an illustration of why, as you approach human-level performance, it's actually harder to tease out the bias and variance effects, and therefore why progress on your machine learning project just gets harder as you're doing really well.

So just to summarize what we've talked about. If you're trying to understand bias and variance where you have an estimate of human-level error, for a task that humans can do quite well, you can use human-level error as a proxy or an approximation for Bayes error. The difference between your training error and your estimate of Bayes error then tells you how much of a problem avoidable bias is, how much avoidable bias there is. And the difference between training error and dev error tells you how much of a problem variance is, that is, whether your algorithm is able to generalize from the training set to the dev set.

The big difference between our discussion here and what we saw in an earlier course is that instead of comparing training error to 0% and just calling that the estimate of the bias, in this video we have a more nuanced analysis in which there is no particular expectation that you should get 0% error. Sometimes Bayes error is nonzero, and sometimes it's just not possible for anything to do better than a certain threshold of error.

In the earlier course, we were measuring training error, seeing how much bigger it was than zero, and just using that to try to understand how big our bias is. That turns out to work just fine for problems where Bayes error is nearly 0%, such as recognizing cats: humans are near perfect at that, so Bayes error is also nearly 0%, and the approach works okay. But consider problems where the data is noisy, like speech recognition on very noisy audio, where it's just impossible sometimes to hear what was said and to get the correct transcription.
For problems like that, having a better estimate of Bayes error can help you better estimate avoidable bias and variance, and therefore make better decisions on whether to focus on bias reduction tactics or on variance reduction tactics.

So to recap: having an estimate of human-level performance gives you an estimate of Bayes error, and this allows you to more quickly make decisions as to whether you should focus on trying to reduce the bias or trying to reduce the variance of your algorithm. These techniques will tend to work well until you surpass human-level performance, whereupon you might no longer have a good estimate of Bayes error that helps you make this decision clearly.

Now, one of the exciting developments in deep learning has been that for more and more tasks we're actually able to surpass human-level performance. In the next video, let's talk more about the process of surpassing human-level performance.