Whether you're tuning hyperparameters, trying out different ideas for learning algorithms, or just trying out different options for building your machine learning system, you'll find that your progress will be much faster if you have a single real number evaluation metric that lets you quickly tell whether the new thing you just tried is working better or worse than your last idea. So when teams are starting on a machine learning project, I often recommend that you set up a single real number evaluation metric for your problem. Let's look at an example.

You've heard me say before that applied machine learning is a very empirical process. We often have an idea, code it up, run the experiment to see how it did, and then use the outcome of the experiment to refine our ideas, and we keep going around this loop as we keep on improving the algorithm. So let's say that for your cat classifier you had previously built some classifier A, and by changing the hyperparameters, the training set, or something else, you've now trained a new classifier, B. One reasonable way to evaluate the performance of your classifier is to look at its precision and recall. The exact details of precision and recall don't matter too much for this example, but briefly, precision is defined as: of the examples that your classifier recognizes as cats, what percentage actually are cats? So if classifier A has 95% precision, this means that when classifier A says something is a cat, there's a 95% chance it really is a cat. And recall is: of all the images that really are cats, what percentage were correctly recognized by your classifier? In other words, what percentage of actual cats are correctly recognized? So if classifier A has 90% recall, this means that of all of the images in, say, your dev set that really are cats, classifier A correctly pulled out 90% of them. Don't worry too much about the definitions of precision and recall. It turns out that there's often a tradeoff between precision and recall, and you care about both: you want that, when the classifier says something is a cat, there's a high chance it really is a cat, but of all the images that are cats, you also want it to pull out a large fraction of them as cats.
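As a small aside that is not part of the lecture itself, here is a minimal Python sketch of these two definitions, assuming you have two equal-length lists of 0/1 labels (1 meaning "cat"); the names y_true and y_pred are made up for this illustration.

# Minimal sketch: precision and recall for a binary cat classifier.
# Assumes two equal-length lists of 0/1 labels, where 1 means "cat".
def precision_recall(y_true, y_pred):
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pred_pos = sum(y_pred)    # everything the classifier called a cat
    actual_pos = sum(y_true)  # everything that really is a cat
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

# Example: four dev set images, the classifier flags three of them as cats.
p, r = precision_recall([1, 1, 0, 1], [1, 1, 1, 0])
print(p, r)  # precision = 2/3, recall = 2/3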
So it might be reasonable to evaluate your classifiers in terms of their precision and their recall. The problem with using precision and recall as your evaluation metric is that if classifier A does better on recall, which it does here, and classifier B does better on precision, then you're not sure which classifier is better. And if you're trying out a lot of different ideas and a lot of different hyperparameters, you want to quickly try out not just two classifiers but maybe a dozen classifiers, and quickly pick out the, quote, best ones, so you can keep on iterating from there. With two evaluation metrics, it is difficult to know how to quickly pick one of the two, or one of the ten.

So what I recommend is that rather than using two numbers, precision and recall, to pick a classifier, you find a new evaluation metric that combines precision and recall. In the machine learning literature, the standard way to combine precision and recall is something called the F1 score. The details of the F1 score aren't too important, but informally, you can think of it as the average of precision, P, and recall, R. Formally, the F1 score is defined by this formula: F1 = 2 / (1/P + 1/R). In mathematics, this function is called the harmonic mean of precision P and recall R. But less formally, you can think of it as a way of averaging precision and recall, only instead of taking the arithmetic mean, you take the harmonic mean, which is defined by this formula, and it has some advantages in terms of trading off precision and recall. In this example, you can see right away that classifier A has a better F1 score, and assuming the F1 score is a reasonable way to combine precision and recall, you can quickly select classifier A over classifier B.

So what I've found for a lot of machine learning teams is that having a well-defined dev set, which is how you're measuring precision and recall, plus a single number evaluation metric, sometimes I'll call it a single real number evaluation metric, allows you to quickly tell whether classifier A or classifier B is better, and therefore having a dev set plus a single number evaluation metric tends to speed up iterating.
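To make the harmonic-mean formula above concrete, here is a minimal Python sketch, not part of the lecture. Classifier A's 95% precision and 90% recall come from the example above; classifier B's numbers are illustrative assumptions, chosen only so that B wins on precision while A wins on F1.

def f1_score(p, r):
    # Harmonic mean of precision p and recall r: F1 = 2 / (1/p + 1/r).
    if p == 0 or r == 0:
        return 0.0
    return 2 / (1 / p + 1 / r)

# Classifier A: 95% precision, 90% recall (from the example above).
# Classifier B: 98% precision, 85% recall (assumed for illustration).
print(f1_score(0.95, 0.90))  # about 0.924 -> A has the better F1 score
print(f1_score(0.98, 0.85))  # about 0.910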
It speeds up this iterative process of improving your machine learning algorithm. Let's look at another example. Let's say you're building a cat app for cat lovers in four major geographies: the US, China, India, and Other (the rest of the world). And let's say that your two classifiers achieve different errors on data from these four different geographies. So algorithm A achieves 3% error on pictures submitted by US users, and so on. It might be reasonable to keep track of how well your classifiers do in these different markets or geographies, but with four numbers it's very difficult to look at them and quickly decide whether algorithm A or algorithm B is superior, and if you're testing a lot of different classifiers, it's just difficult to look at all these numbers and quickly pick one. So what I recommend in this example is, in addition to tracking your performance in the four different geographies, to also compute the average. Assuming that average performance is a reasonable single real number evaluation metric, by computing the average you can quickly tell that it looks like algorithm C has the lowest average error, and you might then go ahead with that one, because you have to pick an algorithm to keep on iterating from.

Your workflow in machine learning is often that you have an idea, you implement it and try it out, and you want to know whether your idea helped. So what we've seen in this video is that having a single number evaluation metric can really improve your efficiency, or the efficiency of your team, in making those decisions. Now we're not yet done with the discussion of how to effectively set up evaluation metrics. In the next video, I'm going to share with you how to set up optimizing as well as satisficing metrics. So let's take a look at that in the next video.
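As a closing illustration that is not part of the lecture, here is a minimal Python sketch of the geography-averaging idea described above: track per-region error rates, then compare algorithms on their average. Apart from algorithm A's 3% US error, the numbers are made up for this example.

# Minimal sketch: pick the algorithm with the lowest average error across
# geographies. All numbers except A's 3% US error are illustrative only.
errors = {
    "A": {"US": 0.03, "China": 0.07, "India": 0.05, "Other": 0.09},
    "B": {"US": 0.05, "China": 0.06, "India": 0.05, "Other": 0.10},
    "C": {"US": 0.04, "China": 0.05, "India": 0.04, "Other": 0.07},
}

avg_error = {name: sum(e.values()) / len(e) for name, e in errors.items()}
best = min(avg_error, key=avg_error.get)
print(avg_error)  # roughly A: 0.06, B: 0.065, C: 0.05
print(best)       # "C" has the lowest average error with these numbers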