1 00:00:00,000 --> 00:00:01,370 >> In the last video, 2 00:00:01,370 --> 00:00:05,445 you saw how your dev and test sets should come from the same distribution, 3 00:00:05,445 --> 00:00:07,370 but how large should they be? 4 00:00:07,370 --> 00:00:08,780 The guidelines to help set up 5 00:00:08,780 --> 00:00:11,955 your dev and test sets are changing in the Deep Learning era. 6 00:00:11,955 --> 00:00:14,645 Let's take a look at some best practices. 7 00:00:14,645 --> 00:00:17,870 You might have heard of the rule of thumb 8 00:00:17,870 --> 00:00:20,489 in machine learning of taking all the data you have 9 00:00:20,489 --> 00:00:26,495 and using a 70/30 split into a train and test set, 10 00:00:26,495 --> 00:00:30,800 or, if you had to set up train, dev, and test sets, 11 00:00:30,800 --> 00:00:42,705 maybe you would use 60% training, 20% dev, and 20% test. 12 00:00:42,705 --> 00:00:47,200 In earlier eras of machine learning, 13 00:00:47,200 --> 00:00:50,155 this was pretty reasonable, 14 00:00:50,155 --> 00:00:54,550 especially back when data set sizes were just smaller. 15 00:00:54,550 --> 00:00:57,085 So if you had a hundred examples in total, 16 00:00:57,085 --> 00:01:03,555 this 70/30 or 60/20/20 rule of thumb would be pretty reasonable. 17 00:01:03,555 --> 00:01:05,485 If you had a thousand examples, 18 00:01:05,485 --> 00:01:09,070 or maybe even ten thousand examples, 19 00:01:09,070 --> 00:01:13,070 these ratios are not unreasonable. 20 00:01:13,070 --> 00:01:16,255 But in the modern machine learning era, 21 00:01:16,255 --> 00:01:20,310 we are now used to working with much larger data set sizes. 22 00:01:20,310 --> 00:01:26,430 So let's say you have a million training examples, 23 00:01:26,430 --> 00:01:29,490 it might be quite reasonable to set up 24 00:01:29,490 --> 00:01:33,810 your data so that you have 98% in the training set, 25 00:01:33,810 --> 00:01:40,437 1% dev, and 1% test. 
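To make the arithmetic behind these ratios concrete, here is a minimal Python sketch that turns split fractions into example counts. The function name `split_sizes` and its parameters are hypothetical, chosen for illustration; only the 60/20/20 and 98/1/1 ratios and the data set sizes come from the lecture.

```python
def split_sizes(n_examples, train_frac, dev_frac, test_frac):
    """Return (n_train, n_dev, n_test) for the given split fractions.

    Hypothetical helper for illustration; not from the lecture.
    """
    if abs(train_frac + dev_frac + test_frac - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    n_dev = round(n_examples * dev_frac)
    n_test = round(n_examples * test_frac)
    # Whatever is left over after dev and test goes to training.
    n_train = n_examples - n_dev - n_test
    return n_train, n_dev, n_test


# Classic 60/20/20 rule of thumb on a small data set.
print(split_sizes(100, 0.6, 0.2, 0.2))            # (60, 20, 20)

# 98/1/1 split on a million examples.
print(split_sizes(1_000_000, 0.98, 0.01, 0.01))   # (980000, 10000, 10000)
```

With a hundred examples, a 1% dev set would be a single example, which is why the older ratios made sense at that scale; with a million examples, 1% is already 10,000.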
26 00:01:40,437 --> 00:01:44,590 Here I'm using D and T to abbreviate dev and test sets. 27 00:01:44,590 --> 00:01:46,710 Because if you have a million examples, 28 00:01:46,710 --> 00:01:48,285 then 1% of that 29 00:01:48,285 --> 00:01:54,800 is 10,000 examples, and that might be plenty for a dev set or for a test set. 30 00:01:54,800 --> 00:02:00,255 So, in the modern Deep Learning era where sometimes we have much larger data sets, 31 00:02:00,255 --> 00:02:04,020 it's quite reasonable to use much less than 20 32 00:02:04,020 --> 00:02:07,785 or 30% of your data for a dev set or a test set. 33 00:02:07,785 --> 00:02:12,690 And because Deep Learning algorithms have such a huge hunger for data, I'm seeing that, 34 00:02:12,690 --> 00:02:16,020 for problems where we have large data sets, 35 00:02:16,020 --> 00:02:20,430 a much larger fraction of the data goes into the training set. 36 00:02:20,430 --> 00:02:24,447 So, how about the test set? 37 00:02:24,447 --> 00:02:28,930 Remember the purpose of your test set is that, 38 00:02:28,930 --> 00:02:30,865 after you finish developing a system, 39 00:02:30,865 --> 00:02:34,360 the test set helps evaluate how good your final system is. 40 00:02:34,360 --> 00:02:37,690 The guideline is to set your test set to be big enough to give 41 00:02:37,690 --> 00:02:41,150 high confidence in the overall performance of your system. 42 00:02:41,150 --> 00:02:43,690 So, unless you need to have 43 00:02:43,690 --> 00:02:48,090 a very accurate measure of how well your final system is performing, 44 00:02:48,090 --> 00:02:54,059 maybe you don't need millions and millions of examples in a test set, 45 00:02:54,059 --> 00:02:57,640 and maybe for your application if you think that having 10,000 examples gives you 46 00:02:57,640 --> 00:03:00,545 enough confidence in the performance, or maybe 47 00:03:00,545 --> 00:03:03,725 100,000 or whatever it is, that might be enough. 
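One rough way to judge whether a test set is "big enough to give high confidence" is to estimate the margin of error on a measured accuracy. The sketch below is an illustration, not something stated in the lecture: it assumes test examples are independent draws from the same distribution, uses the normal approximation to the binomial, and takes a hypothetical true accuracy of 90%. The function name `accuracy_margin` is made up for this example.

```python
import math


def accuracy_margin(n_test, accuracy=0.5, z=1.96):
    """Approximate 95% margin of error for accuracy measured on n_test examples.

    Assumes independent test examples and the normal approximation to the
    binomial; z=1.96 corresponds to a 95% interval. Illustrative only.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_test)


# How the margin shrinks as the test set grows (assumed accuracy: 90%).
for n in (100, 1_000, 10_000, 100_000):
    margin = accuracy_margin(n, accuracy=0.9)
    print(f"n_test={n:>7}: accuracy known to within about +/- {margin:.4f}")
```

Under these assumptions, 10,000 test examples pin down the accuracy to within roughly 0.6 percentage points, while 100 examples leave a margin of nearly 6 points. Whether that is enough depends on how small a difference you need to detect for your application.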
48 00:03:03,725 --> 00:03:05,260 And this could be much less than, 49 00:03:05,260 --> 00:03:07,340 say 30% of the overall data set, 50 00:03:07,340 --> 00:03:08,440 depending on how much data you have. 51 00:03:08,440 --> 00:03:13,250 For some applications, 52 00:03:13,250 --> 00:03:18,320 maybe you don't need high confidence in the overall performance of your final system. 53 00:03:18,320 --> 00:03:23,055 Maybe all you need is a train and dev set, 54 00:03:23,055 --> 00:03:29,230 and I think not having a test set might be okay. 55 00:03:29,230 --> 00:03:31,685 In fact, what sometimes happened was, 56 00:03:31,685 --> 00:03:33,965 people were talking about using 57 00:03:33,965 --> 00:03:40,580 train test splits but what they were actually doing was iterating on the test set. 58 00:03:40,580 --> 00:03:42,250 So rather than a test set, 59 00:03:42,250 --> 00:03:46,415 what they had was a train dev split and no test set. 60 00:03:46,415 --> 00:03:48,604 If you're actually tuning to this set, 61 00:03:48,604 --> 00:03:50,390 to this so-called test set, 62 00:03:50,390 --> 00:03:53,205 it's better to call it the dev set. 63 00:03:53,205 --> 00:03:56,335 Although I think in the history of machine learning, 64 00:03:56,335 --> 00:03:59,875 not everyone has been completely clean and completely rigorous 65 00:03:59,875 --> 00:04:03,895 about calling it the dev set when it is really being used as one. 66 00:04:03,895 --> 00:04:07,485 But, if all you care about is having some data that you train on, 67 00:04:07,485 --> 00:04:09,150 and having some data to tune to, 68 00:04:09,150 --> 00:04:11,682 and you're just going to ship the final system 69 00:04:11,682 --> 00:04:15,710 and not worry too much about how it is actually doing, 70 00:04:15,710 --> 00:04:17,940 I think it would be healthier to just call it a train dev split 71 00:04:17,940 --> 00:04:20,700 and acknowledge that you have no test set. 72 00:04:20,700 --> 00:04:22,720 This is a bit unusual. 
73 00:04:22,720 --> 00:04:26,970 I'm definitely not recommending not having a test set when building a system. 74 00:04:26,970 --> 00:04:30,225 I do find it reassuring to have a separate test set 75 00:04:30,225 --> 00:04:33,900 you can use to get an unbiased estimate of how well it is doing before you ship it, 76 00:04:33,900 --> 00:04:37,770 but if you have a very large dev 77 00:04:37,770 --> 00:04:41,650 set so that you think you won't overfit the dev set too badly, 78 00:04:41,650 --> 00:04:45,200 maybe it's not totally unreasonable to just have a train dev set, 79 00:04:45,200 --> 00:04:48,800 although it's not what I usually recommend. 80 00:04:48,800 --> 00:04:51,600 So to summarize, in the era of big data, 81 00:04:51,600 --> 00:04:54,500 I think the old rule of thumb of a 70/30 split 82 00:04:54,500 --> 00:04:56,275 no longer applies. 83 00:04:56,275 --> 00:05:01,035 And the trend has been to use more data for training and less for dev and test, 84 00:05:01,035 --> 00:05:03,220 especially when you have very large data sets. 85 00:05:03,220 --> 00:05:06,960 And the rule of thumb is really to try to set the dev set to be big enough for its purpose, 86 00:05:06,960 --> 00:05:11,110 which is to help you evaluate different ideas and tell whether A or B is better. 87 00:05:11,110 --> 00:05:15,450 And the purpose of the test set is to help you evaluate your final classifier. 88 00:05:15,450 --> 00:05:18,590 You just have to set your test set big enough for that purpose, 89 00:05:18,590 --> 00:05:21,710 and that could be much less than 30% of the data. 90 00:05:21,710 --> 00:05:24,810 So, I hope that gives some guidance or some suggestions on how to 91 00:05:24,810 --> 00:05:28,710 set up your dev and test sets in the Deep Learning era. 
92 00:05:28,710 --> 00:05:30,595 Next, it turns out that sometimes, 93 00:05:30,595 --> 00:05:32,640 part way through a machine learning problem, 94 00:05:32,640 --> 00:05:34,800 you might want to change your evaluation metric, 95 00:05:34,800 --> 00:05:36,615 or change your dev and test sets. 96 00:05:36,615 --> 00:05:38,250 Let's talk about when you might want to do