Welcome to this course on the practical aspects of deep learning. Perhaps by now you've learned how to implement a neural network. This week you'll learn the practical aspects of how to make your neural network work well, ranging from things like hyperparameter tuning, to how to set up your data, to how to make sure your optimization algorithm runs quickly, so that you get your learning algorithm to learn in a reasonable time. In this first week, we'll first talk about how to set up your machine learning problem, then we'll talk about regularization, and we'll talk about some tricks for making sure your neural network implementation is correct. With that, let's get started.

Making good choices in how you set up your training, development, and test sets can make a huge difference in helping you quickly find a good, high-performance neural network. When training a neural network you have to make a lot of decisions, such as: how many layers will your neural network have? How many hidden units do you want each layer to have? What's the learning rate? And what activation functions do you want to use for the different layers? When you're starting on a new application, it's almost impossible to correctly guess the right values for all of these, and for other hyperparameter choices, on your first attempt. So in practice, applied machine learning is a highly iterative process, in which you often start with an idea, such as wanting to build a neural network with a certain number of layers and a certain number of hidden units, maybe on certain datasets, and so on. Then you code it up and try it: you run an experiment and get back a result that tells you how well this particular network, or this particular configuration, works. Based on the outcome, you might then refine your ideas, change your choices, and keep iterating in order to try to find a better and better neural network.

Today, deep learning has found great success in a lot of areas, ranging from natural language processing, to computer vision, to speech recognition, to a lot of applications on structured data. Structured data includes everything from advertisements to web search, which isn't just Internet search engines; it's also, for example, shopping websites.
Really, it's any website that wants to deliver great search results when you enter terms into a search bar. It extends to computer security, to logistics, such as figuring out where to send drivers to pick up and drop off things, and to many more areas. What I'm seeing is that sometimes a researcher with a lot of experience in NLP might try to do something in computer vision. Or maybe a researcher with a lot of experience in speech recognition might jump in and try to do something on advertising. Or someone from security might want to jump in and do something on logistics. And what I've seen is that intuitions from one domain, or from one application area, often do not transfer to other application areas. The best choices may depend on the amount of data you have, the number of input features you have, your computer configuration, and whether you're training on GPUs or CPUs, and if so, exactly what configuration of GPUs and CPUs, and many other things. So for a lot of applications, even very experienced deep learning people find it almost impossible to correctly guess the best choice of hyperparameters the very first time. And so today, applied deep learning is a very iterative process, where you just have to go around this cycle many times to hopefully find a good choice of network for your application. So one of the things that determines how quickly you can make progress is how efficiently you can go around this cycle. And setting up your datasets well, in terms of your train, development, and test sets, can make you much more efficient at that.

So if this is your data, let's draw it as a big box. Then traditionally, you might take all the data you have and carve off some portion of it to be your training set, and some portion of it to be your hold-out cross validation set, which is sometimes also called the development set. For brevity I'm just going to call this the dev set, but all of these terms mean roughly the same thing. And then you might carve out some final portion of it to be your test set. The workflow is that you keep on training algorithms on your training set, and use your dev set, or your hold-out cross validation set, to see which of many different models performs best on your dev set.
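To make that carving-up concrete, here is a minimal sketch of how you might split a dataset into train, dev, and test portions. The function name and the 60/20/20 default fractions are just illustrative assumptions, not anything fixed by this course:

```python
import numpy as np

def train_dev_test_split(X, Y, dev_frac=0.2, test_frac=0.2, seed=0):
    """A minimal sketch: shuffle the m examples, then carve off
    train, dev, and test portions. X has shape (m, n_features),
    Y has shape (m,). The 60/20/20 defaults are the traditional
    ratios; shrink dev_frac and test_frac for very large datasets."""
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)        # shuffle so each split is representative
    X, Y = X[perm], Y[perm]

    n_dev = int(m * dev_frac)
    n_test = int(m * test_frac)
    n_train = m - n_dev - n_test

    X_train, Y_train = X[:n_train], Y[:n_train]
    X_dev, Y_dev = X[n_train:n_train + n_dev], Y[n_train:n_train + n_dev]
    X_test, Y_test = X[n_train + n_dev:], Y[n_train + n_dev:]
    return (X_train, Y_train), (X_dev, Y_dev), (X_test, Y_test)
```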
Then, after having done this long enough, when you have a final model that you want to evaluate, you can take the best model you have found and evaluate it on your test set, in order to get an unbiased estimate of how well your algorithm is doing.

In the previous era of machine learning, it was common practice to take all your data and split it maybe 70/30, in what people often call a 70/30 train/test split if you don't have an explicit dev set, or maybe 60/20/20: 60% train, 20% dev, and 20% test. Several years ago this was widely considered best practice in machine learning. If you have maybe 100 examples in total, maybe 1,000 examples in total, or maybe 10,000 examples, these sorts of ratios were perfectly reasonable rules of thumb. But in the modern big data era, where, for example, you might have a million examples in total, the trend is that your dev and test sets have been becoming a much smaller percentage of the total. Because remember, the goal of the dev set, or the development set, is that you're going to test different algorithms on it and see which algorithm works better. So the dev set just needs to be big enough for you to evaluate, say, two different algorithm choices, or ten different algorithm choices, and quickly decide which one is doing better. And you might not need a whole 20% of your data for that. So, for example, if you have a million training examples, you might decide that just having 10,000 examples in your dev set is more than enough to evaluate which of a few algorithms does better. In a similar vein, the main goal of your test set is, given your final classifier, to give you a pretty confident estimate of how well it's doing. And again, if you have a million examples, you might decide that 10,000 examples is more than enough to evaluate a single classifier and give you a good estimate of how well it's doing. So in this example, where you have a million examples, if you need just 10,000 for your dev set and 10,000 for your test set, then since 10,000 is 1% of 1 million, your ratio will be more like 98% train, 1% dev, 1% test.
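The arithmetic behind that 98/1/1 split is worth seeing once: with a large dataset you pick absolute dev and test sizes that are big enough for their jobs, and the training fraction is simply whatever is left over. A quick illustrative calculation, using only the numbers from the example above:

```python
m = 1_000_000    # total examples
n_dev = 10_000   # enough to compare a handful of models
n_test = 10_000  # enough to estimate final performance
n_train = m - n_dev - n_test

print(f"train: {n_train / m:.0%}, dev: {n_dev / m:.0%}, test: {n_test / m:.0%}")
# -> train: 98%, dev: 1%, test: 1%
```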
And I've also seen applications where, if you have even more than a million examples, you might end up with 99.5% train, 0.25% dev, and 0.25% test, or maybe 0.4% dev and 0.1% test. So just to recap: when setting up your machine learning problem, I'll often set it up into train, dev, and test sets. If you have a relatively small dataset, these traditional ratios might be okay. But if you have a much larger dataset, it's also fine to set your dev and test sets to be much smaller than 20% or even 10% of your data. We'll give more specific guidelines on the sizes of dev and test sets later in this specialization.

One other trend we're seeing in the era of modern deep learning is that more and more people train on mismatched train and test distributions. Let's say you're building an app that lets users upload a lot of pictures, and your goal is to find pictures of cats in order to show your users; maybe all your users are cat lovers. Maybe your training set comes from cat pictures downloaded off the Internet, but your dev and test sets comprise cat pictures from users of your app. So maybe your training set has a lot of pictures crawled off the Internet, but the dev and test sets are pictures uploaded by users. It turns out a lot of webpages have very high resolution, very professional, very nicely framed pictures of cats, but maybe your users are uploading blurrier, lower-res images, taken with a cell phone camera in more casual conditions. And so these two distributions of data may be different. The rule of thumb I'd encourage you to follow in this case is to make sure that the dev and test sets come from the same distribution. We'll say more about this particular guideline as well, but because you will be using the dev set to evaluate a lot of different models, and trying really hard to improve performance on the dev set, it's nice if your dev set comes from the same distribution as your test set. But because deep learning algorithms have such a huge hunger for training data, one trend I'm seeing is that you might use all sorts of creative tactics, such as crawling webpages, in order to acquire a much bigger training set than you would otherwise have, even if part of the cost is that your training set data might not come from the same distribution as your dev and test sets.
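As a sketch of how that data assembly might look, here the dev and test sets are drawn only from app uploads, so they share a distribution, while the much larger web crawl goes only into training. The file names and the 2,500-example dev/test sizes are hypothetical, purely for illustration:

```python
import random

# Hypothetical stand-ins for the two data sources; in practice these would be
# lists of image files from your app's uploads and from a web crawl.
app_images = [f"app_{i}.jpg" for i in range(10_000)]   # users' cell-phone photos
web_images = [f"web_{i}.jpg" for i in range(200_000)]  # crawled high-res photos

random.seed(0)
random.shuffle(app_images)

# Rule of thumb: dev and test both come from the same (app-upload) distribution.
dev_set = app_images[:2_500]
test_set = app_images[2_500:5_000]

# All remaining data, including every crawled image, goes into training,
# even though its distribution differs from the dev/test distribution.
train_set = web_images + app_images[5_000:]
```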
But you find that, so long as you follow this rule of thumb, progress in your machine learning algorithm will be faster. And I'll give a more detailed explanation for this particular rule of thumb later in the specialization as well.

Finally, it might be okay not to have a test set. Remember, the goal of the test set is to give you an unbiased estimate of the performance of your final network, the network that you selected. But if you don't need that unbiased estimate, then it might be okay not to have a test set. So what you do, if you have only a dev set but not a test set, is you train on the training set, try different model architectures, evaluate them on the dev set, and then use that to iterate and try to get to a good model; a short code sketch of this loop appears below. Because you've fit your data to the dev set, this no longer gives you an unbiased estimate of performance. But if you don't need one, that might be perfectly fine. In the machine learning world, when you have just a train and a dev set but no separate test set, most people will call the training set the training set, and they will call the dev set the test set. But what they actually end up doing is using the test set as a hold-out cross validation set, which maybe isn't a completely great use of terminology, because they're then overfitting to the test set. So when a team tells you that they have only a train and a test set, I would just be cautious and think: do they really have a train and a dev set? Because they're overfitting to the test set. Culturally, it might be difficult to change some of these teams' terminology and get them to call it a train/dev split rather than a train/test split, even though I think calling it a train and development set would be more correct terminology. And this is actually okay practice if you don't need a completely unbiased estimate of the performance of your algorithm.

So having set up train, dev, and test sets will allow you to iterate more quickly. It will also allow you to more efficiently measure the bias and variance of your algorithm, so you can more efficiently select ways to improve it.
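Here is that train/dev-only iteration loop as a small runnable sketch. The scikit-learn classifier, the synthetic data, and the three hyperparameter configurations are stand-ins chosen purely for illustration, not anything prescribed by this course:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic data split into train and dev only; no test set in this workflow.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, y_train = X[:1600], y[:1600]
X_dev, y_dev = X[1600:], y[1600:]

# An illustrative handful of hyperparameter choices to compare on the dev set.
configs = [
    {"hidden_layer_sizes": (16,), "learning_rate_init": 0.01},
    {"hidden_layer_sizes": (64, 64), "learning_rate_init": 0.001},
    {"hidden_layer_sizes": (128,), "learning_rate_init": 0.0003},
]

best_config, best_acc = None, -1.0
for config in configs:
    model = MLPClassifier(max_iter=500, random_state=0, **config)
    model.fit(X_train, y_train)        # fit on the training set only
    acc = model.score(X_dev, y_dev)    # compare models on the dev set
    if acc > best_acc:
        best_config, best_acc = config, acc

print("best config:", best_config, "dev accuracy:", best_acc)
# Because every choice was tuned against the dev set, best_acc is an
# optimistic (biased) estimate; an untouched test set would remove that bias.
```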
Let's start to talk about that in the next video.