Deep learning algorithms have a huge hunger for training data. They often work best when you can find enough labeled training data to put into the training set. This has resulted in many teams simply taking whatever data they can find and shoving it into the training set just to get more training data, even if some of this data, or maybe even a lot of it, doesn't come from the same distribution as the dev and test data. So in the deep learning era, more and more teams are training on data that comes from a different distribution than their dev and test sets. And there are some subtleties and some best practices for dealing with the case where your training and dev/test distributions differ from each other. Let's take a look.

Let's say you're building a mobile app where users will upload pictures taken from their cell phones, and you want to recognize whether the pictures your users upload through the mobile app are of a cat or not. You can now get two sources of data. One is the distribution of data you really care about: the data from the mobile app, like the images on the right, which tend to be less professionally shot, less well framed, and maybe even blurrier, because they're taken by amateur users. The other source is that you can crawl the web and, for the sake of this example, download a large number of very professionally framed, high-resolution, professionally taken images of cats. Let's say you don't have a lot of users for your mobile app yet, so maybe you've gotten 10,000 pictures uploaded from the mobile app. But by crawling the web you can download huge numbers of cat pictures, and maybe you have 200,000 pictures of cats downloaded off the Internet.

What you really care about is that your final system does well on the mobile app distribution of images, because in the end your users will be uploading pictures like those on the right, and you need your classifier to do well on that. But you now have a bit of a dilemma: you have a relatively small dataset, just 10,000 examples drawn from that distribution, and a much bigger dataset drawn from a different distribution, with a different appearance of image than the one you actually want.
You don't want to use just those 10,000 images, because that leaves you with a relatively small training set, and using the 200,000 images seems helpful. But the dilemma is that those 200,000 images aren't from exactly the distribution you want. So what can you do?

Here's one option. You can put both of these datasets together, so you now have 210,000 images, and then randomly shuffle them into a train, dev, and test set. Let's say, for the sake of argument, that you've decided your dev and test sets will be 2,500 examples each, so your training set will be 205,000 examples.

Setting up your data this way has some advantages but also disadvantages. The advantage is that your training, dev, and test sets will all come from the same distribution, which makes them easier to manage. But the disadvantage, and this is a huge disadvantage, is that if you look at your dev set, a lot of these 2,500 examples will come from the web page distribution of images rather than what you actually care about, which is the mobile app distribution of images. It turns out that of your total amount of data, 200,000 (I'll abbreviate that 200k) out of 210,000 (210k) comes from web pages. So of those 2,500 dev examples, on expectation about 2,381 will come from web pages. This is on expectation; the exact number will vary depending on how the random shuffle turned out. But on average, only about 119 will come from mobile app uploads.

Remember that setting up your dev set is telling your team where to aim the target. And the way you're aiming the target here, you're saying: spend most of your time optimizing for the web page distribution of images, which is really not what you want. So I would recommend against option one, because this is setting up the dev set to tell your team to optimize for a different distribution of data than the one you actually care about.
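To make that dev set skew concrete, here is a minimal Python sketch of option one. It is not code from the lecture; the lists web_images and mobile_images are hypothetical stand-ins for the two data sources, tagged with their origin so we can inspect the split afterwards.

```python
import random

# Hypothetical example pools, tagged with their source. The counts follow
# the lecture: 200,000 web-crawled images and 10,000 mobile app uploads.
web_images = [("web", i) for i in range(200_000)]
mobile_images = [("mobile", i) for i in range(10_000)]

# Option 1: pool everything, shuffle, then carve off dev and test sets.
combined = web_images + mobile_images
random.seed(0)                      # for a reproducible shuffle
random.shuffle(combined)

dev_set = combined[:2_500]
test_set = combined[2_500:5_000]
train_set = combined[5_000:]        # the remaining 205,000 examples

# Count how many dev examples actually came from the mobile app.
mobile_in_dev = sum(1 for source, _ in dev_set if source == "mobile")
print(f"train={len(train_set)}  dev={len(dev_set)}  test={len(test_set)}")
print(f"mobile app examples in dev: {mobile_in_dev} "
      f"(expected about {2_500 * 10_000 / 210_000:.0f})")
```

Running a sketch like this shows a dev set that is overwhelmingly web images, roughly the 2,381 versus 119 split described above, which is exactly why this option aims your team at the wrong target.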
So instead of doing this, I would recommend that you take another option, which is the following. The training set, let's say it's still 205,000 images, would have all 200,000 images from the web, and then you can, if you want, add in 5,000 images from the mobile app. And then your dev and test sets (my dataset sizes here aren't drawn to scale) would be all mobile app images.

So the training set will include 200,000 images from the web and 5,000 from the mobile app. The dev set will be 2,500 images from the mobile app, and the test set will be 2,500 images, also from the mobile app. The advantage of splitting up your data into train, dev, and test this way is that you're now aiming the target where you want it to be. You're telling your team: my dev set has data uploaded from the mobile app, and that's the distribution of images you really care about, so let's try to build a machine learning system that does really well on the mobile app distribution of images. The disadvantage, of course, is that your training distribution is now different from your dev and test set distributions. But it turns out that this split of your data into train, dev, and test will get you better performance over the long term. And we'll discuss later some specific techniques for dealing with your training set coming from a different distribution than your dev and test sets.
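Here is the same kind of sketch for this recommended split, again using the hypothetical web_images and mobile_images lists rather than anything from the lecture. The point is simply that the dev and test sets are drawn entirely from the mobile app pool.

```python
import random

# Same hypothetical pools as before: 200,000 web images, 10,000 mobile uploads.
web_images = [("web", i) for i in range(200_000)]
mobile_images = [("mobile", i) for i in range(10_000)]

random.seed(0)
random.shuffle(mobile_images)

# Option 2: dev and test come only from the distribution you care about.
dev_set = mobile_images[:2_500]          # 2,500 mobile app images
test_set = mobile_images[2_500:5_000]    # 2,500 mobile app images

# Training set: all 200,000 web images plus the remaining 5,000 mobile images.
train_set = web_images + mobile_images[5_000:]
random.shuffle(train_set)

print(f"train={len(train_set)}  dev={len(dev_set)}  test={len(test_set)}")
# train=205000  dev=2500  test=2500, with dev and test 100% mobile app images
```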
Let's look at another example. Let's say you're building a brand new product: a speech-activated rearview mirror for a car. This is a real product in China, and it's making its way into other countries. You can build a rearview mirror to replace the standard one, so that you can now talk to the rearview mirror and say, "Dear rearview mirror, please help me find navigational directions to the nearest gas station," and it will deal with it. So this is actually a real product, and let's say you're trying to build this for your own country.

So how can you get data to train up a speech recognition system for this product? Well, maybe you've worked on speech recognition for a long time, so you have a lot of data from other speech recognition applications, just not from a speech-activated rearview mirror. Here's how you could split up your training set and your dev and test sets.

For your training set, you can take all the speech data you've accumulated from working on other speech problems, such as data you purchased over the years from various speech recognition data vendors. Today you can actually buy data from vendors as (x, y) pairs, where x is an audio clip and y is a transcript. Or maybe you've worked on smart speakers, smart voice-activated speakers, so you have some data from that. Maybe you've worked on voice-activated keyboards, and so on. For the sake of argument, say you have 500,000 utterances from all of these sources. And for your dev and test sets, maybe you have a much smaller dataset that actually came from a speech-activated rearview mirror. Because users are asking navigational queries or trying to find directions to various places, this dataset will have a lot more street addresses: "Please help me navigate to this street address," or "Please help me navigate to this gas station." So this distribution of data will be very different from the data on the left. But this is really the data you care about, because this is what you need your product to do well on, so this is what you set your dev and test sets to be.

So what you do in this example is set your training set to be the 500,000 utterances on the left, and then your dev and test sets, which I'll abbreviate D and T, could be maybe 10,000 utterances each, drawn from the actual speech-activated rearview mirror. Or alternatively, if you think you don't need to put all 20,000 examples from your speech-activated rearview mirror into the dev and test sets, you could take half of them and put them in the training set. The training set would then be 510,000 utterances, including all 500,000 from the other sources and 10,000 from the rearview mirror, and the dev and test sets could be 5,000 utterances each. So of the 20,000 rearview mirror utterances, 10k goes into the training set, 5k into the dev set, and 5k into the test set. This would be another reasonable way of splitting your data into train, dev, and test. And it gives you a much bigger training set, over 500,000 utterances, than if you were to use only speech-activated rearview mirror data for your training set.
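As a sketch of that second alternative, here is how the split could look in code, assuming hypothetical utterance pools other_sources and rearview_mirror; the names and structure are illustrative, not from the lecture.

```python
import random

# Hypothetical utterance pools. "other_sources" stands in for the 500,000
# utterances from purchased data, smart speakers, voice keyboards, etc.;
# "rearview_mirror" for the 20,000 utterances from the actual product.
other_sources = [("other", i) for i in range(500_000)]
rearview_mirror = [("mirror", i) for i in range(20_000)]

random.seed(0)
random.shuffle(rearview_mirror)

# Dev and test: 5,000 rearview mirror utterances each.
dev_set = rearview_mirror[:5_000]
test_set = rearview_mirror[5_000:10_000]

# Training set: all 500,000 other utterances plus the remaining
# 10,000 rearview mirror utterances, for 510,000 in total.
train_set = other_sources + rearview_mirror[10_000:]
random.shuffle(train_set)

print(f"train={len(train_set)}  dev={len(dev_set)}  test={len(test_set)}")
# train=510000  dev=5000  test=5000
```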
So in this video, you've seen a couple of examples where letting your training set come from a different distribution than your dev and test sets gives you much more training data, and in these examples it will cause your learning algorithm to perform better. Now, one question you might ask is: should you always use all the data you have? The answer is subtle; it is not always yes. Let's look at a counterexample in the next video.