1 00:00:00,050 --> 00:00:03,050 The way you set up your training dev, 2 00:00:03,050 --> 00:00:04,195 or development sets and test sets, 3 00:00:04,195 --> 00:00:06,810 can have a huge impact on how rapidly you or 4 00:00:06,810 --> 00:00:09,985 your team can make progress on building machine learning application. 5 00:00:09,985 --> 00:00:12,895 The same teams, even teams in very large companies, 6 00:00:12,895 --> 00:00:15,540 set up these data sets in ways that really slows down, 7 00:00:15,540 --> 00:00:18,125 rather than speeds up, the progress of the team. 8 00:00:18,125 --> 00:00:20,130 Let's take a look at how you can set up 9 00:00:20,130 --> 00:00:23,433 these data sets to maximize your team's efficiency. 10 00:00:23,433 --> 00:00:28,325 In this video, I want to focus on how you set up your dev and test sets. 11 00:00:28,325 --> 00:00:33,020 So, that dev set is also called the development set, 12 00:00:33,020 --> 00:00:36,940 or sometimes called the hold out cross validation set. 13 00:00:36,940 --> 00:00:42,265 And, workflow in machine learning is that you try a lot of ideas, 14 00:00:42,265 --> 00:00:44,200 train up different models on the training set, 15 00:00:44,200 --> 00:00:47,950 and then use the dev set to evaluate the different ideas and pick one. 16 00:00:47,950 --> 00:00:51,280 And, keep innovating to improve dev set performance until, finally, 17 00:00:51,280 --> 00:00:56,240 you have one clause that you're happy with that you then evaluate on your test set. 18 00:00:56,240 --> 00:00:59,800 Now, let's say, by way of example, 19 00:00:59,800 --> 00:01:01,995 that you're building a cat crossfire, 20 00:01:01,995 --> 00:01:05,500 and you are operating in these regions: in the U.S, 21 00:01:05,500 --> 00:01:07,720 U.K, other European countries, South America, 22 00:01:07,720 --> 00:01:10,490 India, China, other Asian countries, and Australia. 23 00:01:10,490 --> 00:01:14,529 So, how do you set up your dev set and your test set? 24 00:01:14,529 --> 00:01:19,285 Well, one way you could do so is to pick four of these regions. 25 00:01:19,285 --> 00:01:22,555 I'm going to use these four but it could be four randomly chosen regions. 26 00:01:22,555 --> 00:01:25,705 And say, that data from these four regions will go into the dev set. 27 00:01:25,705 --> 00:01:28,580 And, the other four regions, I'm going to use these four, 28 00:01:28,580 --> 00:01:30,530 could be randomly chosen four as well, 29 00:01:30,530 --> 00:01:33,350 that those will go into the test set. 30 00:01:33,350 --> 00:01:36,940 It turns out, this is a very bad idea because in this example, 31 00:01:36,940 --> 00:01:40,780 your dev and test sets come from different distributions. 32 00:01:40,780 --> 00:01:44,345 I would, instead, recommend that you find a way to make your dev and 33 00:01:44,345 --> 00:01:49,555 test sets come from the same distribution. So, here's what I mean. 34 00:01:49,555 --> 00:01:51,590 One picture to keep in mind is that, I think, 35 00:01:51,590 --> 00:01:54,530 setting up your dev set, plus, 36 00:01:54,530 --> 00:01:57,662 your single role number evaluation metric, 37 00:01:57,662 --> 00:01:59,840 that's like placing a target and telling 38 00:01:59,840 --> 00:02:03,395 your team where you think is the bull's eye you want to aim at. 39 00:02:03,395 --> 00:02:07,165 Because, what happen once you've established that dev set and the metric is that, 40 00:02:07,165 --> 00:02:09,925 the team can innovate very quickly, try different ideas, 41 00:02:09,925 --> 00:02:13,100 run experiments and very quickly use the dev set and 42 00:02:13,100 --> 00:02:16,997 the metric to evaluate crossfires and try to pick the best one. 43 00:02:16,997 --> 00:02:21,720 So, machine learning teams are often very good at shooting different arrows into 44 00:02:21,720 --> 00:02:26,732 targets and innovating to get closer and closer to hitting the bullseye. 45 00:02:26,732 --> 00:02:30,173 So, doing well on your metric on your dev sets. 46 00:02:30,173 --> 00:02:32,040 And, the problem with how we've set up 47 00:02:32,040 --> 00:02:34,680 the dev and test sets in the example on the left is that, 48 00:02:34,680 --> 00:02:39,450 your team might spend months innovating to do well on the dev set only to realize that, 49 00:02:39,450 --> 00:02:41,570 when you finally go to test them on the test set, 50 00:02:41,570 --> 00:02:45,900 that data from these four countries or these four regions at the bottom, 51 00:02:45,900 --> 00:02:49,520 might be very different than the regions in your dev set. 52 00:02:49,520 --> 00:02:51,765 So, you might have a nasty surprise and realize that, 53 00:02:51,765 --> 00:02:54,690 all the months of work you spent optimizing to the dev set, 54 00:02:54,690 --> 00:02:58,800 is not giving you good performance on the test set. 55 00:02:58,800 --> 00:03:03,180 So, having dev and test sets from different distributions is like setting a target, 56 00:03:03,180 --> 00:03:06,525 having your team spend months trying to aim closer and closer to bull's eye, 57 00:03:06,525 --> 00:03:08,865 only to realize after months of work that, 58 00:03:08,865 --> 00:03:10,550 you'll say, "Oh wait, to test it, 59 00:03:10,550 --> 00:03:12,005 I'm going to move target over here." 60 00:03:12,005 --> 00:03:14,160 And, the team might say, "Well, 61 00:03:14,160 --> 00:03:18,320 why did you make us spend months optimizing for a different bull's eye when suddenly, 62 00:03:18,320 --> 00:03:21,950 you can move the bull's eye to a different location somewhere else?" 63 00:03:21,950 --> 00:03:23,010 So, to avoid this, 64 00:03:23,010 --> 00:03:24,510 what I recommend instead is that, 65 00:03:24,510 --> 00:03:29,985 you take all this randomly shuffled data into the dev and test set. 66 00:03:29,985 --> 00:03:33,917 So that, both the dev and test sets have data from all eight regions 67 00:03:33,917 --> 00:03:38,205 and that the dev and test sets really come from the same distribution, 68 00:03:38,205 --> 00:03:41,490 which is the distribution of all of your data mixed together. 69 00:03:41,490 --> 00:03:43,766 Here's another example. This is a, 70 00:03:43,766 --> 00:03:46,200 actually, true story but with some details changed. 71 00:03:46,200 --> 00:03:48,210 So, I know a machine learning team that actually spent 72 00:03:48,210 --> 00:03:50,610 several months optimizing on a dev set 73 00:03:50,610 --> 00:03:55,400 which was comprised of loan approvals for medium income zip codes. 74 00:03:55,400 --> 00:03:57,465 So, the specific machine learning problem was, 75 00:03:57,465 --> 00:04:00,805 "Given an input X about a loan application, 76 00:04:00,805 --> 00:04:02,820 can you predict why and which is, 77 00:04:02,820 --> 00:04:04,907 whether or not, they'll repay the loan?" 78 00:04:04,907 --> 00:04:07,760 So, this helps you decide whether or not to approve a loan. 79 00:04:07,760 --> 00:04:11,370 And so, the dev set came from loan applications. 80 00:04:11,370 --> 00:04:13,565 They came from medium income zip codes. 81 00:04:13,565 --> 00:04:16,870 Zip codes is what we call postal codes in the United States. 82 00:04:16,870 --> 00:04:18,990 But, after working on this for a few months, the team then, 83 00:04:18,990 --> 00:04:21,555 suddenly decided to test this on 84 00:04:21,555 --> 00:04:24,650 data from low income zip codes or low income postal codes. 85 00:04:24,650 --> 00:04:27,595 And, of course, the distributional data for 86 00:04:27,595 --> 00:04:30,900 medium income and low income zip codes is very different. 87 00:04:30,900 --> 00:04:34,810 And, the crossfire, that they spend so much time optimizing in the former case, 88 00:04:34,810 --> 00:04:39,165 just didn't work well at all on the latter case. 89 00:04:39,165 --> 00:04:42,750 And so, this particular team actually wasted about three months of 90 00:04:42,750 --> 00:04:47,053 time and had to go back and really re-do a lot of work. 91 00:04:47,053 --> 00:04:48,540 And, what happened here was, 92 00:04:48,540 --> 00:04:52,035 the team spent three months aiming for one target, 93 00:04:52,035 --> 00:04:54,060 and then, after three months, 94 00:04:54,060 --> 00:04:55,490 the manager asked, "Oh, 95 00:04:55,490 --> 00:04:57,750 how are you doing on hitting this other target?" 96 00:04:57,750 --> 00:04:59,340 This is a totally different location. 97 00:04:59,340 --> 00:05:02,306 And, it just was a very frustrating experience for the team. 98 00:05:02,306 --> 00:05:05,530 So, what I recommand for setting up a dev set and test set is, 99 00:05:05,530 --> 00:05:08,520 choose a dev set and test set to reflect data you expect to get in 100 00:05:08,520 --> 00:05:11,535 future and consider important to do well on. 101 00:05:11,535 --> 00:05:14,850 And, in particular, the dev set and the test set here, 102 00:05:14,850 --> 00:05:20,338 should come from the same distribution. 103 00:05:20,338 --> 00:05:23,660 So, whatever type of data you expect to get in the future, 104 00:05:23,660 --> 00:05:25,415 and once you do well on, 105 00:05:25,415 --> 00:05:27,745 try to get data that looks like that. 106 00:05:27,745 --> 00:05:29,050 And, whatever that data is, 107 00:05:29,050 --> 00:05:32,245 put it into both your dev set and your test set. 108 00:05:32,245 --> 00:05:33,920 Because that way, you're putting 109 00:05:33,920 --> 00:05:36,440 the target where you actually want to hit and you're having 110 00:05:36,440 --> 00:05:40,705 the team innovate very efficiently to hitting that same target, 111 00:05:40,705 --> 00:05:41,826 hopefully, the same targets well. 112 00:05:41,826 --> 00:05:45,965 Since we haven't talked yet about how to set up a training set, 113 00:05:45,965 --> 00:05:48,790 we'll talk about the training set in a later video. 114 00:05:48,790 --> 00:05:51,335 But, the important take away from this video is that, 115 00:05:51,335 --> 00:05:53,690 setting up the dev set, 116 00:05:53,690 --> 00:05:56,300 as well as the validation metric, 117 00:05:56,300 --> 00:05:59,780 is really defining what target you want to aim at. 118 00:05:59,780 --> 00:06:04,145 And hopefully, by setting the dev set and the test set to the same distribution, 119 00:06:04,145 --> 00:06:08,659 you're really aiming at whatever target you hope your machine learning team will hit. 120 00:06:08,659 --> 00:06:10,870 The way you choose your training set 121 00:06:10,870 --> 00:06:14,510 will affect how well you can actually hit that target. 122 00:06:14,510 --> 00:06:18,400 But, we can talk about that separately in a later video. 123 00:06:18,400 --> 00:06:20,830 So, I know some machine learning teams that could literally have saved 124 00:06:20,830 --> 00:06:23,825 themselves months of work could they follow the guidelines in this video. 125 00:06:23,825 --> 00:06:26,235 So, I hope these guidelines will help you, too. 126 00:06:26,235 --> 00:06:29,666 Next, it turns out, that the size of your dev and test sets, 127 00:06:29,666 --> 00:06:31,015 how to choose the size of them, 128 00:06:31,015 --> 00:06:33,391 is also changing the area of deep learning. 129 00:06:33,391 --> 00:06:35,290 Let's talk about that in the next video.