1 00:00:00,000 --> 00:00:02,700 If your training set comes from a different distribution 2 00:00:02,700 --> 00:00:04,135 than your dev and test sets, 3 00:00:04,135 --> 00:00:09,623 and if error analysis shows you that you have a data mismatch problem, what can you do? 4 00:00:09,623 --> 00:00:13,105 There aren't completely systematic solutions to this, 5 00:00:13,105 --> 00:00:15,520 but let's look at some things you could try. 6 00:00:15,520 --> 00:00:19,107 If I find that I have a large data mismatch problem, 7 00:00:19,107 --> 00:00:23,865 what I usually do is carry out manual error analysis and try to 8 00:00:23,865 --> 00:00:31,865 understand the differences between the training set and the dev/test sets. 9 00:00:31,865 --> 00:00:34,272 To avoid overfitting the test set, 10 00:00:34,272 --> 00:00:35,800 technically for error analysis, 11 00:00:35,800 --> 00:00:40,030 you should manually only look at the dev set and not at the test set. 12 00:00:40,030 --> 00:00:42,040 But as a concrete example, 13 00:00:42,040 --> 00:00:47,020 if you're building the speech-activated rear-view mirror application, 14 00:00:47,020 --> 00:00:50,020 you might look at or, I guess if it's speech, 15 00:00:50,020 --> 00:00:53,230 listen to examples in your dev set to try 16 00:00:53,230 --> 00:00:56,885 to figure out how your dev set is different from your training set. 17 00:00:56,885 --> 00:00:58,890 So, for example, you might find 18 00:00:58,890 --> 00:01:03,745 that a lot of dev set examples are very noisy and there's a lot of car noise. 19 00:01:03,745 --> 00:01:08,485 And this is one way that your dev set differs from your training set. 20 00:01:08,485 --> 00:01:11,350 And maybe you find other categories of errors.
21 00:01:11,350 --> 00:01:17,095 For example, in the speech-activated rear-view mirror in your car, 22 00:01:17,095 --> 00:01:20,650 you might find that it's often mis-recognizing 23 00:01:20,650 --> 00:01:22,600 street numbers because there are 24 00:01:22,600 --> 00:01:25,450 a lot more navigational queries which will have street addresses. 25 00:01:25,450 --> 00:01:28,420 So, getting street numbers right is really important. 26 00:01:28,420 --> 00:01:31,110 When you have insight into the nature of the dev set errors, 27 00:01:31,110 --> 00:01:33,960 or you have insight into how the dev 28 00:01:33,960 --> 00:01:37,055 set may be different or harder than your training set, 29 00:01:37,055 --> 00:01:41,645 what you can do is then try to find ways to make the training data more similar. 30 00:01:41,645 --> 00:01:47,290 Or, alternatively, try to collect more data similar to your dev and test sets. 31 00:01:47,290 --> 00:01:53,940 So, for example, if you find that car noise in the background is a major source of error, 32 00:01:53,940 --> 00:02:00,120 one thing you could do is simulate noisy in-car data. 33 00:02:00,120 --> 00:02:03,580 I'll say a little bit more about how to do this on the next slide. 34 00:02:03,580 --> 00:02:06,710 Or if you find that you're having a hard time recognizing street numbers, 35 00:02:06,710 --> 00:02:10,280 maybe you can go and deliberately try to get more data of 36 00:02:10,280 --> 00:02:15,150 people speaking out numbers and add that to your training set. 37 00:02:15,150 --> 00:02:20,555 Now, I realize that this slide is giving a rough guideline for things you could try. 38 00:02:20,555 --> 00:02:23,525 This isn't a systematic process and, 39 00:02:23,525 --> 00:02:27,720 I guess, there's no guarantee that you get the insights you need to make progress.
40 00:02:27,720 --> 00:02:32,045 But I have found that this manual insight, 41 00:02:32,045 --> 00:02:35,870 together with trying to make the data more similar on the dimensions that 42 00:02:35,870 --> 00:02:39,765 matter, often helps on a lot of problems. 43 00:02:39,765 --> 00:02:46,010 So, if your goal is to make the training data more similar to your dev set, 44 00:02:46,010 --> 00:02:48,620 what are some things you can do? 45 00:02:48,620 --> 00:02:50,270 One of the techniques you can use is 46 00:02:50,270 --> 00:02:52,970 artificial data synthesis, and let's discuss 47 00:02:52,970 --> 00:02:56,810 that in the context of addressing the car noise problem. 48 00:02:56,810 --> 00:02:59,240 So, to build a speech recognition system, 49 00:02:59,240 --> 00:03:01,970 maybe you don't have a lot of audio that was actually 50 00:03:01,970 --> 00:03:05,030 recorded inside the car with the background noise of a car, 51 00:03:05,030 --> 00:03:07,040 background noise of a highway, and so on. 52 00:03:07,040 --> 00:03:09,440 But, it turns out, there's a way to synthesize it. 53 00:03:09,440 --> 00:03:11,435 So, let's say that you've recorded 54 00:03:11,435 --> 00:03:15,620 a large amount of clean audio without this car background noise. 55 00:03:15,620 --> 00:03:20,400 So, here's an example of a clip you might have in your training set. 56 00:03:21,190 --> 00:03:26,340 By the way, this sentence is used a lot in AI for 57 00:03:26,340 --> 00:03:30,590 testing because this is a short sentence that contains every letter of the alphabet from A to Z, 58 00:03:30,590 --> 00:03:32,745 so you see this sentence a lot. 59 00:03:32,745 --> 00:03:36,540 But, given that recording of "the quick brown fox jumps over the lazy dog," you 60 00:03:36,540 --> 00:03:46,455 can then also get a recording of car noise like this. 61 00:03:46,455 --> 00:03:49,010 So, that's what the inside of a car sounds like, 62 00:03:49,010 --> 00:03:50,595 if you're driving in silence.
63 00:03:50,595 --> 00:03:53,460 And if you take these two audio clips and add them together, 64 00:03:53,460 --> 00:03:55,595 you can then synthesize what 65 00:03:55,595 --> 00:03:58,835 saying "the quick brown fox jumps over the lazy dog" would sound like, 66 00:03:58,835 --> 00:04:06,870 if you were saying that in a noisy car. So, it sounds like this. 67 00:04:06,870 --> 00:04:10,980 So, this is a relatively simple audio synthesis example. 68 00:04:10,980 --> 00:04:14,210 In practice, you might synthesize other audio effects like 69 00:04:14,210 --> 00:04:16,370 reverberation, which is the sound of 70 00:04:16,370 --> 00:04:19,700 your voice bouncing off the walls of the car, and so on. 71 00:04:19,700 --> 00:04:22,370 But through artificial data synthesis, 72 00:04:22,370 --> 00:04:26,900 you might be able to quickly create more data that sounds like it 73 00:04:26,900 --> 00:04:32,540 was recorded inside the car without needing to go out there and collect tons of data, 74 00:04:32,540 --> 00:04:34,850 maybe thousands or tens of thousands of hours of 75 00:04:34,850 --> 00:04:37,700 data in a car that's actually driving along. 76 00:04:37,700 --> 00:04:41,210 So, if your error analysis shows you that you should try to 77 00:04:41,210 --> 00:04:45,050 make your data sound more like it was recorded inside the car, 78 00:04:45,050 --> 00:04:47,705 then this could be a reasonable process for 79 00:04:47,705 --> 00:04:51,310 synthesizing that type of data to give your learning algorithm. 80 00:04:51,310 --> 00:04:54,380 Now, there is one note of caution I 81 00:04:54,380 --> 00:04:57,855 want to sound on artificial data synthesis which is that, 82 00:04:57,855 --> 00:05:04,814 let's say, you have 10,000 hours of data that was recorded against a quiet background. 83 00:05:04,814 --> 00:05:11,915 And, let's say, that you have just one hour of car noise.
84 00:05:11,915 --> 00:05:14,940 So, one thing you could try is take this one hour 85 00:05:14,940 --> 00:05:17,820 of car noise and repeat it 10,000 times in 86 00:05:17,820 --> 00:05:24,695 order to add to this 10,000 hours of data recorded against a quiet background. 87 00:05:24,695 --> 00:05:29,355 If you do that, the audio will sound perfectly fine to the human ear, 88 00:05:29,355 --> 00:05:30,600 but there is a chance, 89 00:05:30,600 --> 00:05:38,790 there is a risk that your learning algorithm will overfit to the one hour of car noise. 90 00:05:38,790 --> 00:05:44,325 And, in particular, if this is the set of 91 00:05:44,325 --> 00:05:52,460 all audio that you could record in the car or, 92 00:05:52,460 --> 00:05:56,195 maybe the set of all car noise backgrounds you can imagine, 93 00:05:56,195 --> 00:05:59,285 if you have just one hour of car noise background, 94 00:05:59,285 --> 00:06:03,660 you might be simulating just a very small subset of this space. 95 00:06:03,660 --> 00:06:09,010 You might be just synthesizing from a very small subset of this space. 96 00:06:09,010 --> 00:06:10,870 And to the human ear, 97 00:06:10,870 --> 00:06:13,990 all this audio sounds just fine because one hour of car noise 98 00:06:13,990 --> 00:06:18,030 sounds just like any other hour of car noise to the human ear. 99 00:06:18,030 --> 00:06:23,880 But, it's possible that you're synthesizing data from a very small subset of this space, 100 00:06:23,880 --> 00:06:25,840 and the neural network might be 101 00:06:25,840 --> 00:06:30,530 overfitting to the one hour of car noise that you may have.
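To make the synthesis step concrete, here is a minimal Python sketch of adding car noise to a clean clip. It is an illustration under assumptions, not the pipeline used in the lecture: the function name mix_at_snr, the snr_db parameter, and the random-crop strategy are hypothetical choices, and the sine wave and Gaussian noise below are stand-ins for real recordings.

```python
import numpy as np


def mix_at_snr(clean, noise_pool, snr_db, rng):
    """Add a randomly cropped stretch of noise to a clean clip at a target SNR in dB.

    All names here are hypothetical; this is a sketch, not a reference implementation.
    """
    if len(noise_pool) < len(clean):
        raise ValueError("noise pool is shorter than the clean clip")
    # Pick a random offset so different clips receive different stretches of noise.
    start = rng.integers(0, len(noise_pool) - len(clean) + 1)
    noise = noise_pool[start:start + len(clean)]
    # Scale the noise so that (clean power / noise power) matches the requested SNR.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return clean + scale * noise


rng = np.random.default_rng(0)
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
clean = 0.5 * np.sin(2 * np.pi * 220 * t)                 # stand-in for one second of clean speech
noise_pool = rng.normal(0.0, 0.1, size=10 * sample_rate)  # stand-in for recorded car noise
noisy = mix_at_snr(clean, noise_pool, snr_db=10, rng=rng)
```

Cropping at random offsets varies which stretch of noise each clip receives, but it cannot add variety that the noise pool lacks: with one hour of car noise, every synthesized clip still comes from that same hour.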
102 00:06:30,530 --> 00:06:33,355 I don't know if it will be practically feasible to 103 00:06:33,355 --> 00:06:37,090 inexpensively collect 10,000 hours of car noise so that 104 00:06:37,090 --> 00:06:39,310 you don't need to repeat the same one hour of 105 00:06:39,310 --> 00:06:42,550 car noise over and over but you have 10,000 unique hours 106 00:06:42,550 --> 00:06:48,024 of car noise to add to 10,000 hours of unique audio recorded against a clean background. 107 00:06:48,024 --> 00:06:50,900 But it's possible, no guarantees. 108 00:06:50,900 --> 00:06:56,710 But it is possible that using 10,000 hours of unique car noise rather than just one hour, 109 00:06:56,710 --> 00:07:01,167 that could result in better performance for your learning algorithm. 110 00:07:01,167 --> 00:07:05,650 And the challenge with artificial data synthesis is that, to the human ear, 111 00:07:05,650 --> 00:07:07,340 as far as your ears can tell, 112 00:07:07,340 --> 00:07:10,850 these 10,000 hours all sound the same as this one hour, 113 00:07:10,850 --> 00:07:13,175 so you might end up creating this 114 00:07:13,175 --> 00:07:16,310 very impoverished synthesized data set from 115 00:07:16,310 --> 00:07:19,890 a much smaller subset of the space without actually realizing it. 116 00:07:19,890 --> 00:07:23,265 Here's another example of artificial data synthesis. 117 00:07:23,265 --> 00:07:26,495 Let's say you're building a self-driving car and so you want to really detect 118 00:07:26,495 --> 00:07:31,260 vehicles like this and put a bounding box around them, let's say. 119 00:07:31,260 --> 00:07:34,550 So, one idea that a lot of people have discussed is, well, 120 00:07:34,550 --> 00:07:39,070 why not use computer graphics to simulate tons of images of cars? 121 00:07:39,070 --> 00:07:41,045 And, in fact, here are a couple of pictures of 122 00:07:41,045 --> 00:07:44,045 cars that were generated using computer graphics.
123 00:07:44,045 --> 00:07:46,970 And I think these graphics effects are actually pretty good and I can 124 00:07:46,970 --> 00:07:50,210 imagine that by synthesizing pictures like these, 125 00:07:50,210 --> 00:07:54,510 you could train a pretty good computer vision system for detecting cars. 126 00:07:54,510 --> 00:07:56,570 Unfortunately, the picture that I 127 00:07:56,570 --> 00:08:00,740 drew on the previous slide again applies in this setting. 128 00:08:00,740 --> 00:08:05,075 Maybe this is the set of all cars and, 129 00:08:05,075 --> 00:08:10,200 if you synthesize just a very small subset of these cars, 130 00:08:10,200 --> 00:08:12,775 then to the human eye, 131 00:08:12,775 --> 00:08:15,145 maybe the synthesized images look fine. 132 00:08:15,145 --> 00:08:18,985 But you might overfit to this small subset you're synthesizing. 133 00:08:18,985 --> 00:08:23,590 In particular, one idea that a lot of people have independently raised is, 134 00:08:23,590 --> 00:08:26,950 why not find a video game with good computer graphics of cars and just 135 00:08:26,950 --> 00:08:31,115 grab images from it and get a huge data set of pictures of cars? 136 00:08:31,115 --> 00:08:33,805 It turns out that if you look at a video game, 137 00:08:33,805 --> 00:08:38,065 if the video game has just 20 unique cars in the video game, 138 00:08:38,065 --> 00:08:39,700 then the video game looks fine 139 00:08:39,700 --> 00:08:42,190 because you're driving around in the video game and you see 140 00:08:42,190 --> 00:08:47,065 these 20 other cars and it looks like a pretty realistic simulation. 141 00:08:47,065 --> 00:08:51,715 But the world has a lot more than 20 unique designs of cars, 142 00:08:51,715 --> 00:08:56,200 and if your entire synthesized training set has only 20 distinct cars, 143 00:08:56,200 --> 00:09:00,485 then your neural network will probably overfit to these 20 cars.
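Since the eye cannot easily spot this kind of narrow coverage, one option is to count it. The sketch below is a hypothetical bookkeeping check, not something described in the lecture: it assumes you record, for each synthesized example, an identifier of the source asset it came from (the name source_coverage and the car_model_N identifiers are made up for illustration), and it reports how many distinct sources sit behind the whole data set.

```python
from collections import Counter


def source_coverage(source_ids):
    """Summarize how many distinct source assets a synthesized data set draws on.

    Hypothetical helper: assumes one source identifier is recorded per example.
    """
    counts = Counter(source_ids)
    n_examples = len(source_ids)
    n_sources = len(counts)
    return {
        "examples": n_examples,
        "distinct_sources": n_sources,
        "examples_per_source": n_examples / n_sources if n_sources else 0.0,
    }


# Made-up data set: 100,000 rendered images that all come from 20 car models.
synthetic_ids = [f"car_model_{i % 20}" for i in range(100_000)]
report = source_coverage(synthetic_ids)
print(report)
```

A large number of examples per source is a hint, not proof, that the synthesized set may cover only a small part of the space.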
144 00:09:00,485 --> 00:09:03,985 And it's difficult for a person to easily tell that, 145 00:09:03,985 --> 00:09:06,130 even though these images look realistic, 146 00:09:06,130 --> 00:09:11,780 you're really covering such a tiny subset of the set of all possible cars. 147 00:09:11,780 --> 00:09:15,310 So, to summarize, if you think you have a data mismatch problem, 148 00:09:15,310 --> 00:09:17,640 I recommend you do error analysis, 149 00:09:17,640 --> 00:09:18,820 or look at the training set, 150 00:09:18,820 --> 00:09:20,785 or look at the dev set to try to figure out, 151 00:09:20,785 --> 00:09:24,685 to try to gain insight into how these two distributions of data might differ. 152 00:09:24,685 --> 00:09:26,950 And then see if you can find some ways to get 153 00:09:26,950 --> 00:09:30,025 more training data that looks a bit more like your dev set. 154 00:09:30,025 --> 00:09:33,185 One of the ways we talked about is artificial data synthesis. 155 00:09:33,185 --> 00:09:35,515 And artificial data synthesis does work. 156 00:09:35,515 --> 00:09:39,630 In speech recognition, I've seen artificial data synthesis significantly 157 00:09:39,630 --> 00:09:43,870 boost the performance of what were already very good speech recognition systems. 158 00:09:43,870 --> 00:09:45,505 So, it can work very well. 159 00:09:45,505 --> 00:09:47,675 But, if you're using artificial data synthesis, 160 00:09:47,675 --> 00:09:51,505 just be cautious and bear in mind whether or not you might be accidentally 161 00:09:51,505 --> 00:09:57,105 simulating data only from a tiny subset of the space of all possible examples. 162 00:09:57,105 --> 00:10:01,990 So, that's it for how to deal with data mismatch. 163 00:10:01,990 --> 00:10:04,690 Next, I'd like to share with you some thoughts 164 00:10:04,690 --> 00:10:08,390 on how to learn from multiple types of data at the same time.