1 00:00:00,000 --> 00:00:03,836 Most computer vision task could use more data. 2 00:00:03,836 --> 00:00:07,350 And so data augmentation is one of the techniques that is 3 00:00:07,350 --> 00:00:11,995 often used to improve the performance of computer vision systems. 4 00:00:11,995 --> 00:00:15,535 I think that computer vision is a pretty complicated task. 5 00:00:15,535 --> 00:00:16,745 You have to input this image, 6 00:00:16,745 --> 00:00:21,870 all these pixels and then figure out what is in this picture. 7 00:00:21,870 --> 00:00:26,615 And it seems like you need to learn the decently complicated function to do that. 8 00:00:26,615 --> 00:00:32,160 And in practice, there almost all competing visions task having more data will help. 9 00:00:32,160 --> 00:00:36,580 This is unlike some other domains where sometimes you can get enough data, 10 00:00:36,580 --> 00:00:39,610 they don't feel as much pressure to get even more data. 11 00:00:39,610 --> 00:00:42,428 But I think today, this data computer vision is that, 12 00:00:42,428 --> 00:00:44,655 for the majority of computer vision problems, 13 00:00:44,655 --> 00:00:47,655 we feel like we just can't get enough data. 14 00:00:47,655 --> 00:00:50,783 And this is not true for all applications of machine learning, 15 00:00:50,783 --> 00:00:53,490 but it does feel like it's true for computer vision. 16 00:00:53,490 --> 00:00:57,120 So, what that means is that when you're training in computer vision model, 17 00:00:57,120 --> 00:00:59,880 often data augmentation will help. 18 00:00:59,880 --> 00:01:02,520 And this is true whether you're using transfer learning or 19 00:01:02,520 --> 00:01:05,720 using someone else's pre-trained ways to start, 20 00:01:05,720 --> 00:01:09,055 or whether you're trying to train something yourself from scratch. 21 00:01:09,055 --> 00:01:13,755 Let's take a look at the common data augmentation that is in computer vision. 22 00:01:13,755 --> 00:01:19,920 Perhaps the simplest data augmentation method is mirroring on the vertical axis, 23 00:01:19,920 --> 00:01:22,995 where if you have this example in your training set, 24 00:01:22,995 --> 00:01:27,045 you flip it horizontally to get that image on the right. 25 00:01:27,045 --> 00:01:29,300 And for most computer vision task, 26 00:01:29,300 --> 00:01:33,475 if the left picture is a cat then mirroring it is though a cat. 27 00:01:33,475 --> 00:01:35,610 And if the mirroring operation 28 00:01:35,610 --> 00:01:38,890 preserves whatever you're trying to recognize in the picture, 29 00:01:38,890 --> 00:01:43,395 this would be a good data augmentation technique to use. 30 00:01:43,395 --> 00:01:47,035 Another commonly used technique is random cropping. 31 00:01:47,035 --> 00:01:48,725 So given this dataset, 32 00:01:48,725 --> 00:01:50,190 let's pick a few random crops. 33 00:01:50,190 --> 00:01:51,536 So you might pick that, 34 00:01:51,536 --> 00:01:56,442 and take that crop or you might take that, to that crop, 35 00:01:56,442 --> 00:01:59,460 take this, take that crop and so this 36 00:01:59,460 --> 00:02:02,508 gives you different examples to feed in your training sample, 37 00:02:02,508 --> 00:02:04,350 sort of different random crops of your datasets. 38 00:02:04,350 --> 00:02:08,310 So random cropping isn't a perfect data augmentation. 39 00:02:08,310 --> 00:02:14,760 What if you randomly end up taking that crop which will look much like 40 00:02:14,760 --> 00:02:18,110 a cat but in practice and worthwhile so long as 41 00:02:18,110 --> 00:02:21,920 your random crops are reasonably large subsets of the actual image. 42 00:02:21,920 --> 00:02:26,700 So, mirroring and random cropping are frequently used and in theory, 43 00:02:26,700 --> 00:02:29,580 you could also use things like rotation, 44 00:02:29,580 --> 00:02:31,086 shearing of the image, 45 00:02:31,086 --> 00:02:34,233 so that's if you do this to the image, 46 00:02:34,233 --> 00:02:35,883 distort it that way, 47 00:02:35,883 --> 00:02:39,130 introduce various forms of local warping and so on. 48 00:02:39,130 --> 00:02:42,253 And there's really no harm with trying all of these things as well, 49 00:02:42,253 --> 00:02:45,805 although in practice they seem to be used a bit less, 50 00:02:45,805 --> 00:02:48,159 or perhaps because of their complexity. 51 00:02:48,159 --> 00:02:58,345 The second type of data augmentation that is commonly used is color shifting. 52 00:02:58,345 --> 00:03:01,080 So, given a picture like this, 53 00:03:01,080 --> 00:03:04,950 let's say you add to the R, 54 00:03:04,950 --> 00:03:09,783 G and B channels different distortions. 55 00:03:09,783 --> 00:03:12,260 In this example, we are adding to 56 00:03:12,260 --> 00:03:16,410 the red and blue channels and subtracting from the green channel. 57 00:03:16,410 --> 00:03:20,320 So, red and blue make purple. 58 00:03:20,320 --> 00:03:23,360 So, this makes the whole image a bit more purpley and that 59 00:03:23,360 --> 00:03:27,080 creates a distorted image for training set. 60 00:03:27,080 --> 00:03:29,435 For illustration purposes, I'm making 61 00:03:29,435 --> 00:03:32,775 somewhat dramatic changes to the colors and practice, 62 00:03:32,775 --> 00:03:39,720 you draw R, G and B from some distribution that could be quite small as well. 63 00:03:39,720 --> 00:03:43,608 But what you do is take different values of R, 64 00:03:43,608 --> 00:03:46,410 G, and B and use them to distort the color channels. 65 00:03:46,410 --> 00:03:48,480 So, in the second example, 66 00:03:48,480 --> 00:03:50,695 we are making a less red, 67 00:03:50,695 --> 00:03:52,415 and more green and more blue, 68 00:03:52,415 --> 00:03:57,109 so that turns our image a bit more yellowish. 69 00:03:57,109 --> 00:04:01,407 And here, we are making it much more blue, 70 00:04:01,407 --> 00:04:03,155 just a tiny little bit longer. 71 00:04:03,155 --> 00:04:04,868 But in practice, the values R, 72 00:04:04,868 --> 00:04:09,465 G and B, are drawn from some probability distribution. 73 00:04:09,465 --> 00:04:15,370 And the motivation for this is that if maybe the sunlight was 74 00:04:15,370 --> 00:04:20,187 a bit yellow or maybe the in-goal illumination was a bit more yellow, 75 00:04:20,187 --> 00:04:23,730 that could easily change the color of an image, 76 00:04:23,730 --> 00:04:27,745 but the identity of the cat or the identity of the content, 77 00:04:27,745 --> 00:04:30,840 the label y, just still stay the same. 78 00:04:30,840 --> 00:04:35,798 And so introducing these color distortions or by doing color shifting, 79 00:04:35,798 --> 00:04:46,435 this makes your learning algorithm more robust to changes in the colors of your images. 80 00:04:46,435 --> 00:04:54,880 Just a comment for the advanced learners in this course, 81 00:04:54,880 --> 00:04:59,997 that is okay if you don't understand what I'm about to say when using red. 82 00:04:59,997 --> 00:05:04,280 There are different ways to sample R, G, and B. 83 00:05:04,280 --> 00:05:08,790 One of the ways to implement color distortion uses an algorithm called PCA. 84 00:05:08,790 --> 00:05:11,465 This is called Principles Component Analysis, 85 00:05:11,465 --> 00:05:14,345 which I talked about in the 86 00:05:14,345 --> 00:05:22,750 ml-class.org Machine Learning Course on Coursera. 87 00:05:22,750 --> 00:05:29,080 But the details of this are actually given in the AlexNet paper, 88 00:05:29,080 --> 00:05:36,080 and sometimes called PCA Color Augmentation. 89 00:05:36,080 --> 00:05:41,585 But the rough idea at the time PCA Color Augmentation is for example, 90 00:05:41,585 --> 00:05:44,160 if your image is mainly purple, 91 00:05:44,160 --> 00:05:47,540 if it mainly has red and blue tints, 92 00:05:47,540 --> 00:05:49,010 and very little green, 93 00:05:49,010 --> 00:05:52,399 then PCA Color Augmentation, 94 00:05:52,399 --> 00:05:55,120 will add and subtract a lot to red and blue, 95 00:05:55,120 --> 00:05:56,510 where it balance [inaudible] all the greens, 96 00:05:56,510 --> 00:06:01,770 so kind of keeps the overall color of the tint the same. 97 00:06:01,770 --> 00:06:05,390 If you didn't understand any of this, don't worry about it. 98 00:06:05,390 --> 00:06:09,677 But if you can search online for that, 99 00:06:09,677 --> 00:06:13,905 you can and if you want to read about the details of it in the AlexNet paper, 100 00:06:13,905 --> 00:06:18,500 and you can also find some open source implementations of the PCA Color Augmentation, 101 00:06:18,500 --> 00:06:21,685 and just use that. 102 00:06:21,685 --> 00:06:30,010 So, you might have your training data stored in a hard disk and uses symbol, 103 00:06:30,010 --> 00:06:33,705 this round bucket symbol to represent your hard disk. 104 00:06:33,705 --> 00:06:36,000 And if you have a small training set, 105 00:06:36,000 --> 00:06:38,336 you can do almost anything and you'll be okay. 106 00:06:38,336 --> 00:06:42,785 But the very last training set and this is how people will often implement it, 107 00:06:42,785 --> 00:06:52,705 which is you might have a CPU thread that is constantly loading images of your hard disk. 108 00:06:52,705 --> 00:07:00,235 So, you have this stream of images coming in from your hard disk. 109 00:07:00,235 --> 00:07:08,535 And what you can do is use maybe a CPU thread to implement the distortions, 110 00:07:08,535 --> 00:07:11,000 yet the random cropping, 111 00:07:11,000 --> 00:07:13,795 or the color shifting, or the mirroring, 112 00:07:13,795 --> 00:07:16,710 but for each image, 113 00:07:16,710 --> 00:07:21,000 you might then end up with some distorted version of it. 114 00:07:21,000 --> 00:07:22,950 So, let's see this image, 115 00:07:22,950 --> 00:07:28,310 I'm going to mirror it and if you also implement colors distortion and so on. 116 00:07:28,310 --> 00:07:35,120 And if this image ends up being color shifted, 117 00:07:35,120 --> 00:07:41,470 so you end up with some different colored cat. 118 00:07:41,470 --> 00:07:48,395 And so your CPU thread is constantly loading data as well as implementing 119 00:07:48,395 --> 00:07:56,810 whether the distortions are needed to form a batch or really many batches of data. 120 00:07:56,810 --> 00:08:05,045 And this data is then constantly passed to some other thread or some other process for 121 00:08:05,045 --> 00:08:08,815 implementing training and this could be done on the CPU or really 122 00:08:08,815 --> 00:08:14,075 increasingly on the GPU if you have a large neural network to train. 123 00:08:14,075 --> 00:08:17,710 And so, a pretty common way of implementing 124 00:08:17,710 --> 00:08:22,235 data augmentation is to really have one thread, 125 00:08:22,235 --> 00:08:26,540 almost four threads, that is 126 00:08:26,540 --> 00:08:30,635 responsible for loading the data and implementing distortions, 127 00:08:30,635 --> 00:08:32,840 and then passing that to some other thread or 128 00:08:32,840 --> 00:08:35,935 some other process that then does the training. 129 00:08:35,935 --> 00:08:38,435 And often, this and this, 130 00:08:38,435 --> 00:08:39,650 can run in parallel. 131 00:08:39,650 --> 00:08:46,121 So, that's it for data augmentation. 132 00:08:46,121 --> 00:08:49,970 And similar to other parts of training a deep neural network, 133 00:08:49,970 --> 00:08:55,250 the data augmentation process also has a few hyperparameters such as how much 134 00:08:55,250 --> 00:09:00,965 color shifting do you implement and exactly what parameters you use for random cropping? 135 00:09:00,965 --> 00:09:03,500 So, similar to elsewhere in computer vision, 136 00:09:03,500 --> 00:09:06,335 a good place to get started might be to use 137 00:09:06,335 --> 00:09:10,920 someone else's open source implementation for how they use data augmentation. 138 00:09:10,920 --> 00:09:15,640 But of course, if you want to capture more in variances, 139 00:09:15,640 --> 00:09:19,235 then you think someone else's open source implementation isn't, 140 00:09:19,235 --> 00:09:24,230 it might be reasonable also to use hyperparameters yourself. 141 00:09:24,230 --> 00:09:27,980 So with that, I hope that you're going to use data augmentation, 142 00:09:27,980 --> 00:09:31,250 to get your computer vision applications to work better.