Deep learning has been successfully applied to computer vision, natural language processing, speech recognition, online advertising, logistics, and many, many other problems. There are a few things that are unique about the application of deep learning to computer vision, about the status of computer vision. In this video, I will share with you some of my observations about deep learning for computer vision, and I hope that will help you better navigate the literature and the set of ideas out there, and how you build these systems yourself for computer vision.

You can think of most machine learning problems as falling somewhere on a spectrum between where you have relatively little data and where you have lots of data. For example, I think that today we have a decent amount of data for speech recognition, relative to the complexity of the problem. And even though there are reasonably large data sets today for image recognition or image classification, because image recognition is just a complicated problem, to look at all those pixels and figure out what it is, it feels like even though the online data sets are quite big, like over a million images, we still wish we had more data. And there are some problems, like object detection, where we have even less data.

Just as a reminder, image recognition is the problem of looking at a picture and telling you, is this a cat or not. Whereas in object detection, you look at a picture and actually put bounding boxes telling you where in the picture the objects, such as cars, are. Because getting the bounding boxes is just more expensive, it costs more to label the objects and the bounding boxes, so we tend to have less data for object detection than for image recognition. Object detection is something we'll discuss next week.

So if you look across a broad spectrum of machine learning problems, you see on average that when you have a lot of data, you tend to find people getting away with using simpler algorithms as well as less hand-engineering.
There's just less need to carefully design features for the problem; instead you can have a giant neural network, even a simpler architecture, and let the network just learn whatever it wants to learn when you have a lot of data. Whereas, in contrast, when you don't have that much data, on average you see people engaging in more hand-engineering. And if you want to be ungenerous, you can say there are more hacks. But I think when you don't have much data, hand-engineering is actually the best way to get good performance.

When I look at machine learning applications, I think usually the learning algorithm has two sources of knowledge. One source of knowledge is the labeled data, really the (x, y) pairs you use for supervised learning. And the second source of knowledge is the hand-engineering. There are lots of ways to hand-engineer a system, from carefully hand-designing the features, to carefully hand-designing the network architectures, to maybe other components of your system. So when you don't have much labeled data, you just have to call more on hand-engineering.

I think computer vision is trying to learn a really complex function, and it often feels like we don't have enough data for computer vision. Even though data sets are getting bigger and bigger, often we just don't have as much data as we need. This is why the field of computer vision, historically and even today, has relied more on hand-engineering. And I think this is also why the field of computer vision has developed rather complex network architectures: in the absence of more data, the way to get good performance is to spend more time architecting, or fooling around with, the network architecture.

In case you think I'm being derogatory of hand-engineering, that's not at all my intent. When you don't have enough data, hand-engineering is a very difficult, very skillful task that requires a lot of insight. Someone who is insightful with hand-engineering will get better performance, and it is a great contribution to a project to do that hand-engineering when you don't have enough data. It's just that when you have lots of data, I wouldn't spend time hand-engineering; I would spend time building up the learning system instead.
But I think historically the field of computer vision has used very small data sets, and so historically the computer vision literature has relied on a lot of hand-engineering. Even though in the last few years the amount of data available for computer vision tasks has increased dramatically, and I think that has resulted in a significant reduction in the amount of hand-engineering being done, there's still a lot of hand-engineering of network architectures in computer vision. This is why you see very complicated hyperparameter choices in computer vision, more complex than in a lot of other disciplines. And in fact, because you usually have smaller object detection data sets than image recognition data sets, when we talk about object detection next week, you'll see that the algorithms become even more complex and have even more specialized components.

Fortunately, one thing that helps a lot when you have little data is transfer learning. For the example from the previous slide, the Tigger, Misty, neither detection problem, you have so little data that transfer learning will help a lot. So that's another set of techniques that's used a lot when you have relatively little data.

If you look at the computer vision literature and the sort of ideas out there, you also find that people are really enthusiastic. They're really into doing well on standardized benchmark data sets and on winning competitions. And for computer vision researchers, if you do well on the benchmarks, it's easier to get a paper published, so there's just a lot of attention on doing well on these benchmarks. The positive side of this is that it helps the whole community figure out what the most effective algorithms are. But you also see in papers people do things that allow you to do well on a benchmark, but that you wouldn't really use in a production system that you deploy in an actual application.

So here are a few tips on doing well on benchmarks. These are things that I myself pretty much never use if I'm putting a system into production to actually serve customers.

One is ensembling. What that means is, after you've figured out what neural network you want, train several neural networks independently and average their outputs. So, initialize say 3, or 5, or 7 neural networks randomly, train up all of these neural networks, and then average their outputs. And by the way, it is important to average their outputs, the y-hats; don't average their weights, that won't work. So you take, say, seven neural networks that make seven different predictions, and average those. This will cause you to do a little bit better on some benchmark, maybe sometimes as much as 1 or 2% better, which can really help win a competition. But because ensembling means that to test on each image you might need to run it through anywhere from, say, 3 to 15 different networks, which is quite typical, this slows down your running time by a factor of 3 to 15, or sometimes even more. So ensembling is one of those tips people use for doing well on benchmarks and for winning competitions, but that I think is almost never used in production to serve actual customers, unless you have a huge computational budget and don't mind burning a lot more of it per customer image.
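To make that concrete, here is a minimal sketch of output averaging in PyTorch. The make_model() and train() helpers in the usage comment are hypothetical stand-ins for whatever architecture and training loop you are using; the key point from the lecture is only the averaging of the y-hats.

```python
import torch

def ensemble_predict(models, x):
    # Average the predicted probabilities (the y-hats) of several
    # independently trained networks. Average the outputs, never
    # the weights; averaging the weights won't work.
    with torch.no_grad():
        probs = [torch.softmax(model(x), dim=1) for model in models]
    return torch.stack(probs).mean(dim=0)

# Hypothetical usage: train, say, 5 copies of the same architecture
# from different random initializations, then average at test time.
# models = [train(make_model()) for _ in range(5)]
# y_hat = ensemble_predict(models, test_batch)
```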
Another thing you see in papers that really helps on benchmarks is multi-crop at test time. You've seen how you can do data augmentation, and multi-crop is a form of applying data augmentation to your test image as well. So, for example, take a cat image together with its mirrored version. There's a technique called 10-crop, which basically says: take the central region, that crop, and run it through your classifier. Then take the crop at the upper left-hand corner and run it through the classifier, the upper right-hand corner shown in green, the lower left shown in yellow, the lower right shown in orange, and run those through the classifier. Then do the same thing with the mirrored image: take the central crop, then take the four corner crops. So that's one central crop on the original and on the mirror, and four corner crops on each. If you add these up, that's 10 different crops, hence the name 10-crop. And so what you do is run these 10 images through your classifier and then average the results.
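As a sketch of what this looks like in practice: torchvision happens to ship a TenCrop transform that produces exactly these ten crops, the four corners plus the center of the image and of its horizontal mirror. The 256/224 sizes and the file name in the usage comment are just illustrative assumptions.

```python
import torch
from torchvision import transforms

# TenCrop returns the four corner crops and the center crop of the
# image, plus the same five crops of its horizontal mirror: 10 total.
ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.ToTensor()(c) for c in crops])),
])

def ten_crop_predict(model, image):
    crops = ten_crop(image)                    # shape: (10, 3, 224, 224)
    with torch.no_grad():
        probs = torch.softmax(model(crops), dim=1)
    return probs.mean(dim=0)                   # average the 10 predictions

# Hypothetical usage, on a PIL image:
# from PIL import Image
# y_hat = ten_crop_predict(model, Image.open("cat.jpg"))
```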
So, if you have the computational budget, you could do this. Maybe you don't need as many as 10 crops; you can use a few crops, and this might get you a little bit better performance in a production system. By production I mean a system you're deploying for actual users. But this is another technique that is used much more for doing well on benchmarks than in actual production systems.

One of the big problems of ensembling is that you need to keep all these different networks around, and that just takes up a lot more computer memory. For multi-crop, at least you keep just one network around, so it doesn't suck up as much memory, but it still slows down your run time quite a bit.

So, these are tips you see, and research papers refer to these tips as well. But I personally do not tend to use these methods when building production systems, even though they are great for doing better on benchmarks and for winning competitions.

Because a lot of computer vision problems are in the small data regime, others have done a lot of hand-engineering of the network architectures. And a neural network architecture that works well on one vision problem, maybe surprisingly, often works well on other vision problems as well. So, to build a practical system, you often do well starting off with someone else's neural network architecture. And use an open source implementation if possible, because the open source implementation might have figured out all the finicky details, like the learning rate decay schedule and other hyperparameters. And finally, someone else may have spent weeks training a model on half a dozen GPUs and on over a million images. So by using someone else's pretrained model and fine-tuning on your data set, you can often get going much faster on an application.
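Here is a minimal sketch of that fine-tuning workflow, assuming PyTorch with torchvision 0.13 or later for the pretrained-weights API; the three output classes are just the Tigger, Misty, neither example from before.

```python
import torch.nn as nn
from torchvision import models

# Load a network someone else already spent weeks training
# on over a million ImageNet images.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze the pretrained layers; with little data, train only the head.
for param in model.parameters():
    param.requires_grad = False

# Swap the final layer for your own task, e.g. three classes:
# Tigger, Misty, or neither.
model.fc = nn.Linear(model.fc.in_features, 3)

# Then train as usual; only the new layer's weights get updated.
# optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```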
But of course, if you have the compute resources and the inclination, don't let me stop you from training your own networks from scratch. And in fact, if you want to invent your own computer vision algorithm, that's what you might have to do.

So, that's it for this week. I hope that seeing a number of computer vision architectures helps you get a sense of what works. In this week's programming exercises you'll actually learn another programming framework and use it to implement ResNets. So, I hope you enjoy that exercise, and I look forward to seeing you.