It was always a matter of speculation whether the kinds of nets developed for recognizing handwritten digits could actually be scaled up to what vision people call a real task: that is, recognizing objects in high-resolution color images when the scene is cluttered, so that you have to do things like segmentation, you have to deal with 3D viewpoint, you have to deal with varying lighting, and you have to deal with many different objects surrounding the one of interest, so you're not quite sure which is the intended one, and so on.

Since the start of this course, we've got some interesting new results on that. In my first lecture, I described the network developed by Alex Krizhevsky and showed that it was good at object recognition, but at that point it hadn't been benchmarked against the best computer vision systems. Now it has.

People worked on MNIST for many years, gradually improving the ability of these networks to recognize handwritten digits. Many computer vision researchers thought this was a waste of time if you wanted to be able to recognize real objects in color images, because they thought the lessons learned from MNIST would not generalize to that domain. That was a fairly reasonable thing to think, and here are a number of reasons why it's a much more difficult task. First of all, there are many, many more different kinds of objects. Even if we only recognize a thousand classes, that's still a factor of a hundred. Secondly, there are many more pixels: even if we use down-sampled images that are only 256 by 256 with color pixels, that's still something like 100 to 300 times as many pixels. Another factor is that in real scenes you have to deal with the fact that you've got a two-dimensional image of a three-dimensional reality, so a lot of information has been lost. And real scenes have clutter of a kind that doesn't occur in handwriting. In handwriting you can have overlapping letters, and that requires segmentation, but you don't have things like occlusion of large parts of objects by other, opaque objects. You don't have many different kinds of objects in the same scene. And you don't have all the lighting variations that you get in real scenes.

So the question is: will the same kind of convolutional neural network that proved to be so good at recognizing handwritten digits work for real color images? In the domain of real color images, we probably do need to wire in some prior knowledge.
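As a brief aside, here is the quick arithmetic behind those two ratios (the factor of a hundred in classes, and roughly 100 to 300 times as much input, depending on whether you count the three color channels), worked out in Python purely for concreteness.

```python
# Rough comparison of MNIST digits with the down-sampled ImageNet images
# described above (256x256 RGB); purely illustrative arithmetic.

mnist_classes, imagenet_classes = 10, 1000
print("class ratio:", imagenet_classes // mnist_classes)       # a factor of 100

mnist_pixels = 28 * 28                 # 784 grayscale pixels
imagenet_pixels = 256 * 256            # 65,536 pixel locations
imagenet_values = 256 * 256 * 3        # 196,608 values once you count the RGB channels

print("pixel-location ratio:", round(imagenet_pixels / mnist_pixels))  # ~84
print("input-value ratio:", round(imagenet_values / mnist_pixels))     # ~251
```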
Because if we try and do it in the Ciresan way, with no knowledge wired in, putting in all the knowledge by generating extra training examples, the computational problem is still too large for current computers.

So there was a recent competition, and it was on a database called ImageNet. ImageNet actually has many more than a million images, but a subset of 1.2 million was chosen, and the classification task was to correctly label those images. Now, the images were hand-labelled with a thousand different classes, but this wasn't very reliable: there could be an image that has two of those thousand different objects in it and only one of them is labeled. So, to make the task feasible, the computer vision system is allowed to make five bets, and it's said to get it right if one of those bets corresponds to the label that a person has given the image.

There's also a localization task. The reason for the localization task is that many computer vision systems use a bag-of-features approach. For the whole image, or for, say, a quadrant of the image, they know what the features are, but they don't know where they are. This allows them to recognize objects without knowing exactly where they are. That's very unlike how people behave, except for people with a curious kind of brain damage called Balint's syndrome, who can recognize objects but not be sure where they are. So for the localization task you have to place a box around an object once you've recognized it, and to get it right your box must have at least a 50% overlap with the correct box.
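As a minimal sketch of those two scoring rules (my own illustration, not the competition's official code), here is how the "five bets" criterion and a 50% box-overlap check might be computed. I'm assuming overlap is measured as intersection-over-union, which may differ slightly from the exact rule the competition used.

```python
import numpy as np

def correct_within_five_bets(scores, true_label):
    """True if the hand-given label is among the network's five highest-scoring bets."""
    top5 = np.argsort(scores)[::-1][:5]
    return true_label in top5

def box_overlap(box_a, box_b):
    """Overlap between two boxes given as (x_min, y_min, x_max, y_max), measured
    here as intersection-over-union (an assumption, not the official definition)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A predicted box counts as correct if its overlap with the true box is at least 0.5.
print(correct_within_five_bets(np.random.rand(1000), true_label=42))
print(box_overlap((0, 0, 10, 10), (5, 5, 15, 15)) >= 0.5)
```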
On this task, people tried some of the best existing computer vision methods. Leading groups from Oxford, the French national research labs (INRIA), Xerox's European research center, and various other universities tried this task and discovered it's very hard. The computer vision systems typically use complicated multi-stage pipelines. The early stages of these systems are typically hand-tuned by optimizing a few parameters using some of the data, and the top stage of these systems is always a learning algorithm. But they don't learn all the way through, in the way that a deep neural net does when it's trained with backpropagation. They don't have end-to-end learning, where the parameters used in the early feature detectors are influenced by how useful they are for making the final decision about classes.

So here are some examples from the test set to show you what the data is like. You already saw some examples in the first lecture, but here are some more. You can see that it's fairly obvious what the object is in that image, but a lot of it is missing: it doesn't have ears, it doesn't have legs. The predictions are the un-normalized probabilities from Alex Krizhevsky's deep neural network, and you can see it's confident that that is a cheetah, and that if it's not a cheetah, it's almost certainly a leopard. It also understands there are other possibilities, like a snow leopard (though it's the wrong color for a snow leopard) or an Egyptian cat.

Here's an example the other way around: here there are many objects in the image, and the object of interest is only a very small fraction of the pixels. The network correctly says bullet train, but it also has other bets, like subway train or electric locomotive, which are reasonable bets. If you look at the image, there are lots of other things that could be labeled, like the roof, which occupies a much larger fraction of the image than the train, or the pillar that's supporting the roof, or the pedestrian, or the large apartment block in the background. In these kinds of images you really have to be able to cope with the fact that there are lots of alternative targets.

The last image shows a different kind of example, where there is no background clutter. The object is quite well isolated, probably a picture from a catalog or something. The network doesn't get it right with its first bet, but it does get it in its top five bets. Here, though, the network isn't confident about anything. These are the relative probabilities, and the network correctly realizes it doesn't really know. And if you look at the other possibilities, they're all perfectly plausible. If you screw your eyes up so you can't see the image too well, you can see how it might think it was a frying pan or a stethoscope.
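As a small aside (my own illustration, not Krizhevsky's actual code), here is one common way to turn a vector of un-normalized scores into the kind of relative probabilities shown in these slides, using a softmax, and to read off the top bets. The class names and score values below are made up.

```python
import numpy as np

def softmax(scores):
    """Turn un-normalized scores into probabilities that sum to one."""
    z = scores - scores.max()      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical un-normalized scores for four of the thousand classes.
class_names = ["cheetah", "leopard", "snow leopard", "Egyptian cat"]
scores = np.array([9.1, 6.3, 3.2, 2.0])

for name, p in sorted(zip(class_names, softmax(scores)), key=lambda t: -t[1]):
    print(f"{name:>14s}: {p:.3f}")
```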
So how did the systems do on this data? Here are the error rates for the computer vision systems. One thing you'll notice is that the best systems are all very similar. The University of Tokyo managed to get 26.1%, and here what I'm doing is just reporting the best system from each group. Oxford University, which has a very good computer vision group, generally recognized to be possibly the best group in Europe, again got error rates in the 26 percent range, and the French national research labs together with Xerox's European research center, which again have very good computer vision groups, got 27%. So you can guess from this that it's going to be hard to beat 26%, and if you do beat 26% you're comparable with the very best computer vision systems. Alex Krizhevsky's neural net got sixteen percent error. That's a huge gap; normally in these competitions you don't see big gaps like that.

So Alex Krizhevsky's network works like this. It's a very deep convolutional neural net of the type pioneered by Yann LeCun, which was first used for digit recognition; Yann later applied it to recognizing real objects. It uses all the lessons learned by Yann's group, by [UNKNOWN]'s group, and by various other groups in developing these deep neural nets for doing real vision. It has seven hidden layers, which is deeper than usual, and that's not counting some of the max-pooling layers. The early layers are convolutional. We could probably get away with using just local receptive fields, without tying any weights, if we had a much bigger computer. But by making them convolutional, you cut down the number of parameters a lot, so you cut down the amount of training data you need a lot, which cuts down the amount of computation time a lot.

The last two layers were globally connected, and that's where most of the parameters are: I think there are about sixteen million parameters between each pair of those layers. What the last two layers are doing is looking for combinations of the local features that were extracted by the early layers, and obviously there are combinatorially many combinations to look for; that's why you need a lot of parameters there.

The activation functions were rectified linear units in every hidden layer. These train much faster than logistic units and they're more expressive. Most of the people seriously applying deep neural networks to real images for object recognition have now switched to rectified linear units. We also used competitive normalization within a layer, to suppress the activity of a unit if other units looking at nearby localities are very active. This helps a lot with variations in intensity. So, you might have an edge detector which gets somewhat active due to some fairly faint edge, and that's pretty much irrelevant if there are much more intense things around.
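To make the shape of that architecture concrete, here is a rough sketch in PyTorch of the kind of network just described: several convolutional layers with rectified linear units, a form of local response ("competitive") normalization, max-pooling, and two big globally connected layers at the top. The specific layer sizes, filter sizes, and channel counts are illustrative guesses on my part, not Krizhevsky's exact configuration.

```python
import torch
import torch.nn as nn

class RoughKrizhevskyStyleNet(nn.Module):
    """Illustrative only: seven hidden layers (five convolutional, two globally
    connected), not counting the max-pooling layers."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
            nn.LocalResponseNorm(size=5),      # competitive normalization
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
            nn.LocalResponseNorm(size=5),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(        # the globally connected layers
            nn.Dropout(0.5),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))

net = RoughKrizhevskyStyleNet()
logits = net(torch.randn(1, 3, 224, 224))       # one 224x224 RGB crop
print(logits.shape)                             # torch.Size([1, 1000])
```

Note that a 4096-by-4096 globally connected layer on its own has about 16.8 million weights, which is the order of magnitude of the "about sixteen million parameters between each pair of those layers" mentioned above.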
There are other tricks that we used to significantly improve the generalization of this net. First of all, we used the trick of enhancing the data by using transformations. The images in the competition were down-sampled to 256 by 256, but instead of using those whole images, Alex Krizhevsky took random 224 by 224 patches from them, which gave him hugely more images to train on and helped him deal with translation invariance; even though they're convolutional nets, that's still a help. He also used left-right reflections of the images, which again doubled the amount of data. He didn't use up-down reflections, because gravity is very important: left-right reflections don't really change what things look like much, unless they're things like writing. At test time, he doesn't just use one patch. He uses a number of different patches: the four corners and the middle, which gives him five, and then the left-right reflections of all those, which gives him ten. He runs all ten through the network and then combines their opinions.

In the top layers, where most of the parameters are, he uses a new regularization technique called dropout, which is very effective and stops the network over-fitting. That's worth several percent in his results. I'll describe dropout at some length in a later lecture, but for now the basic idea is that each time you present a training example, you omit half the hidden units from a layer. This means that the other hidden units in that layer, the survivors, can't rely on their comrades being present. They can't learn to fix up the errors left over by the other hidden units in that layer, because the other hidden units might not be there, and then they'd be fixing up an error that doesn't exist. So they have to become more individualist: they have to individually do useful things, but they still have to do useful things that are different from what the other survivors do. So dropout is stopping too much co-operation between the hidden units, and a lot of co-operation is very good for fitting the training data. But if the test distribution is significantly different, then all that co-operation causes over-fitting.
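Here is a minimal sketch (my own, not Alex Krizhevsky's code) of the data-handling tricks described above: random 224 by 224 crops with left-right flips at training time, the ten fixed crops at test time, and a bare-bones version of dropping half the hidden units.

```python
import numpy as np

def random_training_crop(image, crop=224):
    """image: an HxWx3 array, e.g. 256x256x3. Returns one randomly placed crop,
    reflected left-right half the time (no up-down flips, because gravity matters)."""
    h, w, _ = image.shape
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    return patch[:, ::-1] if np.random.rand() < 0.5 else patch

def ten_test_crops(image, crop=224):
    """The four corner crops plus the central crop, and the mirror image of each.
    Run all ten through the network and combine (e.g. average) its outputs."""
    h, w, _ = image.shape
    corners = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    patches = [image[t:t + crop, l:l + crop] for t, l in corners]
    return patches + [p[:, ::-1] for p in patches]

def dropout_half(hidden, training=True):
    """Omit each hidden unit with probability 0.5 during training; at test time
    keep all units but halve their activities so the expected input to the next
    layer stays the same."""
    if not training:
        return hidden * 0.5
    return hidden * (np.random.rand(*hidden.shape) < 0.5)

image = np.random.rand(256, 256, 3)     # stand-in for one down-sampled training image
print(random_training_crop(image).shape, len(ten_test_crops(image)))
```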
Alex couldn't have done this work without significant hardware, but the hardware only costs a few thousand dollars now. Alex is a very good programmer, and he used a very efficient implementation of convolutional neural nets on two NVIDIA GTX 580 graphics processors. Each of these has over 500 fast little cores, which are very good at doing arithmetic and not much good at anything else. GPUs are very good at doing matrix-matrix multiplies. So if you stack together the vectors of activities of a hidden layer over many training cases, that gives you a matrix, and you multiply that by the matrix of weights to figure out the activities in the next hidden layer for all those training cases. If both those matrices are big, the GPUs give you a huge advantage: about a factor of 30. They also have very high bandwidth to memory, and that's needed for neural nets, because in neural nets you keep wanting to fetch another weight so that you can multiply it by an activity, and there are millions of these weights, so you can't keep them all in the cache.
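Here is a tiny sketch of that point about matrix-matrix multiplies: stacking the activity vectors of a hidden layer over a batch of training cases gives a matrix, and a single matrix multiply then computes the next layer's activities for all of those cases at once. This is plain numpy on a CPU; the same operation is what the GPU accelerates.

```python
import numpy as np

batch_size, n_hidden, n_next = 128, 4096, 4096
activities = np.random.rand(batch_size, n_hidden)        # one row per training case
weights = np.random.randn(n_hidden, n_next) * 0.01       # illustrative random weights
biases = np.zeros(n_next)

# One matrix-matrix multiply gives the next layer's inputs for the whole batch;
# the rectification implements the rectified linear units.
next_activities = np.maximum(0.0, activities @ weights + biases)
print(next_activities.shape)                             # (128, 4096)
```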
Using all that hardware, he could train his final network in a week, and he could combine the results from ten different patches at test time very quickly, so at test time you can run it at just about frame rate. In the future we're going to be able to spread this kind of network over a large number of cores as cores become cheaper; people at Google are already experimenting with that, and if we can communicate the states fast enough we're going to be able to do much bigger networks on many more cores. Google has already simulated networks with 1.7 billion connections, and I think that's only going to get bigger.

As the cores get cheaper and the data sets get bigger, these big deep neural nets are going to improve much faster than the old-fashioned computer vision systems, because they don't involve much hand engineering, and they can make very good use of huge data sets and huge amounts of computation. So the fact that we've already opened up a big gap, I think, means there's no looking back. I think from now on all the best object recognition systems, at least for static images, will use big deep neural nets.

There are other application domains where we've learned the same lesson. Vladimir Mnih used a net with local receptive fields, but without convolution, to extract roads from aerial images. These are cluttered aerial images of urban scenes. Again, he uses multiple layers of rectified linear units. He takes a relatively large image patch and predicts, for the central 16 by 16 pixels, whether each of those pixels is a piece of road or not a piece of road. The nice thing about this task is that there's a lot of labeled training data available. That's because maps tell you where the centre lines of roads are, and roads are roughly fixed width, so from the vectors in the map that tell you where the centre line of the road is, you can estimate which pixels are probably road.

Nevertheless, the task is very hard. There are the normal kinds of vision problems: roads are occluded by buildings, because the plane isn't looking straight down when it takes the photograph; they're occluded by trees; they're also occluded by cars that are sitting on the road. There are shadow effects from buildings, there are major lighting changes depending on whether it's a sunny day or a cloudy day, for example, and there are minor viewpoint changes. The plane is basically looking downwards, but in any large photo it can't be looking straight downwards at every pixel.

The worst problems in this data are the incorrect labels. You get incorrect labels because the maps aren't perfectly registered. For most purposes, you don't need a map to be registered better than a few meters. The pixels are about one meter square in this data, so if the registration of the map is off by three meters, you're going to get at least three of the pixel labels wrong across every road. Another severe problem is that the people making maps have to make arbitrary decisions about what counts as a road and what counts as a laneway. So in many of the maps, you look at something and you've no idea whether it's going to be considered a road or a laneway, and so you simply don't know what label it's going to get from the map. Big neural nets trained on big image patches, using millions of examples, are, I think, the only real hope for doing a good job at this task. It's very hard to find out how well people can do at it.
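Here is a minimal sketch of that training setup: take a large context patch from the aerial image and predict road versus not-road for only its central 16 by 16 pixels, with the per-pixel labels obtained by thickening the map's road centrelines. The 64 by 64 context size and the crude fixed-width rasterization are my assumptions for illustration, not the exact values used in the actual system.

```python
import numpy as np

def road_mask_from_centerline(shape, centerline_pixels, half_width=3):
    """Approximate per-pixel road labels by thickening the map's centreline,
    assuming pixels are about one meter square and roads are roughly fixed width."""
    mask = np.zeros(shape, dtype=np.uint8)
    for r, c in centerline_pixels:
        mask[max(0, r - half_width):r + half_width + 1,
             max(0, c - half_width):c + half_width + 1] = 1
    return mask

def training_example(image, road_mask, top, left, context=64, target=16):
    """Input: a context x context patch. Target: road labels for its central target x target pixels."""
    patch = image[top:top + context, left:left + context]
    m = (context - target) // 2
    labels = road_mask[top + m:top + m + target, left + m:left + m + target]
    return patch, labels

image = np.random.rand(500, 500)               # stand-in for one aerial image
centerline = [(250, c) for c in range(500)]    # a horizontal road taken from the map vectors
mask = road_mask_from_centerline(image.shape, centerline)
patch, labels = training_example(image, mask, top=220, left=100)
print(patch.shape, labels.shape, labels.mean())   # (64, 64) (16, 16) fraction of road pixels
```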
So, here is what the data looks like. This is a part of Toronto; if you know Toronto, you can tell that by the angle of the roads. Above the image of that part of Toronto, I've put two patches extracted from it, and if you look at those patches, you can see it's not trivial to tell which the road pixels are. On the right is the output of Vladimir Mnih's system. Green marks correctly identified road pixels, and red marks things that the system thought might be road but actually aren't. That particular thing is a parking lot, but you can see why the system might have thought it was a road.