In this video, I am going to describe how we might be able to combine recognition of objects using the relationships between their parts with recognition of objects using a deep neural network.

At present in computer vision, there are broadly three approaches to recognizing objects. You can use a deep convolutional neural net, which currently works best. You can use a parts-based approach, which I think in the long run is going to work best. Or you can use the existing features that computer vision people know how to extract from images, make histograms of them, and then use lots of hand engineering; this is the approach that convolutional neural networks have recently beaten. The point of this video is to show how we might combine a parts-based approach with early stages that use convolutional neural networks.

Even though convolutional neural networks have worked very well for recognizing objects in images, I think there's something missing. When we pool the activities of a bunch of replicated feature detectors, we lose the precise position of the feature detector that was most active. This means we don't know exactly where things are, and that's fatal for high-level parts such as the nose and the mouth. In order to recognize whose face it is, you need to use the precise spatial relationships between high-level parts, like noses and mouths. If you overlap the pools so that each feature occurs in several different pools, you retain more information about its position, and that makes things work a bit better, but I don't think that's the answer.

A related problem is that convolutional neural nets that use translations to replicate feature detectors cannot extrapolate their understanding of geometrical relationships to radically new viewpoints, like different orientations or different scales. We could, of course, try replicating across orientation and scale, but then we get huge numbers of replicated feature detectors. Now, people are very good at extrapolating: after seeing a new shape once, they can recognize it from a very different viewpoint. Currently, the way we deal with that in convolutional neural networks is to train them on transformed data.
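As a rough illustration of that strategy (described further just below), here is roughly what generating transformed training data might look like. The use of scipy's ndimage transforms, the angle and shift ranges, and the dummy 28x28 image are all my own illustrative choices, not anything specified in the lecture.

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augmented_copies(image, n, seed=0):
    """Make n transformed copies of one training image by varying its
    orientation and position, so the net sees the same shape many ways."""
    rng = np.random.default_rng(seed)
    copies = []
    for _ in range(n):
        img = rotate(image, angle=rng.uniform(-30, 30), reshape=False)
        img = shift(img, shift=rng.uniform(-2, 2, size=2))
        copies.append(img)
    return copies

# 100 variants of a single (dummy) 28x28 image:
copies = augmented_copies(np.eye(28), n=100)
```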
So this involves giving the network huge training sets, where we try to transform the data through different orientations, and scales, and lightings, and all sorts of other things, so that the network can cope with those variations. But that's a very clumsy way of dealing with the variations.

I think a much better way is to use a hierarchy of coordinate frames, and to use a group of neurons to represent the conjunction of the shape of a feature and its pose relative to the retina. So when these neurons are active, it tells you a feature of that kind is there, like a nose, and the precise activities, or relative activities, of these neurons tell you the pose of the nose.

If you think about representing the pose of something, that's really a relationship between two coordinate frames: a relationship between a coordinate frame embedded in the thing and the coordinate frame of the camera or retina. So, in order to represent the pose of something, we have to embed a coordinate frame within it. Once we've done this and we have a representation of the poses of parts of objects relative to the retina, it's easy to use the relationships between parts to recognize larger objects. So we're going to use the consistency of the poses of the parts as a cue for recognizing a larger shape.

If you look at this picture, we have a nose and we have a mouth, and they're in the right spatial relationship to one another. One way of thinking about that is that if you ask the mouth to predict the pose of the whole face, and if you ask the nose to predict the pose of the whole face, they'll make similar predictions. If you look on the right there, we have the same nose and the same mouth, but now they're in the wrong spatial relationship. And that means that if they separately make predictions about the pose of the whole face, those predictions won't agree at all.

So here are two layers in a hierarchy of parts, where the larger parts can be recognized by consistent predictions from smaller parts. Let's suppose we're looking for a face. In the middle here, the ellipse with Tj in it is a collection of neurons that are going to be used for representing the pose of the face, and the Pj next to it is a single logistic neuron that's going to be used for representing whether or not we think there's a face there.
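As a minimal sketch of that representation: the class name and the numbers below are my own inventions; the lecture only describes a group of pose neurons (the Tj ellipse) paired with a single logistic presence neuron (Pj).

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class PartUnit:
    """A group of neurons for one kind of part: the Tj ellipse plus Pj."""
    pose: np.ndarray       # activities encoding the part's pose (Tj)
    presence_logit: float  # input to the single logistic neuron (Pj)

    @property
    def presence(self) -> float:
        """Probability that a part of this kind is there at all."""
        return 1.0 / (1.0 + np.exp(-self.presence_logit))

face = PartUnit(pose=np.eye(3), presence_logit=2.0)
print(face.presence)       # ~0.88: we think there's probably a face here
```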
We have a similar representation in the layer below, where we have a representation of the pose for a mouth and a representation of the pose for a nose, and we can recognize the face by noticing that those two representations make consistent predictions.

So we take a vector of activities, Ti, that represents the pose of the mouth. We multiply it by a matrix, Tij, that represents the spatial relationship between a mouth and a face, and we get a prediction, Ti times Tij, for the pose of the face. We do the same thing with the nose: we take a vector of neural activities, Th, that represents the pose of the nose, we multiply it by the relationship between a nose and a face, and we get another prediction for the pose of the face. If those two predictions agree, there's a face there, because the nose and the mouth are in the right spatial relationship, and that's very unlikely without there being a face there.

What we're doing here is inverse computer graphics. In computer graphics, if you knew the pose of the face, you could compute the pose of the mouth by using the inverse of Tij, and similarly for the nose. So in computer graphics you're going from the poses of larger things to the poses of their parts. In computer vision, you need to go from the poses of the parts to the poses of larger things, and you need to check consistency when you do that.

Now, if we can get a neural net to represent these poses as vectors of neural activity, then we get a very nice property: spatial relationships can be modelled as linear operations. That makes it very easy to learn hierarchies of visual entities, and it also makes it very easy to generalize across viewpoints. What's going to happen when we make small changes in viewpoint is that the pose vectors, those vectors of neural activities, are all going to change. What's going to be invariant is the weights. It's the weights that represent the relationship between a part and the whole, like Tij on the previous slide, and those don't depend on viewpoint. So we want to get the invariant properties of a shape into the weights, and we want to have the poses in the activities, because when we change viewpoint, all those pose vectors are going to change. So, rather than trying to get neural activities that are invariant to viewpoint, which is what the pooling in a convolutional net is trying to do, we're going to aim for neural activities that are equivariant to viewpoint.
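Here is a minimal numpy sketch of that whole computation. The 3x3 homogeneous matrices, the particular part offsets, and the agreement test are illustrative assumptions on my part; the lecture only specifies that a part's pose times a fixed relationship matrix predicts the whole's pose, and that those weights don't change with viewpoint.

```python
import numpy as np

def pose(tx, ty, theta=0.0):
    """A 2-D pose as a 3x3 homogeneous matrix (rotation plus translation)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]])

# Part-to-whole relationship matrices, playing the role of Tij.
# These are weights: they do not depend on viewpoint.
T_mouth_face = pose(0.0,  0.4)   # a mouth sits 0.4 units below a face centre
T_nose_face  = pose(0.0, -0.1)   # a nose sits 0.1 units above it

def face_there(p_mouth, p_nose, tol=1e-3):
    """Each part predicts the face's pose; a face is there if they agree."""
    return np.allclose(p_mouth @ T_mouth_face, p_nose @ T_nose_face, atol=tol)

# Right spatial relationship -> consistent predictions -> face:
print(face_there(pose(0.0, -0.4), pose(0.0, 0.1)))          # True
# Same parts, but the mouth above the nose -> inconsistent -> no face:
print(face_there(pose(0.0, 0.5), pose(0.0, 0.1)))           # False

# Equivariance: under a new viewpoint (rotate 30 degrees and shift),
# every pose changes, the weights don't, and the agreement survives.
V = pose(1.0, 2.0, theta=np.pi / 6)
print(face_there(V @ pose(0.0, -0.4), V @ pose(0.0, 0.1)))  # still True

# Computer graphics runs the other way: given the face's pose, the
# inverse of the relationship matrix recovers the pose of the mouth.
p_face = V  # the face's pose under the new viewpoint
print(np.allclose(p_face @ np.linalg.inv(T_mouth_face),
                  V @ pose(0.0, -0.4)))                     # True
```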
As the pose of the object varies, the activities of the neurons vary. That means the percept of an object, not its label but what it looks like, is going to change as the viewpoint changes.

I'm going to finish by giving you some evidence that our visual systems really do impose coordinate frames in order to represent shapes. This was pointed out a long time ago by a great psychologist called Irvin Rock. If you look at this shape and I tell you it's a country, most people don't know which country it is. They look at it and they think it looks a bit like Australia, or a sort of mirror image of Australia, but it's not really a familiar country at all. If you tell them where the top is, that it's that way up, they immediately recognize that it's Africa. Once they know what coordinate frame to impose on it, it immediately becomes a familiar shape.

Similarly, if I give you a shape like this one, you can perceive it as a square or you can perceive it as a diamond. Those are two completely different percepts: what you know about the shape is totally different depending on which way you perceive it. For example, if you perceive it as a tilted square, you're acutely sensitive to whether the angles are right angles; if you perceive it as an upright diamond, you're not sensitive to that at all. The angles could be 5 degrees off and you wouldn't notice, but you are sensitive to something else. If you perceive it as an upright diamond, you're acutely sensitive to whether the corner on the left and the corner on the right are at the same height. And you'll probably notice now that, in this figure, they're at very slightly different heights.

These kinds of demonstrations are evidence that, in order to represent shapes, we impose coordinate frames on them. When you're looking at that square or diamond, it's the same thing you're looking at, but the percept is totally different depending on what coordinate frame you impose.
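To make the square/diamond point concrete, here is a small numpy sketch; the vertex coordinates and the size of the imperfection are my own illustrative choices. The same four points support two different sets of judgements, depending on which coordinate frame you impose.

```python
import numpy as np

# The same four points, with the right-hand corner very slightly raised
# (the small imperfection mentioned at the end of the lecture).
pts = np.array([[0.0, 1.0], [1.0, 0.03], [0.0, -1.0], [-1.0, 0.0]])

# Diamond frame: the salient property is whether the left and right
# corners are at the same height.
print("height difference:", pts[1, 1] - pts[3, 1])   # 0.03 -- easy to spot

# Tilted-square frame: the salient property is whether the corners
# are right angles.
sides = np.roll(pts, -1, axis=0) - pts
for a, b in zip(sides, np.roll(sides, -1, axis=0)):
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    print("corner angle:", round(np.degrees(np.arccos(cos)), 2))  # near 90
```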