In this video, I am going to describe how we might be able to combine recognition of objects using the relationships between their parts with recognition of objects using a deep neural network.

At present in computer vision, there are broadly three approaches to recognizing objects. You can use a deep convolutional neural net, and this currently works best. You can use a parts-based approach, which I think in the long run is going to work best. Or you can use the existing features that computer vision people know how to extract from images, make histograms of them, and then do lots of hand engineering. This is the approach that convolutional neural networks have recently beaten. The point of this video is to show how we might combine a parts-based approach with early stages that use convolutional neural networks.

Even though convolutional neural networks have worked very well for recognizing objects in images, I think there's something missing. When we pool the activities of a bunch of replicated feature detectors, we lose the precise position of the feature detector that was most active. This means we don't know exactly where things are, and that's fatal for high-level parts, such as noses and mouths. In order to recognize whose face it is, you need to use the precise spatial relationships between high-level parts, like noses and mouths. If you overlap the pools so that each feature occurs in several different pools, you retain more information about its position, and that makes things work a bit better, but I don't think that's the answer.

A related problem is that convolutional neural nets that use translations to replicate feature detectors cannot extrapolate their understanding of geometrical relationships to radically new viewpoints, like different orientations or different scales. We could, of course, try replicating across orientation and scale, but then we get huge numbers of replicated feature detectors. Now, people are very good at extrapolating: after seeing a new shape once, they can recognize it from a very different viewpoint. Currently, the way we deal with that in convolutional neural networks is to train them on transformed data. This involves giving them huge training sets, where we transform the data through different orientations, scales, lightings, and all sorts of other things, so that the network can cope with those variations. But that's a very clumsy way of dealing with the variations.

I think a much better way is to use a hierarchy of coordinate frames, and to use a group of neurons to represent the conjunction of the shape of a feature and its pose relative to the retina. When these neurons are active, they tell you a feature of that kind is there, like a nose, and the precise activities, or relative activities, of these neurons tell you the pose of the nose. If you think about representing the pose of something, that's really a relationship between two coordinate frames: a coordinate frame embedded in the thing, and the coordinate frame of the camera or retina. So, in order to represent the pose of something, we have to embed a coordinate frame within it. Once we've done this and we have a representation of the poses of parts of objects relative to the retina, it's easy to use the relationships between parts to recognize larger objects. We're going to use the consistency of the poses of the parts as a cue for recognizing a larger shape.
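To make the pooling problem mentioned above concrete, here is a minimal numpy sketch; the feature map and pool size are made up for illustration. Two feature maps whose strongest activation sits at different positions within the same pool produce identical pooled outputs, so the precise position of the most active detector is thrown away.

```python
import numpy as np

# A toy 1-D feature map: the detector fires strongly at index 2.
acts = np.array([0.1, 0.2, 0.9, 0.3, 0.1, 0.0, 0.1, 0.2])

# Non-overlapping max pooling with pool size 4.
print(acts.reshape(-1, 4).max(axis=1))      # [0.9 0.2]

# Shift the strong activation by one position within its pool...
shifted = np.array([0.1, 0.9, 0.2, 0.3, 0.1, 0.0, 0.1, 0.2])
print(shifted.reshape(-1, 4).max(axis=1))   # [0.9 0.2] -- identical

# The pooled output is unchanged, so downstream layers cannot tell
# where inside the pool the feature actually was.
```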
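And here is a hedged sketch of what "embedding a coordinate frame" might look like computationally. I'm assuming we summarize a 2-D pose (position, orientation, and scale relative to the retina) as a 3×3 homogeneous matrix; the `pose` function and the numbers are illustrative, not anything from the lecture slides.

```python
import numpy as np

def pose(tx, ty, theta, scale):
    """A 3x3 homogeneous matrix mapping a part's own coordinate frame
    into retinal coordinates: rotate by theta, scale uniformly,
    translate by (tx, ty)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[scale * c, -scale * s, tx],
                     [scale * s,  scale * c, ty],
                     [0.0,        0.0,       1.0]])

# A hypothetical nose: centred at (10, 4) on the retina,
# rotated 0.1 radians, at twice its canonical size.
nose_pose = pose(10.0, 4.0, 0.1, 2.0)

# A point expressed in the nose's own frame (homogeneous coordinates)...
tip_in_nose_frame = np.array([0.0, 1.0, 1.0])

# ...lands on the retina via a single matrix multiply.
print(nose_pose @ tip_in_nose_frame)
```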
If you look at this picture, we have a nose and a mouth in the right spatial relationship to one another. One way of thinking about that is that if you ask the mouth to predict the pose of the whole face, and you ask the nose to predict the pose of the whole face, they'll make similar predictions. If you look on the right there, we have the same nose and the same mouth, but now they're in the wrong spatial relationship. That means that if they separately make predictions about the pose of the whole face, those predictions won't agree at all.

So here are two layers in a hierarchy of parts, where the larger parts can be recognized by consistent predictions from smaller parts. Let's suppose we're looking for a face. In the middle here, the ellipse with Tj in it is a collection of neurons that are going to be used for representing the pose of the face, and the Pj next to it is a single logistic neuron that's going to be used for representing whether or not we think there's a face there. We have a similar representation in the layer below, where we have a representation of the pose of a mouth and a representation of the pose of a nose, and we can recognize the face by noticing that those two representations make consistent predictions. We take the vector of activities, Ti, that represents the pose of the mouth, and multiply it by the matrix Tij that represents the spatial relationship between a mouth and a face, and we get a prediction, Ti × Tij, for the pose of the face. We do the same thing with the nose: we take the vector of neural activities, Th, that represents the pose of the nose, multiply it by the matrix that represents the relationship between a nose and a face, and we get another prediction for the pose of the face. If those two predictions agree, there's a face there, because the nose and the mouth being in the right spatial relationship is very unlikely without there being a face there.

What we're doing here is inverse computer graphics. In computer graphics, if you knew the pose of the face, you could compute the pose of the mouth by using the inverse of Tij, and similarly for the nose. So in computer graphics you go from the poses of larger things to the poses of their parts. In computer vision, you need to go from the poses of the parts to the poses of larger things, and you need to check consistency when you do that.

Now, if we can get a neural net to represent these poses as vectors of neural activity, we get a very nice property: spatial relationships can be modelled as linear operations. That makes it very easy to learn hierarchies of visual entities, and it also makes it very easy to generalize across viewpoints. What's going to happen when we make small changes in viewpoint is that the pose vectors, those vectors of neural activities, are all going to change. What's going to be invariant is the weights. It's the weights that represent the relationship between a part and the whole, like Tij on the previous slide, and those don't depend on viewpoint. So we want to get the invariant properties of a shape into the weights, and we want to have the poses in the activities, because when we change viewpoint, all those pose vectors are going to change. Rather than trying to get neural activities that are invariant to viewpoint, which is what the pooling in a convolutional net is trying to do, we're going to aim for neural activities that are equivariant to viewpoint: as the pose of the object varies, the activities of the neurons vary.
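The prediction-and-agreement story on this slide can be sketched in a few lines of numpy. This is only an illustration under simplifying assumptions: I'm standing in for the pose vectors with 3×3 homogeneous matrices, the names Tij, Thj, Ti, Th, and Tj follow the slide's notation, and all the numeric poses are made up. The last two checks show the equivariance point: a change of viewpoint V changes every pose activity but leaves the weights alone, and it preserves the agreement between the two predictions.

```python
import numpy as np

def pose(tx, ty, theta, scale):
    """3x3 homogeneous matrix for a 2-D pose (position, rotation, scale)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[scale * c, -scale * s, tx],
                     [scale * s,  scale * c, ty],
                     [0.0,        0.0,       1.0]])

# Fixed part-whole geometry (hypothetical values): where the mouth
# and nose sit inside a canonical face, independent of viewpoint.
mouth_in_face = pose(0.0, -6.0, 0.0, 0.5)
nose_in_face  = pose(0.0,  0.0, 0.0, 0.4)

# The learned weights: Tij maps a mouth pose to a face prediction,
# Thj maps a nose pose to a face prediction (inverses of the above).
Tij = np.linalg.inv(mouth_in_face)
Thj = np.linalg.inv(nose_in_face)

# A face actually present at some pose on the retina.
Tj = pose(20.0, 15.0, 0.3, 1.5)

# The poses the two part detectors would report.
Ti = Tj @ mouth_in_face   # mouth pose on the retina
Th = Tj @ nose_in_face    # nose pose on the retina

# Each part independently predicts the pose of the whole face.
# The predictions agree, so the logistic unit Pj should turn on.
assert np.allclose(Ti @ Tij, Th @ Thj)

# Equivariance: a change of viewpoint V transforms every pose
# activity, but the weights Tij and Thj are untouched...
V = pose(-5.0, 2.0, -0.4, 0.8)
Ti_new, Th_new = V @ Ti, V @ Th

# ...and the two predictions still agree (both equal V @ Tj),
# so the same face is recognized from the new viewpoint.
assert np.allclose(Ti_new @ Tij, Th_new @ Thj)
assert np.allclose(Ti_new @ Tij, V @ Tj)
```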
That means the percept of an object, not its label but what it looks like, is going to change as the viewpoint changes.

I'm going to finish by giving you some evidence that our visual systems really do impose coordinate frames in order to represent shapes. This was pointed out a long time ago by a great psychologist called Irvin Rock. If you look at this shape and I tell you it's a country, most people don't know which country it is. They look at it and think it looks a bit like Australia, or sort of a mirror image of Australia, but it's not really a familiar country at all. If you tell them where the top is, that it's that way up, they immediately recognize that it's Africa. Once they know what coordinate frame to impose on it, it immediately becomes a familiar shape.

Similarly, if I give you a shape like this one, you can perceive it as a square or you can perceive it as a diamond. Those are two completely different percepts, and what you know about the shape is totally different depending on which way you perceive it. For example, if you perceive it as a tilted square, you're acutely sensitive to whether the angles are right angles. If you perceive it as an upright diamond, you're not sensitive to that at all; the angles could be 5 degrees off and you wouldn't notice. But you are sensitive to something else: if you perceive it as an upright diamond, you're acutely sensitive to whether the corner on the left and the corner on the right are at the same height. And you'll probably notice now that, in this figure, they're at very slightly different heights. These kinds of demonstrations are evidence that, in order to represent shapes, we impose coordinate frames on them. When you're looking at that square or diamond, it's the same thing you're looking at, but the percept is totally different depending on what coordinate frame you impose.