In this video, I'm going to talk about the issue of viewpoint invariance. Each time we look at an object in the scene, we typically have a different viewpoint, so the object shows up on different pixels. This makes object recognition very unlike most machine learning tasks, and I'm going to talk about various ways of trying to deal with that issue.

A number of different ways have been suggested for coping with viewpoint variation. We're so good at it that we don't really appreciate how difficult it is. It's one of the main difficulties in making computers perceive, and there still aren't generally accepted solutions, either in engineering or in psychology. The first approach is to use redundant invariant features. The second approach is to put a box around the object so that you can normalize the pixels. The third approach is to use replicated features and pool them; this is called convolutional neural nets, and I'll go into that in great detail. And the fourth approach, which I'll talk about at the end of the lecture, is to use a hierarchy of parts and to explicitly represent the poses of the parts relative to the camera or retina.

So, the invariant feature approach says you should extract a large and redundant set of features, and they should be features that are invariant under transformations like translation, rotation and scaling. Here's an example of an invariant feature: a pair of roughly parallel lines with a red dot between them. That's actually been suggested as the feature that baby herring gulls use for knowing where to peck for food. If you paint that feature on a piece of wood, they'll peck at the appropriate place on the piece of wood. With enough invariant features, there's only one way to assemble them into an object or an image. You don't actually need to represent the relationships between features directly, because those relationships are captured by other features. This has been pointed out for strings of letters by a psychologist called Wayne Wickelgren, and it's been pointed out in vision by Shimon Ullman. It's a subtle point that all we need is a big bag of features, because with overlapping and redundant features, one feature will tell you how two other features are related. Unfortunately, if you're doing recognition, you're going to get a whole bunch of features that are composed of parts of different objects, and they'll be very misleading for recognition. So you'd like to avoid forming features from parts of different objects.

A second approach is what I call judicious normalization. If you look at that upside-down capital letter R on the right, I've put a box around it, not very well in fact, and I've labeled a top and a front for that box. Relative to that box, the R has, for example, a vertical stroke at the back, and it has a loop facing forwards at the top. So if we describe features of the R relative to that box, they're going to be invariant. This is assuming it's a rigid shape. Putting a box around a rigid shape solves the dimension-hopping problem: it gets rid of the effect of changes in viewpoint. If we choose the box correctly, the same part of an object will always occur on the same normalized pixels. It doesn't have to be a rectangular box; we can provide invariance not only to translation, rotation and scale, but also to things like shear and stretch.
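To make the idea of normalized pixels concrete, here is a minimal sketch in Python/NumPy, assuming the box has already been chosen: it crops the region inside the box and resamples it onto a fixed grid, so the same part of the object always lands on the same normalized pixels. The function name, box format, and output size are illustrative assumptions, not from the lecture.

```python
# A minimal sketch of normalizing the pixels inside a box, assuming the box is
# already known. Names (normalize_patch, out_size) are illustrative.
import numpy as np

def normalize_patch(image, box, out_size=(32, 32)):
    """Crop box = (top, left, height, width) from a 2-D image array and
    resample it onto a fixed out_size grid with nearest-neighbour sampling,
    so the same part of the object always lands on the same normalized pixels."""
    top, left, height, width = box
    out_h, out_w = out_size
    # For each normalized pixel, pick the corresponding source pixel in the box.
    rows = top + (np.arange(out_h) + 0.5) * height / out_h
    cols = left + (np.arange(out_w) + 0.5) * width / out_w
    rows = np.clip(rows.astype(int), 0, image.shape[0] - 1)
    cols = np.clip(cols.astype(int), 0, image.shape[1] - 1)
    return image[np.ix_(rows, cols)]

# Example: normalize a 40 x 20 box from a 100 x 100 image to 32 x 32 pixels.
img = np.random.rand(100, 100)
patch = normalize_patch(img, box=(30, 50, 40, 20))
print(patch.shape)  # (32, 32)
```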
Unfortunately, choosing the box is difficult. It's difficult because we might have segmentation errors, we might have occlusion, so you can't just shrink-wrap a box around things, and we might have unusual orientations. That example of the upside-down R makes it clear that we have to use our knowledge of what the shape is to help us decide what the box is. If, for example, we had a character that was like a lowercase d, but with an extra stroke coming out of the loop of the d, we would see that as an upright one of those characters. So it's a chicken-and-egg problem: in order to get the box right, we need to recognize the shape, and in order to recognize the shape, we need to get the box right.

An aside here for psychologists. Many psychologists think we do mental rotation to deal with shapes that aren't oriented right. This is complete nonsense. That capital letter R you recognize perfectly well before you do any mental rotation. Indeed, you need to recognize that it's an R and that it's upside down in order to know how to rotate it. You use mental rotation for dealing with judgments like handedness, that is, is it a correct R or a mirror-image R? You can't tell that without doing mental rotation. So mental rotation is not used for dealing with the fact that it's upside down when we want to recognize it.

The brute force normalization approach works like this. You use well-segmented, upright images that you can judiciously put a box around when you train the recognizer, and then at test time, when you have to deal with cluttered images, you try all possible boxes in a whole range of positions and scales. This approach is widely used in computer vision, particularly for detecting upright things like faces or house numbers in unsegmented images. It's much more efficient if the recognizer can cope with some variation in position and scale, so that we can use a coarse grid when trying all possible boxes.
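As a rough sketch of what this looks like at test time, here is a sliding-window search, assuming a trained recognizer that scores a normalized patch, and reusing the hypothetical normalize_patch function from the earlier sketch. The box sizes, stride, and dummy scoring function are all illustrative assumptions.

```python
# A rough sketch of brute force normalization at test time: try boxes on a
# coarse grid of positions and a few scales, normalize each one, and keep the
# best-scoring box. Reuses the hypothetical normalize_patch from the earlier
# sketch; box_sizes, stride and score_fn are illustrative assumptions.
import numpy as np

def detect(image, score_fn, box_sizes=((20, 20), (40, 40), (80, 80)), stride=8):
    best_score, best_box = -np.inf, None
    H, W = image.shape
    for bh, bw in box_sizes:                       # a few scales
        for top in range(0, H - bh + 1, stride):   # coarse grid of positions
            for left in range(0, W - bw + 1, stride):
                box = (top, left, bh, bw)
                score = score_fn(normalize_patch(image, box))
                if score > best_score:
                    best_score, best_box = score, box
    return best_box, best_score

# Example with a dummy recognizer that just scores mean brightness, so the
# search homes in on the bright rectangle.
img = np.zeros((100, 100))
img[30:70, 50:70] = 1.0
print(detect(img, score_fn=lambda patch: patch.mean()))
```

The coarse stride is what the lecture refers to at the end: the recognizer only has to tolerate small shifts and scale changes within one grid step, which keeps the number of boxes tried manageable.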