In this video I'm going to talk about the issue of viewpoint invariance. Each time we look at an object in a scene, we typically have a different viewpoint, so the object shows up on different pixels. This makes object recognition very unlike most machine learning tasks, and I'm going to talk about various ways of trying to deal with that issue. A number of different ways have been suggested for coping with viewpoint variation. We're so good at it that we don't really appreciate how difficult it is. It's one of the main difficulties in making computers perceive, and there still aren't generally accepted solutions, either in engineering or in psychology.

The first approach is to use redundant invariant features. The second approach is to put a box around the object so that you can normalize the pixels. The third approach is to use replicated features and pool them; this is called convolutional neural nets, and I'll go into that in great detail. And the fourth approach, which we'll talk about at the end of the lecture, is to use a hierarchy of parts and to explicitly represent the poses of the parts relative to the camera or retina.

So the invariant feature approach says you should extract a large and redundant set of features, and they should be features that are invariant under transformations like translation, rotation and scaling. Here's an example of an invariant feature: a pair of roughly parallel lines with a red dot between them. That's actually been suggested as the feature that baby herring gulls use for knowing where to peck for food. If you paint that feature on a piece of wood, they'll peck at the appropriate place on the piece of wood.

With enough invariant features, there's only one way to assemble them into an object or an image. You don't actually need to represent the relationships between features directly, because those relationships are captured by other features. This has been pointed out for strings of letters by a psychologist called Wayne [UNKNOWN], and it's been pointed out in vision by Shimon Ullman. It's a sort of cute point that all we need is a big bag of features, because with overlapping and redundant features, one feature will tell you how two other features are related. Unfortunately, if you're doing recognition, you're going to get a whole bunch of features that are composed of parts of different objects.
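To make the "big bag of overlapping features" point concrete, here is a minimal Python sketch (my own illustration, not from the lecture) using strings of letters: if a word is represented only by its unordered bag of overlapping letter trigrams, the overlaps between features are usually enough to pin down the original letter order, so the relationships between letters don't have to be stored separately.

```python
from collections import Counter

def bag_of_trigrams(word):
    """Represent a word as an unordered bag of overlapping letter trigrams.

    The '#' padding marks the ends, so boundary letters also appear in
    redundant, overlapping features.
    """
    padded = "#" + word + "#"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

# Each trigram shares two letters with its neighbours, so the bag implicitly
# encodes how the letters are related, even though the bag itself is unordered.
print(bag_of_trigrams("viewpoint"))

# Scrambling the letters changes the bag, so the relationships really are
# captured by the overlapping features themselves.
print(bag_of_trigrams("viewpoint") == bag_of_trigrams("viewopint"))  # False
```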
Features formed from parts of different objects will be very misleading for recognition, so you'd like to avoid forming them.

A second approach is what I call judicious normalization. If you look at that upside-down capital letter R on the right, I've put a box around it (not very well, in fact), and I've labeled a top and a front for that box. Relative to that box, the R has, for example, a vertical stroke at the back and a loop facing forwards at the top. So if we describe features of the R relative to that box, they're going to be invariant. This is assuming it's a rigid shape. Putting a box around a rigid shape solves the dimension-hopping problem: it gets rid of the effect of changes in viewpoint. If we choose the box correctly, the same part of an object will always occur on the same normalized pixels. It doesn't have to be a rectangular box; we can provide invariance not only to translation, rotation and scale but also to things like shear and stretch.

Unfortunately, choosing the box is difficult. It's difficult because we might have segmentation errors, we might have occlusion (so you can't just shrink a box around things), and we might have unusual orientations. That example of the upside-down R makes it clear that we have to use our knowledge of what the shape is to help us decide what the box is. If, for example, we had a character that was like a lowercase d but with an extra stroke coming out of the loop of the d, we would see the upside-down R as an upright one of those characters. So it's a chicken-and-egg problem: in order to get the box right, we need to recognize the shape, and in order to recognize the shape, we need to get the box right.

An aside here for psychologists. Many psychologists think we do mental rotation to deal with shapes that aren't oriented right. This is complete nonsense. That capital letter R you recognize perfectly well before you do any mental rotation. Indeed, you need to recognize that it's an R and that it's upside down in order to know how to rotate it. You use mental rotation for dealing with judgments like handedness, that is, is it a correct R or a mirror-image R? You can't tell that without doing mental rotation. But mental rotation is not used for dealing with the fact that it's upside down when we want to recognize it.
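As a rough illustration of what normalizing the pixels relative to a box means (my sketch, not code from the lecture), the snippet below crops an assumed axis-aligned box from an image array and resamples it onto a fixed grid with nearest-neighbour sampling. Provided the box is chosen correctly, the same part of a rigid shape then lands on the same normalized pixels whatever its position and scale in the original image.

```python
import numpy as np

def normalize_box(image, box, out_size=(32, 32)):
    """Resample the contents of an axis-aligned box onto a fixed grid.

    image: 2-D numpy array of pixel intensities.
    box:   (top, left, height, width) in image coordinates (an assumed format).
    Returns an out_size array in which, given the right box, the same object
    part always occupies the same normalized pixels.
    """
    top, left, height, width = box
    out_h, out_w = out_size
    # Nearest-neighbour sampling: map each output pixel back into the box.
    rows = top + (np.arange(out_h) + 0.5) * height / out_h
    cols = left + (np.arange(out_w) + 0.5) * width / out_w
    rows = np.clip(rows.astype(int), 0, image.shape[0] - 1)
    cols = np.clip(cols.astype(int), 0, image.shape[1] - 1)
    return image[np.ix_(rows, cols)]

# A small vertical stroke; a tight box around it gives the same normalized
# patch wherever and at whatever size the stroke appears.
img = np.zeros((100, 100))
img[20:40, 30:35] = 1.0
patch = normalize_box(img, (20, 30, 20, 5))
```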
The brute force normalization approach works like this. You use well-segmented, upright images that you can judiciously put a box around when you train the recognizer, and then at test time, when you have to deal with cluttered images, you try all possible boxes in a whole range of positions and scales. This approach is widely used in computer vision, particularly for detecting upright things like faces or house numbers in unsegmented images. It's much more efficient if the recognizer can cope with some variation in position and scale, so that we can use a coarse grid when trying all possible boxes.
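A minimal sketch of that brute-force search (my own illustration, not from the lecture): slide candidate boxes over the image on a coarse grid of positions and scales, normalize each one using the normalize_box helper sketched above, and keep the boxes that a trained recognizer scores highly. The recognizer here is a hypothetical score_fn; any classifier trained on well-segmented, upright examples could play that role.

```python
def detect(image, score_fn, scales=(20, 40, 80), stride_frac=0.25, threshold=0.9):
    """Brute-force normalization at test time: try boxes over a coarse grid
    of positions and scales and score each normalized patch.

    score_fn: hypothetical classifier mapping a normalized patch to a score
              in [0, 1]; it must tolerate the small position and scale jitter
              left over by the coarse grid.
    Returns a list of (score, box) pairs above the threshold.
    """
    detections = []
    for size in scales:                              # coarse grid over scales
        stride = max(1, int(size * stride_frac))     # coarse grid over positions
        for top in range(0, image.shape[0] - size + 1, stride):
            for left in range(0, image.shape[1] - size + 1, stride):
                box = (top, left, size, size)
                score = score_fn(normalize_box(image, box))
                if score >= threshold:
                    detections.append((score, box))
    return detections
```

The coarser the grid (larger stride_frac and fewer scales), the cheaper the search, but the more position and scale variation the recognizer itself has to absorb.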