In this video, I'm going to describe some of the things that make it difficult to recognize objects in real scenes. We're incredibly good at this, and so it's very hard for us to realize how difficult it is to take a bunch of numbers that describe the intensities of pixels and go from there to the label of an object. There are all sorts of difficulties. We have to segment the object out. We have to deal with variations in lighting and in viewpoint. We have to deal with the fact that the definitions of objects are quite complicated. It's also possible that getting from an image to an object requires huge amounts of knowledge, even for the lower-level processes that involve segmentation and dealing with viewpoint and lighting. If that's the case, it's going to be very hard for any hand-engineered program to do a good job of those things.

There are many reasons why it's hard to recognize objects in images. First of all, it's hard to segment an object out from the other things in an image. In the real world, we move around, and so we have motion cues. We also have two eyes, so we have stereo cues. You don't get those in static images, so it's very hard to tell which pieces go together as parts of the same object. Also, parts of an object can be hidden behind other objects, so you often don't see the whole of an object. You're so good at doing vision that you often don't notice this.

Another thing that makes it very hard to recognize objects is that the intensity of a pixel is determined as much by the lighting as it is by the nature of the object. So, for example, a black surface in bright light will give you much more intense pixels than a white surface in very gloomy light. Remember, to recognize an object you've got to convert a bunch of numbers, that is, the intensities of the pixels, into a class label, and these intensities are varying for all sorts of reasons that have nothing to do with the nature of the object, or nothing to do with the identity of the object.

Objects can also deform in a variety of ways. So even for relatively simple things like handwritten digits, there's a wide variety of different shapes that have the same name. A two, for example, could be very italic, with just a cusp instead of a loop, or it could be a very loopy two, with a big loop, and be very round.
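As a rough illustration of the lighting point above (my own sketch, not part of the lecture), you can think of a pixel's intensity as approximately the product of the illumination falling on a surface and the surface's reflectance. The class of an object depends on reflectance, but the numbers a classifier actually sees are dominated by illumination; the values below are invented purely to show the confound.

```python
# Crude image-formation sketch (illustration only): intensity ~ illumination * reflectance.

def pixel_intensity(illumination: float, reflectance: float) -> float:
    """Very rough model of the brightness a camera records for one pixel."""
    return illumination * reflectance

# A black surface (low reflectance) in bright light...
black_in_bright_light = pixel_intensity(illumination=100.0, reflectance=0.05)

# ...versus a white surface (high reflectance) in very gloomy light.
white_in_gloomy_light = pixel_intensity(illumination=2.0, reflectance=0.9)

print(black_in_bright_light, white_in_gloomy_light)  # 5.0 vs 1.8: the "black" pixel is brighter
```

So a model working directly on raw intensities has to disentangle what is due to the object from what is due to the lighting.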
It's also the case that for many types of object, the class is defined more by what the object is for than by its visual appearance. So consider chairs. There's a huge variety of things we call chairs, from armchairs to modern chairs made with steel frames and wood backs, to basically anything you can sit on.

One other thing that makes it hard to recognize objects is that we have different viewpoints. There's a wide variety of viewpoints from which we can recognize a 3D object, and changes in viewpoint cause changes in the images that standard machine learning methods cannot cope with. The problem is that information hops about between the input dimensions. Typically, in vision, the input dimensions correspond to pixels, and if an object moves in the world and you don't move your eyes to follow it, the information about the object will occur on different pixels. That's not the kind of thing we normally have to deal with in machine learning.

Just to stress that point, suppose we had a medical database in which one of the inputs is the age of a patient and another input is the weight of the patient. We start doing machine learning, and then we realize that some coder has actually changed which input dimension is coding which property: one of the coders has put weight where they should have put age, and age where they should have put weight. Obviously, we wouldn't just carry on doing our learning, because that's going to make everything go wrong; we'd try to do something to fix it. I call that phenomenon dimension hopping: when information jumps from one input dimension to another. That's what viewpoint does, and it's something we need to fix, preferably in a systematic way.
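To make the dimension-hopping idea concrete, here's a small sketch of my own (the lecture gives no code; NumPy and the specific numbers are my assumptions). It builds a tiny image, shifts the same object by a few pixels, and shows that the two flattened pixel vectors have essentially nothing in common, even though they depict the same thing.

```python
# Sketch of "dimension hopping" under translation (illustration, not from the lecture).
import numpy as np

rng = np.random.default_rng(0)

# A tiny 8x8 "image" containing a 3x3 bright blob near the top-left corner.
image = np.zeros((8, 8))
image[1:4, 1:4] = rng.uniform(0.5, 1.0, size=(3, 3))

# The same object, shifted three pixels down and three pixels right.
shifted = np.roll(np.roll(image, 3, axis=0), 3, axis=1)

x1 = image.flatten()    # input dimensions = pixels
x2 = shifted.flatten()  # same object, but different pixels now carry the information

# A learner that treats each pixel as a fixed feature sees two unrelated inputs:
cosine = np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
print(f"cosine similarity between the two inputs: {cosine:.3f}")  # prints 0.000
```

This is the age/weight column swap on a larger scale: the information is all still there, but it has jumped to different input dimensions, which is why the fix needs to be systematic rather than handled case by case.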