1
00:00:00,000 --> 00:00:04,291
[MUSIC]

2
00:00:04,291 --> 00:00:07,792
The first place where neural networks
made a tremendous amount of difference,

3
00:00:07,792 --> 00:00:11,140
is in an area called computer vision,
so analyzing images and videos.

4
00:00:11,140 --> 00:00:15,200
So let's see a few examples
of how deep learning, or

5
00:00:15,200 --> 00:00:18,160
this big neural networks,
can be applied to computer vision.

6
00:00:18,160 --> 00:00:22,020
So to do that, it's good to
understand what image features are.

7
00:00:23,710 --> 00:00:27,580
So in computer vision, image features
are kind of like local detectors that get

8
00:00:27,580 --> 00:00:29,550
combined to make a prediction.

9
00:00:29,550 --> 00:00:31,520
So let's say we take
this particular image.

10
00:00:31,520 --> 00:00:35,680
Suppose that I want to predict whether
this a face image or not a face image.

11
00:00:35,680 --> 00:00:41,490
I run the neural detector,
let's say a nose detector, eye detector,

12
00:00:41,490 --> 00:00:45,070
another eye detector,
a mouth detector, and

13
00:00:45,070 --> 00:00:48,448
if all of these fire, you can do it and

14
00:00:48,448 --> 00:00:53,090
using a little neural network, you can say
this is a face, and that's our prediction.

15
00:00:55,180 --> 00:00:59,200
Now, this is a simple example of how it
can build a classifier for images, but

16
00:00:59,200 --> 00:01:02,920
in reality they don't explicitly have
a nose detector or eye detector.

17
00:01:02,920 --> 00:01:06,440
What happens is these
called image features or

18
00:01:06,440 --> 00:01:09,910
interest points and
there's various names for this.

19
00:01:09,910 --> 00:01:13,280
But they really tried to
find local image segments,

20
00:01:13,280 --> 00:01:15,616
patches, that are really distinctive.

21
00:01:15,616 --> 00:01:19,720
So then maybe they'll find
the corner around the eye,

22
00:01:19,720 --> 00:01:24,090
maybe the corner around the nose, so
if you have lots of this corner detectors,

23
00:01:24,090 --> 00:01:27,030
a face is comprised of corners.

24
00:01:27,030 --> 00:01:32,587
Corner detector firings at places around
the eye, the mouth, and both eyes.

25
00:01:32,587 --> 00:01:35,124
And if enough of this fire
in a particular pattern,

26
00:01:35,124 --> 00:01:36,842
you discover that you have a face.

27
00:01:36,842 --> 00:01:39,170
So this is how computer
vision typically works.

28
00:01:39,170 --> 00:01:40,260
So this is how classification works.

29
00:01:40,260 --> 00:01:43,170
Of course, there's more general models and
more complex ones, but

30
00:01:43,170 --> 00:01:45,240
this is kind of the basic idea.

31
00:01:45,240 --> 00:01:51,140
For years, these types of detectors
of local features are built by hand.

32
00:01:51,140 --> 00:01:54,520
A very popular one was
called SIFT features.

33
00:01:54,520 --> 00:01:57,844
And this retransformed their computer
vision because they were really quite

34
00:01:57,844 --> 00:01:59,590
applicable and quite cool.

35
00:01:59,590 --> 00:02:03,820
And then, there are many others
that improve the accuracy.

36
00:02:03,820 --> 00:02:06,240
So other kinds of features
that can be used.

37
00:02:07,370 --> 00:02:12,460
We talked about this hand created
image features like SIFT feature, and

38
00:02:12,460 --> 00:02:16,480
so, let's talk about how they can be
typically used for classification.

39
00:02:16,480 --> 00:02:19,749
What we do is, we run the sifted
textures over the image and

40
00:02:19,749 --> 00:02:22,290
they fire in various places.

41
00:02:22,290 --> 00:02:25,550
So for example the corners of the eyes and
the mouth.

42
00:02:25,550 --> 00:02:31,130
And then what we do is, we create
a vector that describe the image based

43
00:02:31,130 --> 00:02:35,570
on the firings, the locations
where those SIFT features fired.

44
00:02:35,570 --> 00:02:41,554
So, you might have some firings in some
locations, no firings in other locations,

45
00:02:41,554 --> 00:02:45,787
and this can be viewed similarly
to the words in a document.

46
00:02:45,787 --> 00:02:48,640
So, does the word messy appear?

47
00:02:48,640 --> 00:02:50,480
Does the word football appear?

48
00:02:50,480 --> 00:02:54,510
Similarly, does a corner appear in
a particular place in the image.

49
00:02:54,510 --> 00:02:59,080
Now once we have that description of
the image, we feed it to a classifier.

50
00:02:59,080 --> 00:03:00,240
So for example,

51
00:03:00,240 --> 00:03:03,240
a simple linear classifier like we
talked about earlier in the quarter.

52
00:03:04,860 --> 00:03:07,320
It's not a quarter,
we're teaching this online.

53
00:03:07,320 --> 00:03:08,660
It was earlier in the module.

54
00:03:08,660 --> 00:03:12,690
[LAUGH]
So as we talked about earlier in

55
00:03:12,690 --> 00:03:17,750
the module, you can feed it to a simple
linear classifier and some names for those

56
00:03:17,750 --> 00:03:21,420
are things like linguistically regression,
support vector machines and more.

57
00:03:21,420 --> 00:03:25,830
And from there, we get a detection as to
whether this is image is a face or not.

58
00:03:27,790 --> 00:03:29,420
Now that sounds pretty exciting and

59
00:03:29,420 --> 00:03:33,180
it had a real significant impact
in their computer vision.

60
00:03:33,180 --> 00:03:40,180
The challenge though, is that creating
these hand built image features was

61
00:03:40,180 --> 00:03:45,500
a really complicated process and required
several PhD thesis to be done well.

62
00:03:48,240 --> 00:03:52,480
Neural networks are going to discover and
learn those features automatically.

63
00:03:52,480 --> 00:03:54,120
Let me give you an example of that.

64
00:03:54,120 --> 00:03:56,210
Suppose they give you this input image and

65
00:03:56,210 --> 00:04:00,320
they run it through a three layer neural
network before making a prediction.

66
00:04:00,320 --> 00:04:04,952
Typically what happens,
is that you learn local feature detectors,

67
00:04:04,952 --> 00:04:08,380
they're like SIFT, but
at different levels and different layers.

68
00:04:08,380 --> 00:04:13,230
And this detectors that you learn,
they detect different things,

69
00:04:13,230 --> 00:04:16,110
different properties of
the image at different levels.

70
00:04:16,110 --> 00:04:20,590
So, the first layer,
you might learn detectors that look kinda

71
00:04:20,590 --> 00:04:25,710
like these little patches, which really
react to things like diagonal edges.

72
00:04:25,710 --> 00:04:29,830
So this first detector here is all
about capturing diagonal edges.

73
00:04:29,830 --> 00:04:33,760
The center one is about capturing
diagonal edges in the other direction.

74
00:04:33,760 --> 00:04:39,720
And the last one here is about capturing
transitions and color from dark to green.

75
00:04:42,005 --> 00:04:46,415
Now, if we look at the next layer,
you're combining this edge,

76
00:04:46,415 --> 00:04:50,715
diagonal edge [INAUDIBLE] into some
kind of more complex detectors.

77
00:04:50,715 --> 00:04:55,620
So, for example,
we discovered this wiggly line and

78
00:04:55,620 --> 00:04:58,675
pattern detectors in the layer.

79
00:04:58,675 --> 00:05:02,710
You also discovered this kind of
detectors that react to corners,

80
00:05:02,710 --> 00:05:05,531
that [INAUDIBLE] detect
corners in the images.

81
00:05:05,531 --> 00:05:10,508
And at the final layers you come up with
detectors that are even more complicated.

82
00:05:10,508 --> 00:05:15,030
So for
a variety of images you might end up with

83
00:05:15,030 --> 00:05:18,960
things that react to torsos and faces.

84
00:05:18,960 --> 00:05:24,831
Or maybe if you have a bigger data set,
even to these images of here,

85
00:05:24,831 --> 00:05:28,448
which they fire up with images of corals.

86
00:05:28,448 --> 00:05:33,146
So neural networks capture different
types of image features at

87
00:05:33,146 --> 00:05:37,591
different layers and
then they get learned automatically.

88
00:05:37,591 --> 00:05:41,819
[MUSIC]