1
00:00:00,300 --> 00:00:04,300
The job of the function d,
which you learned about in the last video,

2
00:00:04,300 --> 00:00:08,860
is to input two faces and tell you how
similar or how different they are.

3
00:00:08,860 --> 00:00:12,605
A good way to do this is
to use a Siamese network.

4
00:00:12,605 --> 00:00:13,105
Let's take a look.

5
00:00:15,117 --> 00:00:19,098
You're used to seeing pictures of
confidence like these where you input

6
00:00:19,098 --> 00:00:20,635
an image, let's say x1.

7
00:00:20,635 --> 00:00:24,165
And through a sequence of
convolutional and pulling and

8
00:00:24,165 --> 00:00:28,465
fully connected layers,
end up with a feature vector like that.

9
00:00:30,260 --> 00:00:35,940
And sometimes this is fed to a softmax
unit to make a classification.

10
00:00:35,940 --> 00:00:39,000
We're not going to use that in this video.

11
00:00:39,000 --> 00:00:44,090
Instead, we're going to focus
on this vector of let's say 128

12
00:00:44,090 --> 00:00:49,900
numbers computed by some fully connected
layer that is deeper in the network.

13
00:00:50,930 --> 00:00:54,850
And I'm going to give this
list of 128 numbers a name.

14
00:00:54,850 --> 00:01:00,220
I'm going to call this f of x1,

15
00:01:00,220 --> 00:01:07,476
and you should think of f of x1 as
an encoding of the input image x1.

16
00:01:07,476 --> 00:01:12,653
So it's taken the input image,
here this picture of Kian,

17
00:01:12,653 --> 00:01:17,750
and is re-representing it
as a vector of 128 numbers.

18
00:01:17,750 --> 00:01:22,840
The way you can build a face recognition
system is then that if you want to compare

19
00:01:22,840 --> 00:01:27,890
two pictures, let's say this first
picture with this second picture here.

20
00:01:27,890 --> 00:01:31,860
What you can do is feed this second
picture to the same neural network with

21
00:01:31,860 --> 00:01:37,625
the same parameters and
get a different vector of 128 numbers,

22
00:01:37,625 --> 00:01:41,740
which encodes this second picture.

23
00:01:41,740 --> 00:01:43,560
So I'm going to call this second picture.

24
00:01:44,570 --> 00:01:50,480
So I'm going to call this encoding
of this second picture f of x2, and

25
00:01:50,480 --> 00:01:55,680
here I'm using x1 and
x2 just to denote two input images.

26
00:01:55,680 --> 00:01:58,030
They don't necessarily
have to be the first and

27
00:01:58,030 --> 00:02:00,010
second examples in your training sets.

28
00:02:00,010 --> 00:02:02,230
It can be any two pictures.

29
00:02:02,230 --> 00:02:07,320
Finally, if you believe that these
encodings are a good representation

30
00:02:07,320 --> 00:02:11,430
of these two images,
what you can do is then define the image

31
00:02:12,670 --> 00:02:16,463
d of distance between x1 and

32
00:02:16,463 --> 00:02:21,820
x2 as the norm of the difference

33
00:02:21,820 --> 00:02:25,750
between the encodings of these two images.

34
00:02:26,990 --> 00:02:29,434
So this idea of running two identical,

35
00:02:29,434 --> 00:02:34,544
convolutional neural networks on two
different inputs and then comparing them,

36
00:02:34,544 --> 00:02:39,380
sometimes that's called a Siamese
neural network architecture.

37
00:02:39,380 --> 00:02:43,530
And a lot of the ideas I'm
presenting here came from

38
00:02:43,530 --> 00:02:48,490
this paper due to Yaniv Taigman,
Ming Yang, Marc'Aurelio Ranzato, and

39
00:02:48,490 --> 00:02:53,690
Lior Wolf in the research system
that they developed called DeepFace.

40
00:02:54,890 --> 00:03:00,741
And many of the ideas I'm presenting here
came from a paper due to Yaniv Taigman,

41
00:03:00,741 --> 00:03:02,920
Ming Yang, Marc'Aurelio Ranzato, and

42
00:03:02,920 --> 00:03:07,155
Lior Wolf in a system that they
developed called DeepFace.

43
00:03:08,520 --> 00:03:11,830
So how do you train this
Siamese neural network?

44
00:03:11,830 --> 00:03:15,380
Remember that these two neural
networks have the same parameters.

45
00:03:16,730 --> 00:03:19,600
So what you want to do is really
train the neural network so

46
00:03:19,600 --> 00:03:24,250
that the encoding that it
computes results in a function d

47
00:03:24,250 --> 00:03:27,420
that tells you when two pictures
are of the same person.

48
00:03:27,420 --> 00:03:33,350
So more formally, the parameters of the
neural network define an encoding f of xi.

49
00:03:33,350 --> 00:03:35,548
So given any input image xi,

50
00:03:35,548 --> 00:03:40,968
the neural network outputs this
128 dimensional encoding f of xi.

51
00:03:40,968 --> 00:03:45,655
So more formally,
what you want to do is learn parameters so

52
00:03:45,655 --> 00:03:50,152
that if two pictures, xi and
xj, are of the same person,

53
00:03:50,152 --> 00:03:55,347
then you want that distance between
their encodings to be small.

54
00:03:55,347 --> 00:03:59,583
And in the previous slide,
l was using x1 and x2, but

55
00:03:59,583 --> 00:04:03,841
it's really any pair xi and
xj from your training set.

56
00:04:03,841 --> 00:04:07,959
And in contrast, if xi and
xj are of different persons,

57
00:04:07,959 --> 00:04:13,340
then you want that distance between
their encodings to be large.

58
00:04:13,340 --> 00:04:18,160
So as you vary the parameters in all
of these layers of the neural network,

59
00:04:18,160 --> 00:04:20,665
you end up with different encodings.

60
00:04:20,665 --> 00:04:23,639
And what you can do is
use back propagation and

61
00:04:23,639 --> 00:04:29,590
vary all those parameters in order to
make sure these conditions are satisfied.

62
00:04:29,590 --> 00:04:33,460
So you've learned about the Siamese
network architecture and

63
00:04:33,460 --> 00:04:36,890
have a sense of what you want
the neural network to output for

64
00:04:36,890 --> 00:04:39,830
you in terms of what would
make a good encoding.

65
00:04:39,830 --> 00:04:42,790
But how do you actually
define an objective function

66
00:04:42,790 --> 00:04:46,700
to make a neural network learn to
do what we just discussed here?

67
00:04:46,700 --> 00:04:50,940
Let's see how you can do that in the next
video using the triplet loss function.