The triplet loss is one good way to learn the parameters of a ConvNet for face recognition. There's another way to learn these parameters. Let me show you how face recognition can also be posed as a straight binary classification problem.

Another way to train a neural network is to take this pair of neural networks, this Siamese network, and have them both compute these embeddings, maybe 128-dimensional embeddings, maybe even higher dimensional, and then have these be input to a logistic regression unit that just makes a prediction, where the target output will be one if both of these are the same person, and zero if these are of different persons. So this is a way to treat face recognition just as a binary classification problem, and this is an alternative to the triplet loss for training a system like this.

Now, what does this final logistic regression unit actually do? The output y-hat will be a sigmoid function applied to some set of features, but rather than just feeding in these encodings, what you can do is take the differences between the encodings. Let me show you what I mean. Let's say I write a sum over k = 1 to 128 of the absolute value of the difference, taken element-wise, between the two encodings:

y-hat = sigma( sum_{k=1}^{128} w_k * | f(x^(i))_k - f(x^(j))_k | + b )

In this notation, f(x^(i)) is the encoding of the image x^(i), and the subscript k means to just select out the k-th component of this vector. So this is taking the element-wise difference, in absolute value, between these two encodings. And what you might do is think of these 128 numbers as features that you then feed into logistic regression. The logistic regression unit has additional parameters w_k and b, similar to a normal logistic regression unit, and you would train an appropriate weighting on these 128 features in order to predict whether these two images are of the same person or of different persons. So this is one pretty useful way to learn to predict zero or one, whether these are the same person or different persons. And there are a few other variations on how you can compute the formula that I underlined in green.
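As a concrete illustration, here is a minimal sketch in NumPy of what this final classification unit computes, assuming the two 128-dimensional encodings already come out of the shared Siamese ConvNet. The function names, the random stand-in encodings, and the parameters w and b are placeholders for illustration, not code from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def same_person_probability(enc_i, enc_j, w, b):
    """Predict P(same person) from two 128-D encodings f(x_i) and f(x_j)."""
    features = np.abs(enc_i - enc_j)       # element-wise |f(x^(i))_k - f(x^(j))_k|
    return sigmoid(np.dot(w, features) + b)

# Toy usage with random stand-ins for the encodings and the learned parameters.
rng = np.random.default_rng(0)
enc_i, enc_j = rng.normal(size=128), rng.normal(size=128)
w, b = rng.normal(size=128), 0.0
print(same_person_probability(enc_i, enc_j, w, b))
```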
For example, another formula could be (f(x^(i))_k - f(x^(j))_k)^2 divided by (f(x^(i))_k + f(x^(j))_k). This is sometimes called the chi-squared form, chi being the Greek letter, and so this is sometimes called the chi-squared similarity. This and other variations are explored in the DeepFace paper, which I referenced earlier as well.

So in this learning formulation, the input is a pair of images, so this is really your training input x, and the output y is either zero or one, depending on whether you're inputting a pair of pictures of the same person or of different persons. And same as before, you're training a Siamese network, so that means that this neural network up here has parameters that are tied to the parameters in this lower neural network. And this system can work pretty well as well.

Lastly, just to mention one computational trick that can help a deployment significantly: if this is the new image, so this is an employee walking in hoping that the turnstile or doorway will open for them, and this is an image from your database, then instead of having to compute that database image's embedding every single time, what you can do is pre-compute it. So when the new employee walks in, what you can do is use this upper component to compute the encoding of the new image, compare it to your pre-computed encoding, and then use that to make a prediction y-hat. Because you don't need to store the raw images, and also because, if you have a very large database of employees, you don't need to compute these encodings every single time for every employee in the database, this idea of pre-computing some of these encodings can save a significant amount of computation. And this type of pre-computation works both for this type of Siamese network architecture, where you treat face recognition as a binary classification problem, as well as when you are learning encodings, maybe using the triplet loss function as described in the last couple of videos.
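Here is a small sketch, again in NumPy, of the chi-squared similarity features together with the pre-computation trick. The embed() function is a random stand-in for the shared Siamese ConvNet, the employee names and the dictionary cache are illustrative assumptions, and the small eps added to the denominator is a practical guard that is not part of the spoken formula.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(image):
    # Stand-in for the shared Siamese ConvNet: in a real system this would be
    # the trained network's forward pass producing a 128-D encoding.
    return np.abs(rng.normal(size=128))

def chi_squared_features(enc_a, enc_b, eps=1e-8):
    # Element-wise chi-squared similarity: (a_k - b_k)^2 / (a_k + b_k),
    # with a small eps to avoid division by zero (an added practical guard).
    return (enc_a - enc_b) ** 2 / (enc_a + enc_b + eps)

def same_person_probability(enc_a, enc_b, w, b):
    # Logistic regression unit over the 128 similarity features.
    z = np.dot(w, chi_squared_features(enc_a, enc_b)) + b
    return 1.0 / (1.0 + np.exp(-z))

# Pre-compute and cache the encodings of the database images once, offline...
employee_images = {"alice": "alice.png", "bob": "bob.png"}   # placeholder images
database_encodings = {name: embed(img) for name, img in employee_images.items()}

# ...then, at the turnstile, only the new image needs to be encoded.
w, bias = rng.normal(size=128), 0.0                          # stand-in parameters
new_encoding = embed("live_frame.png")                       # placeholder live image
print(same_person_probability(new_encoding, database_encodings["alice"], w, bias))
```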
So, just to wrap up, to treat face verification as supervised learning, you create a training set of pairs of images now, rather than of triplets of images, where the target label is one when the pair shows pictures of the same person, and where the target label is zero when they are pictures of different persons. And you use different pairs to train the neural network, to train the Siamese network, using back propagation.

So this version that you just saw, of treating face verification, and by extension face recognition, as a binary classification problem, works quite well as well. And with that, I hope that you now know what it would take to train your own face verification or your own face recognition system, one that can do one-shot learning.
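To make the wrap-up concrete, here is a minimal training sketch in NumPy. It assumes the encodings already come from a frozen, shared Siamese ConvNet (random vectors stand in for them here), and it only trains the logistic-regression head on the element-wise absolute differences, using plain gradient descent on the binary cross-entropy loss; in the full setup described in the video, back propagation would also update the ConvNet itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs, dim = 1000, 128

enc_a = rng.normal(size=(n_pairs, dim))          # stand-in encodings f(x^(i))
enc_b = rng.normal(size=(n_pairs, dim))          # stand-in encodings f(x^(j))
labels = rng.integers(0, 2, size=n_pairs)        # 1 = same person, 0 = different

features = np.abs(enc_a - enc_b)                 # element-wise |f(x^(i))_k - f(x^(j))_k|

w, b = np.zeros(dim), 0.0
lr = 0.1
for _ in range(500):
    y_hat = 1.0 / (1.0 + np.exp(-(features @ w + b)))   # sigmoid predictions
    grad = y_hat - labels                                # dL/dz for cross-entropy loss
    w -= lr * (features.T @ grad) / n_pairs
    b -= lr * grad.mean()

accuracy = ((features @ w + b > 0) == labels).mean()
print(f"training-set accuracy: {accuracy:.2f}")
```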