In this video, I'm going to introduce Principal Components Analysis, which is a very widely used technique in signal processing. The idea of principal components analysis is that high-dimensional data can often be represented using a much lower-dimensional code. This happens when the data lies near a linear manifold in the high-dimensional space. So the idea is, if we can find this linear manifold, we can project the data onto the manifold and then just represent where it is on the manifold. And we haven't lost much, because in the directions orthogonal to the manifold there's not much variation in the data.

As we'll see, we can do this operation efficiently using standard principal components methods, or we can do it inefficiently using a neural net with one hidden layer, where both the hidden units and the output units are linear. The advantage of doing it with the neural net is that we can then generalize the technique to deep neural nets, in which the code is a nonlinear function of the input, and our reconstruction of the data from the code is also a nonlinear function of the code. This enables us to deal with curved manifolds in the input space. So we represent data by where it gets projected on the curved manifold, and this is a much more powerful representation.

In principal components analysis, we have N-dimensional data, and we want to represent it using fewer than N numbers. So we find M orthogonal directions in which the data has the most variance, and we ignore the directions in which the data doesn't vary much. These M principal directions form a lower-dimensional subspace, and we represent an N-dimensional data point by its projections onto these M directions in the lower-dimensional space. We've lost all information about where the data point is located in the remaining orthogonal directions, but since these don't have much variance, we haven't lost much information.

If we want to reconstruct the data point from our representation in terms of M numbers, we use the mean value for all the N minus M directions that are not represented. The error in our reconstruction is then the sum, over all these unrepresented directions, of the squared difference between the value of the data point in that direction and the mean value in that direction. This is most easily seen in a picture.
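To make the projection and the reconstruction error concrete, here is a minimal NumPy sketch (not from the lecture; the function and variable names are my own) that keeps the M directions of greatest variance and measures the squared error of reconstructing each point from its M-number code:

```python
import numpy as np

def pca_reconstruction(X, M):
    """Keep the M directions of greatest variance and reconstruct the data.

    X: array of shape (num_points, N).  Returns the M-number codes, the
    reconstructions, and the squared reconstruction error for each point.
    """
    mean = X.mean(axis=0)
    Xc = X - mean                                # center the data
    # Rows of Vt are orthogonal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:M].T                                 # N x M basis for the subspace
    codes = Xc @ W                               # projections onto the M directions
    recon = codes @ W.T + mean                   # ignored directions get the mean
    errors = ((X - recon) ** 2).sum(axis=1)      # squared distance per point
    return codes, recon, errors
```

With M = 1 on two-dimensional data, this reconstructs each point as its projection onto the single direction of greatest variance, with the ignored direction filled in by the mean, which is exactly the situation in the picture below.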
So consider two-dimensional data that's distributed according to an elongated Gaussian like this. The ellipse is meant to show roughly the one-standard-deviation contour of the Gaussian. And consider a data point like that red one. If we used principal components analysis with a single component, that component would be the direction in the data that had the greatest variance. And so to represent the red point, we'd represent how far along that direction it lay. In other words, we'd represent the projection of the red point onto that line, i.e. the green point. When we need to reconstruct the red point, what we do is simply use the mean value of all the data points in the direction that we've ignored. In other words, we'd reconstruct a point on that black line. And so the loss in the reconstruction will be the squared difference between the red point and the green point. That is, we'd have lost the difference between the data point and the mean value of all the data in the direction we're not representing, which is the direction of least variance. So we've obviously minimized our loss if we choose to ignore the direction of least variance.

Now, we can actually implement PCA, or a version of it, using backpropagation, but it's not very efficient. What we do is make a network in which the output of the network is the reconstruction of the data, and we try to minimize the squared error in the reconstruction. The network has a central bottleneck that only has M hidden units, and those are going to correspond to the principal components, or something like them. So it looks like this: we have an input vector; we project that onto a code vector; and from the code vector we construct an output vector. The aim is to make the output vector as similar as possible to the input vector.

The activities of the hidden units in the code vector form a bottleneck, so the code vector is a compressed representation of the input vector. If the hidden units and the output units are linear, then an autoencoder like this will learn codes that minimize the squared reconstruction error, and that's exactly what principal components analysis does. It will get exactly the same reconstruction error as principal components analysis does.
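Here is a rough sketch of that linear bottleneck network, trained with plain batch gradient descent on the squared reconstruction error. It is meant only as an illustration; the learning rate, number of epochs, and names are assumptions rather than anything specified in the lecture:

```python
import numpy as np

def train_linear_autoencoder(X, M, lr=0.01, epochs=2000, seed=0):
    """One-hidden-layer linear autoencoder: input -> M-unit code -> output,
    trained by gradient descent on the squared reconstruction error.
    X is assumed to be centered, with shape (num_points, N)."""
    rng = np.random.default_rng(seed)
    n, N = X.shape
    W_enc = 0.01 * rng.standard_normal((N, M))   # input -> code weights
    W_dec = 0.01 * rng.standard_normal((M, N))   # code -> output weights
    for _ in range(epochs):
        code = X @ W_enc                 # linear code (bottleneck) units
        recon = code @ W_dec             # linear output units
        err = recon - X                  # derivative of 0.5 * squared error
        grad_dec = (code.T @ err) / n
        grad_enc = (X.T @ (err @ W_dec.T)) / n
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
    return W_enc, W_dec
```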
But it won't necessarily have hidden units that correspond exactly to the principal components. They will span the same space as the first M principal components, but there may be a rotation and skewing of those axes. So the incoming weight vectors of the code units, which are what represent the directions of the components, may not be orthogonal, and, unlike in principal components analysis, they will typically have equal variances. But the space spanned by the incoming weight vectors of those code units will be exactly the same as the space spanned by the M principal components. So in that sense this network does something equivalent to principal components analysis. It's just that if we use stochastic gradient descent learning for this network, it will typically be much less efficient than the algorithm used for principal components, although if there's a huge amount of data, it might actually be more efficient.

The main point of implementing principal components analysis using backpropagation in a neural net is that it allows us to generalize principal components analysis. If we use a neural net that has nonlinear layers before and after the code layer, it should be possible to represent data that lies on a curved manifold, rather than a linear manifold, in a high-dimensional space. And this is much more general.

So our network will look something like this: there'll be an input vector, and then one or more layers of nonlinear hidden units, typically logistic units. Then there'll be a code layer, which might be linear units. And then following the code layer, there'll be one or more layers of nonlinear hidden units, and then there'll be an output vector, which we train to be as similar as possible to the input vector. So this is a curious network in which we're using a supervised learning algorithm to do unsupervised learning.

The bottom part of the network is an encoder, which takes the input vector and converts it into a code using a nonlinear method. The top part of the network is a decoder, which takes the nonlinear code and maps it back to a reconstruction of the input vector. So after we've done the learning, we have mappings in both directions.
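As an illustration of that encoder/decoder architecture, here is a small NumPy sketch of a nonlinear autoencoder with logistic hidden layers around a linear code layer, trained by backpropagation on the squared reconstruction error. The layer sizes, learning rate, and other details are assumptions, not something given in the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_deep_autoencoder(X, H, M, lr=0.1, epochs=5000, seed=0):
    """Nonlinear autoencoder: input -> logistic hidden -> linear code
    -> logistic hidden -> linear output, trained by backpropagation
    on the squared reconstruction error.  X has shape (num_points, N)."""
    rng = np.random.default_rng(seed)
    n, N = X.shape

    def init(rows, cols):
        # Small random weights plus zero biases.
        return 0.1 * rng.standard_normal((rows, cols)), np.zeros(cols)

    W1, b1 = init(N, H)   # encoder hidden layer
    W2, b2 = init(H, M)   # linear code layer (the bottleneck)
    W3, b3 = init(M, H)   # decoder hidden layer
    W4, b4 = init(H, N)   # linear output layer

    for _ in range(epochs):
        # Forward pass: encode down to the bottleneck, then decode.
        h1 = sigmoid(X @ W1 + b1)
        code = h1 @ W2 + b2
        h2 = sigmoid(code @ W3 + b3)
        out = h2 @ W4 + b4
        # Backward pass: gradients of 0.5 * mean squared reconstruction error.
        d_out = (out - X) / n
        d_h2 = (d_out @ W4.T) * h2 * (1 - h2)
        d_code = d_h2 @ W3.T
        d_h1 = (d_code @ W2.T) * h1 * (1 - h1)
        # Gradient descent steps.
        W4 -= lr * (h2.T @ d_out)
        b4 -= lr * d_out.sum(axis=0)
        W3 -= lr * (code.T @ d_h2)
        b3 -= lr * d_h2.sum(axis=0)
        W2 -= lr * (h1.T @ d_code)
        b2 -= lr * d_code.sum(axis=0)
        W1 -= lr * (X.T @ d_h1)
        b1 -= lr * d_h1.sum(axis=0)

    encode = lambda x: sigmoid(x @ W1 + b1) @ W2 + b2
    decode = lambda c: sigmoid(c @ W3 + b3) @ W4 + b4
    return encode, decode
```

After training, `encode` maps an input vector to its code and `decode` maps a code back to a reconstruction, which are the two mappings the lecture says we end up with.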