In this video, I'm going to introduce Principal Component Analysis, which is a very widely used technique in signal processing. The idea of principal components analysis is that high-dimensional data can often be represented using a much lower-dimensional code. This happens when the data lies near a linear manifold in the high-dimensional space. So the idea is, if we can find this linear manifold, we can project the data onto the manifold and then just represent where it is on the manifold. And we haven't lost much, because in the directions orthogonal to the manifold, there's not much variation in the data. As we'll see, we can do this operation efficiently using standard principal components methods, or we can do it inefficiently using a neural net with one hidden layer, where both the hidden units and the output units are linear. The advantage of doing it with the neural net is that we can then generalize the technique to deep neural nets in which the code is a nonlinear function of the input, and our reconstruction of the data from the code is also a nonlinear function of the code. This enables us to deal with curved manifolds in the input space. So we represent data by where it gets projected on the curved manifold, and this is a much more powerful representation.

In principal components analysis, we have N-dimensional data, and we want to represent it using fewer than N numbers. So we find M orthogonal directions in which the data has the most variance, and we ignore the directions in which the data doesn't vary much. The M principal directions form a lower-dimensional subspace, and we represent an N-dimensional data point by its projections onto these M directions in the lower-dimensional space. So we've lost all information about where the data point is located in the remaining orthogonal directions, but since these don't have much variance, we haven't lost that much information. If we wanted to reconstruct the data point from our representation in terms of M numbers, we'd use the mean value for all the N minus M directions that are not represented. And then the error in our reconstruction would be the sum, over all those unrepresented directions, of the squared difference between the value of the data point on that direction and the mean value on that direction.

This is most easily seen in a picture. So consider two-dimensional data that's distributed according to an elongated Gaussian like this. The ellipse is meant to show roughly the one-standard-deviation contour of the Gaussian. Now consider a data point like that red one. If we used principal components analysis with a single component, that component would be the direction in the data that has the greatest variance. So to represent the red point, we'd represent how far along that direction it lies. In other words, we'd represent the projection of the red point onto that line, i.e. the green point. When we need to reconstruct the red point, we simply use the mean value of all the data points in the direction that we've ignored. In other words, our reconstruction will be a point on that black line. And so the loss in the reconstruction will be the squared distance between the red point and the green point. That is, we'd have lost the difference between the data point and the mean value of all the data in the direction we're not representing, which is the direction of least variance. So we obviously minimize our loss if we choose to ignore the direction of least variance.
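To make the projection-and-reconstruction idea concrete, here is a minimal NumPy sketch of it. The toy data, the variable names, and the choice of M are illustrative assumptions, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 500 points in N = 2 dimensions, elongated along one direction.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])

M = 1                                   # number of principal components to keep
mean = X.mean(axis=0)
Xc = X - mean                           # centre the data

# Directions of greatest variance = top eigenvectors of the covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
components = eigvecs[:, ::-1][:, :M]    # keep the M highest-variance directions

codes = Xc @ components                 # M numbers per data point (the projection)
reconstruction = codes @ components.T + mean  # the mean fills in the ignored directions

# Squared reconstruction error: what we lost in the low-variance directions.
error = np.sum((X - reconstruction) ** 2, axis=1)
print("mean squared reconstruction error:", error.mean())
```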
Now, we can actually implement PCA, or a version of it, using backpropagation, but it's not very efficient. What we do is make a network in which the output of the network is the reconstruction of the data, and we try to minimize the squared error in the reconstruction. The network has a central bottleneck that only has M hidden units, and those are going to correspond to the principal components, or something like them. So it looks like this: we have an input vector, we project that onto a code vector, and from the code vector we construct an output vector. The aim is to make the output vector as similar as possible to the input vector. The activities of the hidden units in the code vector form a bottleneck, so the code vector is a compressed representation of the input vector.

If the hidden units and the output units are linear, then an autoencoder like this will learn codes that minimize the squared reconstruction error, and that's exactly what principal components analysis does. It will get exactly the same reconstruction error as principal components analysis does, but it won't necessarily have hidden units that correspond exactly to the principal components. They will span the same space as the first M principal components, but there may be a rotation and skewing of those axes. So the incoming weight vectors of the code units, which are what represent the directions of the components, may not be orthogonal, and unlike principal components analysis, they will typically have equal variances. But the space spanned by the incoming weight vectors of those code units will be exactly the same as the space spanned by the M principal components. So in that sense this network does something equivalent to principal components analysis. It's just that if we use stochastic gradient descent learning for this network, it will typically be much less efficient than the standard algorithms used for principal components, although if there's a huge amount of data, it might actually be more efficient.

The main point of implementing principal components analysis using backpropagation in a neural net is that it allows us to generalize principal components analysis. If we use a neural net that has nonlinear layers before and after the code layer, it should be possible to represent data that lies on a curved manifold, rather than a linear manifold, in a high-dimensional space. And this is much more general. So our network will look something like this: there'll be an input vector, and then one or more layers of non-linear hidden units; typically we use logistic units. Then there'll be a code layer, which might be linear units. Following the code layer, there'll be one or more layers of non-linear hidden units, and then there'll be an output vector, which we train to be as similar as possible to the input vector.

So this is a curious network in which we're using a supervised learning algorithm to do unsupervised learning. The bottom part of the network is an encoder, which takes the input vector and converts it into a code using a non-linear method. The top part of the network is a decoder, which takes the nonlinear code and maps it back to a reconstruction of the input vector. So after we've done the learning, we have mappings in both directions.
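Here is a minimal PyTorch sketch of that kind of autoencoder: a non-linear encoder, a small code layer, and a non-linear decoder, trained with squared reconstruction error by gradient descent. The layer sizes, the stand-in data, and the training settings are illustrative assumptions, not details from the lecture.

```python
import torch
import torch.nn as nn

N, M = 784, 30                          # input dimension and code (bottleneck) size

encoder = nn.Sequential(                # bottom part: input vector -> code
    nn.Linear(N, 256), nn.Sigmoid(),    # a layer of logistic hidden units
    nn.Linear(256, M),                  # linear code layer
)
decoder = nn.Sequential(                # top part: code -> reconstruction
    nn.Linear(M, 256), nn.Sigmoid(),
    nn.Linear(256, N),
)
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()                  # squared reconstruction error

x = torch.rand(64, N)                   # a stand-in minibatch of input vectors
for step in range(100):
    reconstruction = model(x)           # output trained to match the input
    loss = loss_fn(reconstruction, x)   # supervised loss used for unsupervised learning
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

codes = encoder(x)                      # after learning, the encoder gives the code
```

With both linear activations and a single hidden layer, this setup reduces to the linear autoencoder discussed above; the non-linear layers are what let it model curved manifolds.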