In this video, we're going to look at the issue of training deep autoencoders. People thought of these a long time ago, in the mid 1980s, but they simply couldn't train them well enough for them to do significantly better than principal components analysis. There were various papers published about them, but no good demonstrations of impressive performance. After we developed methods of pre-training deep networks one layer at a time, Russ Salakhutdinov and I applied these methods to pre-training deep autoencoders, and for the first time we got much better representations out of deep autoencoders than we could get from principal components analysis.

Deep autoencoders always seemed like a really nice way to do dimensionality reduction, because it seemed like they should work much better than principal components analysis. They provide flexible mappings in both directions, and the mappings can be non-linear. Their learning time should be linear, or better, in the number of training cases. And after they've been learned, the encoding part of the network is fairly fast, because it's just a matrix multiply for each layer. Unfortunately, it was very difficult to optimize deep autoencoders using backpropagation. Typically people tried small initial weights, and then the backpropagated gradient died, so for deep networks they never got off the ground. But now we have much better ways to optimize them: we can use unsupervised layer-by-layer pre-training, or we can simply initialize the weights sensibly, as in Echo State Networks.

The first really successful deep autoencoders were learned by Russ Salakhutdinov and me in 2006. We applied them to the MNIST digits. So we started with images with 784 pixels, and we then encoded those, via three hidden layers, into 30 real-valued activities in a central code layer. We then decoded those 30 real-valued activities back to 784 reconstructed pixels. We used a stack of restricted Boltzmann machines to initialize the weights used for encoding, and we then took the transposes of those weights and initialized the decoding network with them. So initially the 784 pixels were reconstructed using a weight matrix that was just the transpose of the weight matrix used for encoding them. But after the four restricted Boltzmann machines had been trained and unrolled to give the transposes for decoding, we then applied backpropagation to minimize the reconstruction error of the 784 pixels. In this case we were using a cross-entropy error, because the pixels were represented by logistic units, so that error was backpropagated through the whole deep net. And once we started backpropagating the error, the weights used for reconstructing the pixels became different from the weights used for encoding the pixels, although they typically stayed fairly similar.

This worked very well. If you look at the first row, that's one random sample from each digit class. If you look at the second row, that's the reconstruction of that random sample by the deep autoencoder that uses 30 linear hidden units in its central code layer. So the data has been compressed to 30 real numbers and then reconstructed. If you look at the eight, you can see that the reconstruction is actually better than the original: it's got rid of the little defect in the eight, because it doesn't have the capacity to encode it. If you compare that with linear principal components analysis, you can see the deep autoencoder is much better. A linear mapping to 30 real numbers cannot do nearly as good a job of representing the data.
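To make the layer-by-layer pre-training concrete, here is a minimal numpy sketch of training a stack of restricted Boltzmann machines with one-step contrastive divergence (CD-1). The intermediate layer sizes, learning rate, batch size, and number of epochs are illustrative choices, not the values used in the 2006 experiments, and every hidden unit is treated as logistic even though the lecture's 30-unit code layer was linear.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.1, batch=100, rng=None):
    """Train one RBM with one-step contrastive divergence (CD-1).

    data: (n_cases, n_visible) activities in [0, 1].
    Returns the weight matrix and the visible/hidden biases.
    """
    rng = rng or np.random.default_rng(0)
    n_vis = data.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    b_vis = np.zeros(n_vis)
    b_hid = np.zeros(n_hidden)

    for _ in range(epochs):
        rng.shuffle(data)                       # shuffle training cases in place
        for i in range(0, len(data), batch):
            v0 = data[i:i + batch]
            # Positive phase: hidden probabilities given the data.
            h0 = sigmoid(v0 @ W + b_hid)
            h0_sample = (rng.random(h0.shape) < h0).astype(float)
            # Negative phase: one step of reconstruction.
            v1 = sigmoid(h0_sample @ W.T + b_vis)
            h1 = sigmoid(v1 @ W + b_hid)
            # CD-1 approximation to the log-likelihood gradient.
            W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
            b_vis += lr * (v0 - v1).mean(axis=0)
            b_hid += lr * (h0 - h1).mean(axis=0)
    return W, b_vis, b_hid

def pretrain_stack(data, layer_sizes):
    """Greedy layer-by-layer pre-training: each RBM is trained on the
    hidden activities produced by the RBM below it."""
    weights, vis_biases, hid_biases = [], [], []
    x = data
    for n_hidden in layer_sizes:
        W, b_vis, b_hid = train_rbm(x.copy(), n_hidden)
        weights.append(W)
        vis_biases.append(b_vis)
        hid_biases.append(b_hid)
        x = sigmoid(x @ W + b_hid)              # feed activities upward
    return weights, vis_biases, hid_biases

# Hypothetical usage: 784 pixels -> three hidden layers -> 30-unit code.
# The intermediate sizes below are placeholders, not the lecture's values.
# weights, vis_b, hid_b = pretrain_stack(mnist_images, [1000, 500, 250, 30])
```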
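And here is a sketch of the unrolling and fine-tuning stage described above: the decoder starts out as the exact transpose of the encoder, and backpropagating the cross-entropy reconstruction error then lets the encoding and decoding weights drift apart. The helper names (`pretrain_stack`, `mnist_images`, `minibatches`) and the choice of treating every unit, including the code layer, as logistic are assumptions made for this sketch, not details from the lecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unroll(weights, vis_biases, hid_biases):
    """Unroll a pretrained RBM stack into a deep autoencoder.

    The encoder reuses the RBM weights; the decoder is initialised with
    their transposes (and the RBMs' visible biases), so at the start the
    reconstruction path is the mirror image of the encoding path.
    """
    enc = [(W, b_h) for W, b_h in zip(weights, hid_biases)]
    dec = [(W.T.copy(), b_v) for W, b_v in zip(reversed(weights),
                                               reversed(vis_biases))]
    return enc + dec                 # list of (weight matrix, bias) per layer

def finetune_step(layers, x, lr=0.001):
    """One backpropagation step minimising the cross-entropy between the
    input pixels and their reconstruction (all units logistic here)."""
    # Forward pass, remembering every layer's activities.
    activities = [x]
    for W, b in layers:
        activities.append(sigmoid(activities[-1] @ W + b))
    recon = activities[-1]

    # Cross-entropy reconstruction error (pixels treated as probabilities).
    loss = -np.mean(np.sum(x * np.log(recon + 1e-12)
                           + (1 - x) * np.log(1 - recon + 1e-12), axis=1))

    # Backward pass: for a logistic output with cross-entropy error the
    # first delta is simply (reconstruction - target).
    delta = (recon - x) / len(x)
    for l in range(len(layers) - 1, -1, -1):
        W, b = layers[l]
        a_prev = activities[l]
        dW = a_prev.T @ delta
        db = delta.sum(axis=0)
        if l > 0:
            delta = (delta @ W.T) * a_prev * (1 - a_prev)
        # Encoder and decoder weights are updated independently from here
        # on, so they gradually become different (though usually similar).
        layers[l] = (W - lr * dW, b - lr * db)
    return loss

# Hypothetical usage, continuing from the pre-training sketch:
# layers = unroll(weights, vis_b, hid_b)
# for batch in minibatches(mnist_images):
#     finetune_step(layers, batch)
```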