In this video, I'm going to talk about alternative pre-training methods for learning deep neural nets. I introduced pre-training using restricted Boltzmann machines trained with contrastive divergence, but after that, people discovered many other ways to pre-train layers of features. And indeed, if you initialize the weights correctly, you may not need pre-training at all, provided you have enough labeled data.

We've seen some of the neat things that can be done with the codes produced by deep auto-encoders. I now want to consider shallow auto-encoders that have just one hidden layer. Restricted Boltzmann machines can be viewed as a kind of shallow auto-encoder, particularly if they're trained with contrastive divergence, because then they're trying to make the reconstructions look like the data. Viewed as an auto-encoder, a restricted Boltzmann machine has very strong regularization, because the hidden units are only allowed to have binary activities, and this restricts their capacity a lot. If we train restricted Boltzmann machines with maximum likelihood, they're not at all like auto-encoders. One way to see that is to consider a pixel that is pure noise: an auto-encoder would try to reconstruct whatever noise value it had, whereas a restricted Boltzmann machine trained with maximum likelihood would completely ignore that pixel and model it using just the bias for that input.

So, since we can view a restricted Boltzmann machine as a kind of strongly regularized auto-encoder, maybe we can replace the RBMs that we use for pre-training with a stack of auto-encoders. It turns out that if you do that, pre-training is not as effective, at least if you use shallow auto-encoders that are regularized just by penalizing the squared weights. Stacking these auto-encoders doesn't work as well as stacking restricted Boltzmann machines.

However, there's a different kind of auto-encoder that does work as well, and that's the denoising auto-encoder, which has been studied extensively by the group in Montreal. A denoising auto-encoder adds noise to each input vector by setting many of its components to zero, with different components chosen for different input vectors. This resembles dropout, but for the inputs rather than the hidden units. The denoising auto-encoder is still required to reconstruct the inputs that have been set to zero, so it can't just copy its input. The danger with a shallow auto-encoder is that, given enough hidden units, it might just copy each pixel to one hidden unit and then reconstruct that pixel from that hidden unit. A denoising auto-encoder clearly can't do that, so it has to use its hidden units to capture correlations between inputs, so that it can use the values of some inputs to help it reconstruct the inputs that have been zeroed out.

If we use a stack of denoising auto-encoders, pre-training is very effective. There are some cases in which RBMs still work better, but in most cases denoising auto-encoders are more effective. It's also much simpler to evaluate the pre-training with a denoising auto-encoder, because we can easily compute the value of the objective function. When we pre-train a restricted Boltzmann machine with contrastive divergence, we can't compute the value of the real objective function we're trying to minimize, so we often just use the squared reconstruction error, which is not actually what's being minimized.
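To make the zero-masking idea concrete, here is a minimal sketch of one training step for a single-hidden-layer denoising auto-encoder in plain NumPy. The layer sizes, the 25% masking rate, the tied weights, and the sigmoid-plus-linear layer choices are illustrative assumptions, not the exact setup used in the experiments mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 784 inputs (e.g. pixels), 256 hidden units.
n_vis, n_hid = 784, 256
W = rng.normal(0.0, 0.01, size=(n_vis, n_hid))  # encoder weights (decoder uses W.T)
b_hid = np.zeros(n_hid)
b_vis = np.zeros(n_vis)

def denoising_step(x, drop_prob=0.25, lr=0.01):
    """One gradient step: corrupt x, encode, decode, reconstruct the CLEAN x."""
    global W, b_hid, b_vis
    # Corrupt the input by zeroing a random subset of its components.
    mask = rng.random(x.shape) > drop_prob
    x_tilde = x * mask

    # Encode the corrupted input, decode with tied weights.
    h = sigmoid(x_tilde @ W + b_hid)
    x_hat = h @ W.T + b_vis                      # linear reconstruction

    # Squared reconstruction error measured against the uncorrupted input.
    err = x_hat - x
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))

    # Backpropagate through the tied-weight auto-encoder.
    d_xhat = err / x.shape[0]
    d_h = (d_xhat @ W) * h * (1.0 - h)
    dW = x_tilde.T @ d_h + d_xhat.T @ h          # shared W gets both contributions
    W -= lr * dW
    b_hid -= lr * d_h.sum(axis=0)
    b_vis -= lr * d_xhat.sum(axis=0)
    return loss

# Example: one step on a random batch of 32 "images".
batch = rng.random((32, n_vis))
print(denoising_step(batch))
```

Because the loss is an ordinary reconstruction error on the clean input, its value can be printed at every step, which is exactly the point made above about how easy it is to evaluate this kind of pre-training.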
In a denoising auto-encoder, we can print out the value of the thing we're trying to minimize, and that's very helpful. One disadvantage of the denoising auto-encoder is that it lacks the nice variational bound we get with restricted Boltzmann machines, but that's only of theoretical interest, because the bound only applies when the restricted Boltzmann machine is trained with maximum likelihood.

Yet another kind of auto-encoder is the contractive auto-encoder, which was also developed by the group in Montreal. The way this works is that we try to make the hidden activities as insensitive as possible to the inputs. Of course, the hidden units can't just ignore the inputs altogether, because they have to be able to reconstruct them. We achieve the insensitivity by penalizing the squared gradient of each hidden activity with respect to each input, so we try to make each hidden unit change very little when we change an input value. (A small sketch of this penalty appears at the end of this section.) Contractive auto-encoders also work very well for pre-training. Their codes tend to have the property that only a small subset of the hidden units are in their sensitive range; for different parts of the input space it's a different subset, so this active subset acts like a sparse code. The other hidden units are saturated and insensitive. RBMs actually behave very similarly: after they've been trained, many of the hidden units are saturated, and the working set of unsaturated ones is different for different training cases.

I want to finish by summarizing my current view of pre-training. There are now many different ways to do layer-by-layer pre-training that discovers good features. When our data set does not have a huge number of labels, this way of discovering features before you ever use the labels is very helpful for the subsequent discriminative fine-tuning. It discovers the features without using the information in the labels, and then the information in the labels is used only for fine-tuning the decision boundaries between classes. Pre-training is especially useful when we have a lot of unlabeled data, because then it can do a very good job of discovering interesting features using a lot of data.

For very large labeled data sets, however, initializing the weights that are going to be used for supervised learning with unsupervised pre-training is not necessary, even if the nets are deep. Pre-training was the first good way to initialize the weights for deep nets, but now we have lots of other ways. However, even if we have a lot of labels, if we make the nets much larger again, we'll need pre-training again.

So, an argument I often have with people from Google goes like this. They say: we've got lots and lots of labeled data, so we don't need regularization methods; our nets won't overfit anyway because we've got so much data. The counter-argument is that that's only because you're using nets that are much too small. You should use much, much bigger nets on much, much more powerful computers, and then you'll start overfitting again and you'll need these regularization methods, like dropout and pre-training. If you ask which regime the brain is in, the brain is clearly in the regime where it has a huge number of parameters compared with the amount of data it's got. And so for the brain, at least, regularization methods are very important.
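As promised above, here is a minimal sketch of the contractive penalty for a single layer of sigmoid hidden units, again in plain NumPy. For h_j = sigmoid(sum_i x_i W_ij + b_j), the partial derivative dh_j/dx_i is h_j(1 - h_j) W_ij, so the sum of squared gradients has a simple closed form. The layer sizes and the idea of adding the penalty to the reconstruction error with a weight lam are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b):
    """Sum over hidden units and inputs of (dh_j/dx_i)^2, averaged over the batch.

    For h_j = sigmoid(sum_i x_i W_ij + b_j), dh_j/dx_i = h_j (1 - h_j) W_ij,
    so the penalty is sum_j (h_j (1 - h_j))^2 * sum_i W_ij^2.
    """
    h = sigmoid(x @ W + b)                # (batch, n_hid)
    sens = (h * (1.0 - h)) ** 2           # squared sigmoid slope for each hidden unit
    col_sq = np.sum(W ** 2, axis=0)       # sum_i W_ij^2 for each hidden unit j
    return np.mean(np.sum(sens * col_sq, axis=1))

# Example with illustrative sizes; during training this term would be added to the
# reconstruction error, e.g. total_loss = recon_error + lam * contractive_penalty(x, W, b).
rng = np.random.default_rng(0)
x = rng.random((32, 784))
W = rng.normal(0.0, 0.01, size=(784, 256))
b = np.zeros(256)
print(contractive_penalty(x, W, b))
```

A hidden unit whose sigmoid is saturated contributes almost nothing to this penalty and is insensitive to the inputs, which matches the sparse, mostly saturated codes described above.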