In this video I'm going to show how we can first learn a deep belief net by stacking up restricted Boltzmann machines, and then treat that as a deep neural net that we fine-tune discriminatively. So instead of fine-tuning it to be better at generation, as we did in the previous video, we're going to fine-tune it to be better at discriminating between classes. This works very well and led to a big renewal of interest in neural networks. In speech recognition it has had a major influence, and many leading groups are now switching to using deep neural nets in order to reduce the error rate in speech recognition.

I now want to talk about fine-tuning these deep networks to be better at discrimination. We first learn one layer of features at a time by stacking up restricted Boltzmann machines. Then we treat this as pre-training that finds a good initial set of weights for the deep network, and we fine-tune those weights using some local search procedure. In the previous video I showed you how to use contrastive wake-sleep to fine-tune a deep network so that it was better at generating its inputs. In this video we're going to use back propagation to fine-tune the model to be better at discrimination. If we do this, it overcomes many of the standard limitations of back propagation: it makes it much easier to learn deep nets, and it makes those nets generalise better.

We need to understand why back propagation works better when we pre-train the weights, and there are really two effects: an effect on optimization and an effect on generalization. The pre-training scales really well to big networks, especially if each layer has locality. If we're doing vision, for example, and we have local receptive fields in each layer, then there's not much interaction between widely separated locations, and so it's very easy to learn a big layer more or less in parallel. When we do pre-training, we don't start back propagation until we've already learned sensible feature detectors, and these feature detectors should be very helpful for discrimination. So the initial gradients are much more sensible than if we started from random weights, and back propagation doesn't need to do a global search. It just needs to do a local search from a sensible starting point.

In addition to being easier to optimize, pre-trained nets exhibit much less overfitting. That's because most of the information in the final weights comes from modeling the distribution of input vectors, and these input vectors, if you're dealing with something like images, generally contain a lot more information than the labels. A label typically only contains a few bits of information to constrain the mapping from input to output, whereas an image contains a lot of information which will constrain any generative model of a set of images. The information in the labels is only used for the final fine-tuning, and because by that stage we've already decided on the feature detectors, we're not squandering that precious information designing feature detectors from scratch. The fine-tuning only makes slight changes to the feature detectors we learned in the generative pre-training phase, and those are the changes required to get the category boundaries in the right place. The important thing is that back propagation is not being required to discover new features, and so it doesn't need nearly as much labeled data. In fact, this type of learning works well when most of the data is unlabeled, because the generative pre-training can make use of the unlabeled data.
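As a concrete illustration of the first stage of this recipe (this is a minimal sketch, not code from the lecture), here is a numpy version of greedy layer-by-layer pre-training of a stack of binary restricted Boltzmann machines using one-step contrastive divergence. The layer sizes, learning rate, number of epochs, and the toy random data are all illustrative assumptions.

# Minimal sketch: greedy layer-wise RBM pre-training with CD-1.
# Hyperparameters and layer sizes below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Train one binary RBM with one-step contrastive divergence (CD-1)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W + b_hid)                      # data-driven hidden probabilities
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sampled hidden states
        p_v1 = sigmoid(h0 @ W.T + b_vis)                    # one-step reconstruction
        p_h1 = sigmoid(p_v1 @ W + b_hid)
        # CD-1 update: positive (data) statistics minus negative (reconstruction) statistics
        W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
        b_vis += lr * (v0 - p_v1).mean(axis=0)
        b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_hid

def pretrain_stack(data, layer_sizes):
    """Stack RBMs: each RBM is trained on the hidden activities of the one below."""
    weights, biases, x = [], [], data
    for n_hidden in layer_sizes:
        W, b = train_rbm(x, n_hidden)
        weights.append(W)
        biases.append(b)
        x = sigmoid(x @ W + b)   # hidden probabilities become the data for the next layer
    return weights, biases

# Toy usage with random binary "images" standing in for MNIST pixels;
# the 500-500-2000 layer sizes are illustrative.
toy_data = (rng.random((100, 784)) < 0.5).astype(float)
stack_W, stack_b = pretrain_stack(toy_data, layer_sizes=[500, 500, 2000])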
The unlabeled data is still very useful for discovering good features. There is an obvious objection to this type of learning, which is that when we do generative pre-training, we'll be learning lots of features that are useless for the particular discriminative task we want the net to do. Consider, for example, that you might want the net to discriminate between shapes, or you might want the net to discriminate between different poses of one shape. They need very different features, and if you don't know the task in advance, you'll inevitably learn features that are never used. When computers were much smaller, that was a serious objection. But now that computers are large enough, we can afford to learn features that are never used. And we can afford it because among all the features we learn, there will be some that are much more useful than the raw inputs, and that more than makes up for the fact that we have learned some features that aren't helpful for the particular task we're interested in.

So let's apply this to modeling the MNIST digits. We'll learn three hidden layers of features entirely unsupervised. Once we've done that learning, when we generate from the model, it will generate things that look like real digits, and it'll generate them from all the different classes. It will typically take a while before it switches from one class to another, because it'll tend to stay in the same ravine for a while before it jumps to another ravine. But the question is, are the features that we've learned that way useful for doing discrimination? All we need to do is add a final 10-way softmax at the top, fine-tune it with back propagation, and see if we do better than purely discriminative training.

So here are the results on the permutation-invariant MNIST task. What I mean by permutation invariant is that if we were to apply a fixed random permutation to all the pixels, the same permutation to every test and training case, the results of our algorithm wouldn't change. That's clearly not true for something like a convolutional net; a convolutional net has been told something about the task. By applying this fixed permutation, we destroy all simple ways of telling the net something about the spatial nature of the task.

If you apply standard back propagation, it's hard to do better than 1.6% errors. John Platt and myself have both tried quite hard, applying standard back propagation with various different architectures, and we're both quite good at doing it. You can actually beat 1.6% by using constraints on the incoming weight vectors of the hidden units: if you use an appropriate restriction on the length of an incoming weight vector, you can do a bit better than 1.6%. Support vector machines can get 1.4%, and this was one of the pieces of evidence that led to support vector machines supplanting back propagation. If you pre-train a network using a stack of restricted Boltzmann machines and then fine-tune it to be better at generating the joint density of digit images and their labels, you can get down to 1.25%. If you train a stack of restricted Boltzmann machines, simply put a 10-way softmax on top, and fine-tune it, you can get to 1.15%. And with more fiddling around, you can get that down to about 1%. So you can do a lot better than standard back propagation, and also better than support vector machines, by using generative pre-training followed by discriminative fine-tuning.
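Continuing the sketch above (again an illustrative sketch, not the lecture's code), this is what the second stage might look like: put a 10-way softmax on top of the pre-trained layers and fine-tune all the weights with back propagation on the cross-entropy loss of the labels. The hyperparameters and the toy random labels are assumptions for illustration only.

# Minimal sketch: discriminative fine-tuning of the pre-trained stack with backprop.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fine_tune(X, y, weights, biases, n_classes=10, epochs=10, lr=0.1):
    """Add a softmax output layer and backpropagate through the pre-trained layers."""
    W_out = 0.01 * rng.standard_normal((weights[-1].shape[1], n_classes))
    b_out = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                                # one-hot targets
    for _ in range(epochs):
        # Forward pass through the pre-trained hidden layers.
        activations = [X]
        for W, b in zip(weights, biases):
            activations.append(sigmoid(activations[-1] @ W + b))
        probs = softmax(activations[-1] @ W_out + b_out)
        # Backward pass: gradient of softmax cross-entropy w.r.t. the logits.
        delta = (probs - Y) / len(X)
        grad_W_out = activations[-1].T @ delta
        grad_b_out = delta.sum(axis=0)
        delta = delta @ W_out.T
        W_out -= lr * grad_W_out
        b_out -= lr * grad_b_out
        for i in reversed(range(len(weights))):
            delta *= activations[i + 1] * (1 - activations[i + 1])  # sigmoid derivative
            grad_W = activations[i].T @ delta
            grad_b = delta.sum(axis=0)
            delta = delta @ weights[i].T
            weights[i] -= lr * grad_W      # slight adjustment of the pre-trained features
            biases[i] -= lr * grad_b
    return weights, biases, W_out, b_out

# Toy usage, continuing the pre-training sketch above (random labels for illustration).
toy_labels = rng.integers(0, 10, size=100)
fine_tune(toy_data, toy_labels, stack_W, stack_b)

Because the pre-trained weights already encode useful feature detectors, the backprop phase only needs to nudge them and learn the output layer, which is why it can get away with relatively little labeled data.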
Marc'Aurelio Ranzato, working in Yann LeCun's group, also showed, using a slightly different pre-training method, that pre-training helps for models that have more data and better priors. So they used an additional 60,000 distorted digit images, which gave them a lot more training data, and they also used a convolutional neural network; Yann's group is the best group at tuning those. With back propagation they managed to get down to 0.49%. When they did the unsupervised layer-by-layer pre-training, followed by back propagation, they got down to 0.39%, which at the time was a record.

You may remember this picture from the first lecture. It was one of the examples I gave of the successes of neural nets; it's the same picture. Back then, I said we could get down to 20.7% by pre-training and then fine-tuning with back propagation, and that the previous speaker-independent record on TIMIT was 24.4%, which actually required averaging several models. Li Deng at Microsoft Research picked up on this result immediately and collaborated on improving it, and this has led to a big change in speech recognition. If you look at this news story, it will refer you to a blog where the chief research officer for Microsoft talks about the big improvements in speech recognition caused by using deep neural nets.