In this video we'll look in more detail at what happens when a neural network is discriminatively fine-tuned after it's first been pre-trained as a stack of Boltzmann machines. What we'll see is that the weights in the lower layers hardly change at all, but that nevertheless these tiny changes make a big difference to the classification performance of the net, because they put the decision boundaries in the right places. We'll also see that the effect of pre-training is to make deeper networks more effective than shallower ones; without pre-training it's often the other way around. Finally, I'll give a fairly general argument about why it makes sense to start by doing generative training, and only after this is well under way to consider discriminative training.

So now we're going to look at some work done in Yoshua Bengio's lab, examining what happens during fine-tuning after a net's been generatively pre-trained. If you look on the left, there are the receptive fields of the first hidden layer of feature detectors, after the generative pre-training but before the fine-tuning. Then on the right, there are the same receptive fields after the fine-tuning, and you'll see almost nothing has changed. Nevertheless, the changes helped with discrimination.

Here's an example of how pre-training reduces the test errors for networks with one hidden layer. The task was discriminating between digits in a very large set of distorted digits. And you can see that after the back-propagation fine-tuning, the networks with pre-training almost always did better than the networks without pre-training. The effect gets even bigger if you use deeper networks. So here you can see that there's basically no overlap between the two distributions: the deep networks with pre-training have got better than the shallow networks, and the deep networks without pre-training have got worse than the shallow networks.

This is showing you the classification error, and the variation in classification error, as you change the number of layers when you're not doing pre-training. You can see that two layers appears to be best, and by the time you've got four layers you're doing considerably worse. By contrast, if you use pre-training, four layers is better than two layers, there's much less variation, and the error is lower.

This is a visualization made with t-SNE of what happens to the weights during training, for both pre-trained and non-pre-trained networks. They're all plotted in the same space, but you can see they form two distinct classes of networks: the ones at the top are the networks without pre-training, and the ones at the bottom are the networks with pre-training. Each point shows a model in function space. It's no use comparing weight vectors, because two nets might differ just by having two of their hidden units swapped round; they'd behave in exactly the same way, but their weights would look very different. In order to compare them, you have to compare the functions they implement. The way to do that is to have a suite of test cases, look at the outputs the networks produce on those test cases, and then concatenate those outputs into one great long vector. If two networks produce very similar outputs on all the test cases, that concatenated vector will be very similar for the two networks. Now you take those concatenated output vectors and you plot them in 2D using t-SNE.
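Just to make that procedure concrete, here is a minimal sketch (my own, not from the paper or lecture) of comparing networks in function space rather than weight space and embedding the result with scikit-learn's TSNE. It treats each network as a black-box callable, and names such as function_space_embedding and signatures are illustrative assumptions.

```python
# A minimal sketch of function-space comparison of trained networks.
# Assumes numpy and scikit-learn are available; names are illustrative.
import numpy as np
from sklearn.manifold import TSNE

def function_space_embedding(networks, test_inputs):
    """networks: list of callables, each mapping a batch of inputs to outputs.
    test_inputs: a fixed suite of test cases, shape (n_cases, n_features)."""
    signatures = []
    for net in networks:
        outputs = np.asarray(net(test_inputs))   # (n_cases, n_outputs)
        signatures.append(outputs.ravel())       # one long concatenated vector per net
    signatures = np.stack(signatures)            # (n_networks, n_cases * n_outputs)
    # t-SNE maps each network's output signature to a point in 2-D.
    # Note: perplexity must be smaller than the number of networks compared.
    return TSNE(n_components=2, perplexity=min(5, len(networks) - 1)).fit_transform(signatures)
```

Because two networks that differ only by a permutation of their hidden units implement the same function, they get essentially identical signatures here, which is exactly the property a comparison of raw weight vectors lacks.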
The colors show the stages of training, so if you look at the networks at the top, there's an initial blob in dark blue. And then you can see that they're all moving in roughly the same direction. In other words, the networks after one epoch of learning are all more similar to one another than they are to the initial networks. That's even more pronounced with the pre-trained networks at the bottom. So the color tells you which epoch you're in. The trajectories at the top, without pre-training, show that different networks end up in different places in function space, and they're quite widely spread. The trajectories at the bottom show that with pre-training you end up in a quite different region of function space, and the networks tend to be more similar to one another. But the main point is there's no overlap. The kinds of solutions you find if you pre-train the networks generatively are just qualitatively different from the kinds of solutions you find if you start with small random weights.

The last thing I want to do in this video is to explain why pre-training makes sense. So let's imagine that the way we generated pairs of an image and a label was by taking the stuff in the real world, using that to generate an image, for example by taking a photograph of something, and then, having generated the image, attaching a label to it that didn't depend on the stuff in the world. That is, conditional on the image, the label is independent of the stuff in the world; the label depends just on the pixels in the image. That would be the case, for example, if the label told us whether the top left pixel was similar to the bottom right pixel. Now if we generated images that way, then it would make sense to try and learn a mapping from images to labels, because the labels depend directly on the images.

But actually, it's more plausible that the way we generate image-label pairs is by there being stuff in the world that gives rise to the image, and the reason the image has the name it has is because of the stuff in the world, not because of the pixels in the image. So you see a cow, you take a photograph, and you call that a photograph of a cow, because you were looking at a cow when you took it. Now the point is, there's a high bandwidth from the stuff in the world to the image, and there's a low bandwidth from the stuff in the world to the label. For example, if I just say "cow", you don't know whether the cow is upside down, whether it was brown or black-and-white, whether it was alive or dead, how big it was, what else was in the image, or whether it was facing you or facing away from you. None of those things are conveyed by the label. If you see an image with thousands and thousands of pixels, you typically know all of those things. You get much, much more information about the causes of an image by looking at the image than you do by looking at the label of the image.

So in that situation, where there's a high bandwidth pathway from the world to the image, and a low bandwidth pathway from the world to the label, because the label typically contains very few bits, it makes much more sense to try and recover the label by first inverting the high bandwidth pathway to get back to the stuff in the world that caused the image, and then, having recovered the stuff in the world that caused the image, deciding what label it would be given. That's a much more plausible model of how we assign names to things in images, and it justifies having a pre-training phase where you try and go from the image to its underlying causes, followed by a discriminative phase where you try and go from those underlying causes to the label.
And perhaps you slightly fine-tune the mapping from the image to the underlying causes.
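Since the whole argument is about doing generative training first and discriminative training second, here is a minimal, self-contained sketch of that two-phase recipe on toy data: one RBM hidden layer pre-trained with a single step of contrastive divergence, followed by a softmax output layer and backpropagation through the whole stack. The toy data, layer sizes and learning rates are made-up illustrations, not the settings from the experiments discussed in the video.

```python
# A minimal sketch of generative pre-training followed by discriminative
# fine-tuning.  All data and hyperparameters here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: 200 binary "images" of 20 pixels, with 2 classes.
X = (rng.random((200, 20)) > 0.5).astype(float)
y = (X[:, :10].sum(1) > X[:, 10:].sum(1)).astype(int)

n_vis, n_hid, n_out, lr = 20, 16, 2, 0.1
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_h, b_v = np.zeros(n_hid), np.zeros(n_vis)

# ---- Phase 1: generative pre-training of the hidden layer as an RBM (CD-1).
for epoch in range(50):
    h_prob = sigmoid(X @ W + b_h)
    h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
    v_recon = sigmoid(h_samp @ W.T + b_v)
    h_recon = sigmoid(v_recon @ W + b_h)
    W += lr * (X.T @ h_prob - v_recon.T @ h_recon) / len(X)
    b_h += lr * (h_prob - h_recon).mean(0)
    b_v += lr * (X - v_recon).mean(0)

# ---- Phase 2: discriminative fine-tuning with backprop.  In the experiments
# the lecture describes, the pre-trained weights change only slightly here.
V = 0.01 * rng.standard_normal((n_hid, n_out))
b_o = np.zeros(n_out)
onehot = np.eye(n_out)[y]
for epoch in range(200):
    h = sigmoid(X @ W + b_h)                        # hidden features
    logits = h @ V + b_o
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)                    # softmax outputs
    d_logits = (p - onehot) / len(X)                # softmax cross-entropy gradient
    d_h = d_logits @ V.T * h * (1 - h)              # backprop through the sigmoid
    V -= lr * h.T @ d_logits;  b_o -= lr * d_logits.sum(0)
    W -= lr * X.T @ d_h;       b_h -= lr * d_h.sum(0)

print("training accuracy:", (p.argmax(1) == y).mean())
```

In a deep version of this sketch, each additional hidden layer would be pre-trained the same way on the activities of the layer below before the single backpropagation phase adjusts everything slightly at the end.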