In this video, I'm going to talk about some recent work on learning a joint model of captions and feature vectors that describe images. In the previous lecture, I talked about how we might extract semantically meaningful features from images, but we were doing that with no help from the captions. Obviously, the words in a caption ought to be helpful in extracting appropriate semantic categories from images, and similarly, the images ought to be helpful in disambiguating what the words in the caption mean. So the idea is that we're going to train a great big net that gets as input standard computer vision feature vectors extracted from images and bag-of-words representations of captions, and learns how the two input representations are related to each other. At the end of the video, I'll show you a movie of the final network using words to create feature vectors for images and then showing you the closest image in its database, and also using images to create bags of words.

I'm now going to describe some work by Nitish Srivastava, who's one of the TAs for this course, and Ruslan Salakhutdinov, that will appear shortly. The goal is to build a joint density model of captions and images, except that the images are represented by the features standardly used in computer vision rather than by raw pixels. This needs a lot more computation than building a joint density model of labels and digit images, which we saw earlier in the course.

So what they did was they first trained a multi-layer model of images alone. That is, it's really a multi-layer model of the features they extracted from images using standard computer vision methods. Then, separately, they trained a multi-layer model of the word-count vectors from the captions. Once they had trained both of those models, they added a new top layer that is connected to the top layers of both of the individual models. After that, they used further joint training of the whole system, so that each modality could improve the earlier layers of the other modality.

Instead of using a deep belief net, which is what you might expect, they used a deep Boltzmann machine, in which there are symmetric connections between adjacent layers. The further joint training of the whole deep Boltzmann machine is then what allows each modality to change the feature detectors in the early layers of the other modality. That's the reason they used a deep Boltzmann machine. They could also have used a deep belief net and done generative fine-tuning with contrastive wake-sleep, but the fine-tuning algorithm for deep Boltzmann machines may well work better.

This leaves the question of how they pre-trained the hidden layers of a deep Boltzmann machine, because what we've seen so far in the course is that if you train a stack of restricted Boltzmann machines and combine them into a single composite model, what you get is a deep belief net, not a deep Boltzmann machine. So I'm now going to explain how, despite what I said earlier in the course, you can actually pre-train a stack of restricted Boltzmann machines in such a way that you can then combine them to make a deep Boltzmann machine.

The trick is that the top and the bottom restricted Boltzmann machines in the stack have to be trained with weights that are twice as big in one direction as in the other. So the bottom Boltzmann machine, the one that looks at the visible units, is trained with the bottom-up weights being twice as big as the top-down weights. Apart from that scaling, the weights are symmetric. This is what I call scale-symmetric.
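To make that concrete, here is a minimal NumPy sketch of contrastive-divergence training for such a scale-symmetric RBM. This is an illustration under my own assumptions, not the authors' code: all names and sizes are made up, and the only point is that a single weight matrix W is shared by both directions but multiplied by different scale factors on the bottom-up and top-down passes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    # Sample binary states with the given probabilities.
    return (rng.random(p.shape) < p).astype(p.dtype)

def cd1_step(v0, W, b_vis, b_hid, up_scale, down_scale, lr=0.01):
    """One CD-1 update for a scale-symmetric RBM: the same matrix W is used
    in both directions, scaled by up_scale bottom-up and down_scale top-down."""
    # Positive phase: infer hidden states bottom-up with scaled weights.
    p_h0 = sigmoid(up_scale * (v0 @ W) + b_hid)
    h0 = sample(p_h0)
    # Negative phase: one top-down reconstruction, then another bottom-up pass.
    p_v1 = sigmoid(down_scale * (h0 @ W.T) + b_vis)
    p_h1 = sigmoid(up_scale * (p_v1 @ W) + b_hid)
    # Contrastive-divergence approximation to the likelihood gradient.
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)

# Bottom RBM of the stack: bottom-up weights twice the top-down weights.
n_vis, n_hid, batch = 784, 512, 20            # illustrative sizes
W1 = 0.01 * rng.standard_normal((n_vis, n_hid))
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)
v_batch = sample(rng.random((batch, n_vis)))  # stand-in for real feature data
cd1_step(v_batch, W1, b_v, b_h, up_scale=2.0, down_scale=1.0)
```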
In that bottom machine, the bottom-up weights are always twice as big as their top-down counterparts. This can be justified, and I'll show you the justification in a little while. The next restricted Boltzmann machine in the stack is trained with symmetric weights; I've called them 2W here, rather than W, for reasons you'll see later. We could keep training restricted Boltzmann machines like that, with genuinely symmetric weights. But the top one in the stack has to be trained with the bottom-up weights being half of the top-down weights. So again these are scale-symmetric weights, but now the top-down weights are twice as big as the bottom-up weights. That's the opposite of what we had when we trained the first restricted Boltzmann machine in the stack.

After having trained these three restricted Boltzmann machines, we can combine them to make a composite model, and the composite model looks like this. For the restricted Boltzmann machine in the middle, we simply halve its weights; that's why they were 2W2 to begin with. For the one at the bottom, we halve the up-going weights but keep the down-going weights the same. And for the one at the top, we halve the down-going weights and keep the up-going weights the same.

Now the question is: why do we do this funny business of halving the weights? The explanation is quite complicated, but I'll give you a rough idea of what's going on. Look at the layer h1. We have two different ways of inferring the states of the units in h1 in the stack of restricted Boltzmann machines on the left: we can either infer the states of h1 bottom-up from v, or we can infer the states of h1 top-down from h2. When we combine these Boltzmann machines, what we're going to do is take an average of those two ways of inferring h1, and to take a geometric average, what we need to do is halve the weights. So we're going to use half of what the bottom-up model says, that's half of 2W1, and half of what the top-down model says, that's half of 2W2. And if you look at the deep Boltzmann machine on the right, that's exactly what's being used to infer the state of h1. In other words, if you're given the states in h2 and the states in v, those are the weights you'll use for inferring the states of h1.

The reason we need to halve the weights is so that we don't double-count the evidence. You see, in the Boltzmann machine on the right, the state of h2 already depends on v, at least it does after we've done some settling down in the Boltzmann machine. So if we were to use the full bottom-up input coming from the first restricted Boltzmann machine in the stack and the full top-down input coming from the second Boltzmann machine in the stack, we'd be counting the evidence twice, because we'd be inferring h1 from v and also inferring it from h2, which itself depends on v. In order not to double-count the evidence, we have to halve the weights.

That's a very high-level and perhaps not totally clear description of why we have to halve the weights. If you want to know the mathematical details, you can go and read the paper. But that's what's going on, and that's why we need to halve the weights: so that the intermediate layers can be doing a geometric averaging of the two different models of that layer, from the two different restricted Boltzmann machines in the original stack.
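Here is the same idea as a small sketch, continuing the hypothetical NumPy code from before (again, illustrative names only, not the authors' code). Once the stack is combined, inferring h1 uses the halved weights from both directions, which amounts to averaging the two pretrained models' pre-sigmoid inputs; averaging the logits like this is exactly a geometric average of the two Bernoulli distributions over h1.

```python
def infer_h1(v, h2, W1, W2, b_h1):
    """Mean-field update for the first hidden layer of the combined DBM.
    W1 and W2 are the halved weights (half of the 2*W1 and 2*W2 used when
    pretraining the stack), so the pre-sigmoid input is the average of the
    two pretrained models' logits: a geometric average in probability space."""
    bottom_up = v @ W1      # half of the bottom-up model's contribution
    top_down = h2 @ W2.T    # half of the top-down model's contribution
    return sigmoid(bottom_up + top_down + b_h1)

# Continuing the illustrative setup from the earlier sketch:
n_h1, n_h2 = 512, 256
W2 = 0.01 * rng.standard_normal((n_h1, n_h2))
b_h1 = np.zeros(n_h1)
h2 = sample(rng.random((batch, n_h2)))
mu_h1 = infer_h1(v_batch, h2, W1, W2, b_h1)
```

In a full deep Boltzmann machine, every intermediate layer would be updated this way, alternating the mean-field updates until the states settle down.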