In this video, I'm going to talk about some recent work on learning a joint model of captions and feature vectors that describe images. In the previous lecture, I talked about how we might extract semantically meaningful features from images, but we were doing that with no help from the captions. Obviously, the words in a caption ought to be helpful in extracting appropriate semantic categories from images, and similarly, the images ought to be helpful in disambiguating what the words in the caption mean. So the idea is that we're going to train a great big net that gets as input standard computer vision feature vectors extracted from images and bag-of-words representations of captions, and learns how the two input representations are related to each other. At the end of the video, I'll show you a movie of the final network using words to create feature vectors for images and then showing you the closest image in its database, and also using images to create bags of words.

I'm now going to describe some work by Nitish Srivastava, who's one of the TAs for this course, and Ruslan Salakhutdinov, that will appear shortly. The goal is to build a joint density model of captions and images, except that the images are represented by the features standardly used in computer vision rather than by raw pixels. This needs a lot more computation than building a joint density model of labels and digit images, which we saw earlier in the course.

So what they did was they first trained a multi-layer model of images alone. That is, it's really a multi-layer model of the features they extracted from images using standard computer vision methods. Then, separately, they trained a multi-layer model of the word-count vectors from the captions. Once they had trained both of those models, they added a new top layer that is connected to the top layers of both of the individual models. After that, they used further joint training of the whole system, so that each modality could improve the earlier layers of the other modality.

Instead of using a deep belief net, which is what you might expect, they used a deep Boltzmann machine, in which there are symmetric connections between adjacent layers. The further joint training of the whole deep Boltzmann machine is then what allows each modality to change the feature detectors in the early layers of the other modality. That's the reason they used a deep Boltzmann machine. They could also have used a deep belief net and done generative fine-tuning with contrastive wake-sleep, but the fine-tuning algorithm for deep Boltzmann machines may well work better.

This leaves the question of how they pre-trained the hidden layers of a deep Boltzmann machine, because what we've seen so far in the course is that if you train a stack of restricted Boltzmann machines and combine them into a single composite model, what you get is a deep belief net, not a deep Boltzmann machine. So I'm now going to explain how, despite what I said earlier in the course, you can actually pre-train a stack of restricted Boltzmann machines in such a way that you can then combine them to make a deep Boltzmann machine.

The trick is that the top and the bottom restricted Boltzmann machines in the stack have to be trained with weights that are twice as big in one direction as in the other. So the bottom Boltzmann machine, the one that looks at the visible units, is trained with the bottom-up weights being twice as big as the top-down weights. Apart from that scaling, the weights are symmetric. This is what I call scale-symmetric.
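To make that concrete, here is a minimal NumPy sketch of contrastive-divergence training for such a scale-symmetric RBM. This is an illustration under my own assumptions, not the authors' code: all names and sizes are made up, and the only point is that a single weight matrix W is shared by both directions but multiplied by different scale factors on the bottom-up and top-down passes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    # Sample binary states with the given probabilities.
    return (rng.random(p.shape) < p).astype(p.dtype)

def cd1_step(v0, W, b_vis, b_hid, up_scale, down_scale, lr=0.01):
    """One CD-1 update for a scale-symmetric RBM: the same matrix W is used
    in both directions, scaled by up_scale bottom-up and down_scale top-down."""
    # Positive phase: infer hidden states bottom-up with scaled weights.
    p_h0 = sigmoid(up_scale * (v0 @ W) + b_hid)
    h0 = sample(p_h0)
    # Negative phase: one top-down reconstruction, then another bottom-up pass.
    p_v1 = sigmoid(down_scale * (h0 @ W.T) + b_vis)
    p_h1 = sigmoid(up_scale * (p_v1 @ W) + b_hid)
    # Contrastive-divergence approximation to the likelihood gradient.
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)

# Bottom RBM of the stack: bottom-up weights twice the top-down weights.
n_vis, n_hid, batch = 784, 512, 20            # illustrative sizes
W1 = 0.01 * rng.standard_normal((n_vis, n_hid))
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)
v_batch = sample(rng.random((batch, n_vis)))  # stand-in for real feature data
cd1_step(v_batch, W1, b_v, b_h, up_scale=2.0, down_scale=1.0)
```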
In that bottom machine, the bottom-up weights are always twice as big as their top-down counterparts. This can be justified, and I'll show you the justification in a little while. The next restricted Boltzmann machine in the stack is trained with symmetric weights; I've called them 2W here, rather than W, for reasons you'll see later. We could keep training restricted Boltzmann machines like that, with genuinely symmetric weights. But the top one in the stack has to be trained with the bottom-up weights being half of the top-down weights. So again these are scale-symmetric weights, but now the top-down weights are twice as big as the bottom-up weights. That's the opposite of what we had when we trained the first restricted Boltzmann machine in the stack.

After having trained these three restricted Boltzmann machines, we can combine them to make a composite model, and the composite model looks like this. For the restricted Boltzmann machine in the middle, we simply halve its weights; that's why they were 2W2 to begin with. For the one at the bottom, we halve the up-going weights but keep the down-going weights the same. And for the one at the top, we halve the down-going weights and keep the up-going weights the same.

Now the question is: why do we do this funny business of halving the weights? The explanation is quite complicated, but I'll give you a rough idea of what's going on. Look at the layer h1. We have two different ways of inferring the states of the units in h1 in the stack of restricted Boltzmann machines on the left: we can either infer the states of h1 bottom-up from v, or we can infer the states of h1 top-down from h2. When we combine these Boltzmann machines, what we're going to do is take an average of those two ways of inferring h1, and to take a geometric average, what we need to do is halve the weights. So we're going to use half of what the bottom-up model says, that's half of 2W1, and half of what the top-down model says, that's half of 2W2. And if you look at the deep Boltzmann machine on the right, that's exactly what's being used to infer the state of h1. In other words, if you're given the states in h2 and the states in v, those are the weights you'll use for inferring the states of h1.

The reason we need to halve the weights is so that we don't double-count the evidence. You see, in the Boltzmann machine on the right, the state of h2 already depends on v, at least it does after we've done some settling down in the Boltzmann machine. So if we were to use the full bottom-up input coming from the first restricted Boltzmann machine in the stack and the full top-down input coming from the second Boltzmann machine in the stack, we'd be counting the evidence twice, because we'd be inferring h1 from v and also inferring it from h2, which itself depends on v. In order not to double-count the evidence, we have to halve the weights.

That's a very high-level and perhaps not totally clear description of why we have to halve the weights. If you want to know the mathematical details, you can go and read the paper. But that's what's going on, and that's why we need to halve the weights: so that the intermediate layers can be doing a geometric averaging of the two different models of that layer, from the two different restricted Boltzmann machines in the original stack.
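Here is the same idea as a small sketch, continuing the hypothetical NumPy code from before (again, illustrative names only, not the authors' code). Once the stack is combined, inferring h1 uses the halved weights from both directions, which amounts to averaging the two pretrained models' pre-sigmoid inputs; averaging the logits like this is exactly a geometric average of the two Bernoulli distributions over h1.

```python
def infer_h1(v, h2, W1, W2, b_h1):
    """Mean-field update for the first hidden layer of the combined DBM.
    W1 and W2 are the halved weights (half of the 2*W1 and 2*W2 used when
    pretraining the stack), so the pre-sigmoid input is the average of the
    two pretrained models' logits: a geometric average in probability space."""
    bottom_up = v @ W1      # half of the bottom-up model's contribution
    top_down = h2 @ W2.T    # half of the top-down model's contribution
    return sigmoid(bottom_up + top_down + b_h1)

# Continuing the illustrative setup from the earlier sketch:
n_h1, n_h2 = 512, 256
W2 = 0.01 * rng.standard_normal((n_h1, n_h2))
b_h1 = np.zeros(n_h1)
h2 = sample(rng.random((batch, n_h2)))
mu_h1 = infer_h1(v_batch, h2, W1, W2, b_h1)
```

In a full deep Boltzmann machine, every intermediate layer would be updated this way, alternating the mean-field updates until the states settle down.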