In this video, I'm going to talk about some recent work on learning a joint model of captions and feature vectors that describe images. In the previous lecture, I talked about how we might extract semantically meaningful features from images, but we were doing that with no help from the captions. Obviously the words in a caption ought to be helpful in extracting appropriate semantic categories from images, and similarly, the images ought to be helpful in disambiguating what the words in the caption mean. So the idea is that we're going to train a great big net that gets as its input standard computer vision feature vectors extracted from images and bag-of-words representations of captions, and learns how the two input representations are related to each other. At the end of the video I'll show you a movie of the final network using words to create feature vectors for images and then showing you the closest image in its database, and also using images to create bags of words.

I'm now going to describe some work by Nitish Srivastava, who is one of the TAs for this course, and Ruslan Salakhutdinov, that will appear shortly. The goal is to build a joint density model of captions and of images, except that the images are represented by the features standardly used in computer vision rather than by the raw pixels. This needs a lot more computation than building a joint density model of labels and digit images, which we saw earlier in the course.

So what they did was they first trained a multi-layer model of images alone. That is, it's really a multi-layer model of the features they extracted from images using the standard computer vision features. Then, separately, they trained a multi-layer model of the word-count vectors from the captions. Once they had trained both of those models, they added a new top layer that is connected to the top layers of both of the individual models. After that, they used further joint training of the whole system so that each modality can improve the earlier layers of the other modality.

Instead of using a deep belief net, which is what you might expect, they used a deep Boltzmann machine, where there are symmetric connections between all pairs of adjacent layers. The further joint training of the whole deep Boltzmann machine is then what allows each modality to change the feature detectors in the early layers of the other modality.
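To make the shape of that architecture concrete, here is a minimal numpy sketch of the two modality-specific pathways feeding a shared top layer. The layer sizes, variable names, random stand-in data, and the purely bottom-up pass are illustrative assumptions, not the layout or code from the paper; in the actual model the whole thing is a deep Boltzmann machine with symmetric connections, trained as described next.

```python
# Minimal structural sketch of the multimodal net: an image pathway, a text
# pathway, and a joint top layer connected to the top of both pathways.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Image pathway (standard computer-vision feature vectors in, two hidden layers).
W_img1 = rng.normal(0, 0.01, (1000, 1024))
W_img2 = rng.normal(0, 0.01, (1024, 1024))
# Text pathway (word-count vectors over a vocabulary in, two hidden layers).
W_txt1 = rng.normal(0, 0.01, (2000, 1024))
W_txt2 = rng.normal(0, 0.01, (1024, 1024))
# Joint top layer connected to the top hidden layer of *both* pathways.
W_joint_img = rng.normal(0, 0.01, (1024, 2048))
W_joint_txt = rng.normal(0, 0.01, (1024, 2048))

def joint_top(image_features, word_counts):
    """Bottom-up pass from both modalities into the shared top layer."""
    h_img = sigmoid(sigmoid(image_features @ W_img1) @ W_img2)
    h_txt = sigmoid(sigmoid(word_counts @ W_txt1) @ W_txt2)
    # The joint layer sums input coming from both modalities.
    return sigmoid(h_img @ W_joint_img + h_txt @ W_joint_txt)

# One image-caption pair with random stand-in data.
image_features = rng.normal(size=(1, 1000))
word_counts = rng.poisson(0.05, size=(1, 2000)).astype(float)
print(joint_top(image_features, word_counts).shape)   # (1, 2048)
```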
That's the reason they used a deep Boltzmann machine. They could also have used a deep belief net and done generative fine-tuning with contrastive wake-sleep, but the fine-tuning algorithm for deep Boltzmann machines may well work better.

This leaves the question of how they pretrained the hidden layers of a deep Boltzmann machine, because what we've seen so far in the course is that if you train a stack of restricted Boltzmann machines and combine them together into a single composite model, what you get is a deep belief net, not a deep Boltzmann machine. So I'm now going to explain how, despite what I said earlier in the course, you can actually pretrain a stack of restricted Boltzmann machines in such a way that you can then combine them to make a deep Boltzmann machine.

The trick is that the top and the bottom restricted Boltzmann machines in the stack have to be trained with weights that are twice as big in one direction as in the other. So the bottom Boltzmann machine, the one that looks at the visible units, is trained with the bottom-up weights being twice as big as the top-down weights. Apart from that, the weights are symmetrical. This is what I call scale symmetric: the bottom-up weights are always twice as big as their top-down counterparts. This can be justified, and I'll show you the justification in a little while.

The next restricted Boltzmann machine in the stack is trained with symmetrical weights. I've called them 2W2 here, rather than W2, for reasons you'll see later. We can keep training restricted Boltzmann machines like that, with genuinely symmetrical weights. But then the top one in the stack has to be trained with the bottom-up weights being half of the top-down weights. So again, these are scale symmetric weights, but now the top-down weights are twice as big as the bottom-up weights. That's the opposite of what we had when we trained the first restricted Boltzmann machine in the stack.

After having trained these three restricted Boltzmann machines, we can then combine them to make a composite model, and the composite model looks like this. For the restricted Boltzmann machine in the middle, we simply halve its weights. That's why they were 2W2 to begin with. For the one at the bottom, we've halved the up-going weights but kept the down-going weights the same.
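As a rough illustration of what scale-symmetric training means for the bottom RBM, here is a contrastive-divergence sketch in which a single weight matrix is learned, but the bottom-up pass uses twice those weights while the top-down reconstruction uses them unscaled. The layer sizes, learning rate, and toy data are assumptions; biases, and the mirror-image treatment needed for the top RBM, are omitted for brevity.

```python
# CD-1 sketch of the bottom, scale-symmetric RBM: one matrix W is learned,
# the hidden units see 2*W bottom-up, the visibles see W top-down.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden, lr = 784, 500, 0.01
W = rng.normal(0, 0.01, (n_visible, n_hidden))

def cd1_step(v0, W):
    # Bottom-up: hidden units see twice the weights.
    h0 = sigmoid(v0 @ (2 * W))
    h0_sample = (rng.random(h0.shape) < h0).astype(float)
    # Top-down reconstruction: visible units see the unscaled weights.
    v1 = sigmoid(h0_sample @ W.T)
    h1 = sigmoid(v1 @ (2 * W))
    # Standard contrastive-divergence weight update.
    return W + lr * (v0.T @ h0 - v1.T @ h1) / v0.shape[0]

# Toy "data": random binary vectors standing in for the visible inputs.
batch = (rng.random((32, n_visible)) < 0.1).astype(float)
W = cd1_step(batch, W)
```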
And for the one at the top, we've halved the down-going weights and kept the up-going weights the same.

Now the question is: why do we do this funny business of halving the weights? The explanation is quite complicated, but I'll give you a rough idea of what's going on. If you look at the layer H1, we have two different ways of inferring the states of the units in H1 in the stack of restricted Boltzmann machines on the left. We can either infer the states of H1 bottom-up from V, or we can infer the states of H1 top-down from H2. When we combine these Boltzmann machines together, what we're going to do is take an average of those two ways of inferring H1. And to take a geometric average, what we need to do is halve the weights. So we're going to use half of what the bottom-up model says, that's half of 2W1, and we're going to use half of what the top-down model says, that's half of 2W2. And if you look at the deep Boltzmann machine on the right, that's exactly what's being used to infer the state of H1. In other words, if you're given the states in H2 and you're given the states in V, those are the weights you'll use for inferring the states of H1.

The reason we need to halve the weights is so that we don't double count. You see, in the Boltzmann machine on the right, the state of H2 already depends on V, at least it does after we've done some settling down in the Boltzmann machine. So if we were to use the bottom-up input coming from the first restricted Boltzmann machine in the stack, and we also used the top-down input coming from the second Boltzmann machine in the stack, we'd be counting the evidence twice, because we'd be inferring H1 from V, and we'd also be inferring it from H2, which itself depends on V. In order not to double count the evidence, we have to halve the weights.

That's a very high-level and perhaps not totally clear description of why we have to halve the weights. If you want to know the mathematical details, you can go and read the paper. But that's what's going on. And that's why we need to halve the weights: so that the intermediate layers can be doing geometric averaging of the two different models of that layer, from the two different restricted Boltzmann machines in the original stack.
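The halving argument can be checked numerically. In the sketch below, the bottom-up inference of H1 through weights 2W1 and the top-down inference through weights 2W2 are geometrically averaged by averaging their logits, and the result is identical to the combined machine's conditional for H1 with the halved weights. The layer sizes and random states are stand-ins; only the halving scheme follows the lecture.

```python
# Numerical check: geometric averaging of the two ways of inferring H1
# equals using the halved weights in the deep Boltzmann machine.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# DBM weights after halving, in the lecture's notation:
W1 = rng.normal(0, 0.01, (784, 500))   # V  -- H1 (bottom RBM had 2*W1 up)
W2 = rng.normal(0, 0.01, (500, 500))   # H1 -- H2 (middle RBM had 2*W2)

v  = (rng.random((1, 784)) < 0.5).astype(float)   # a visible state
h2 = (rng.random((1, 500)) < 0.5).astype(float)   # a state of H2

# The stack's two separate ways of inferring H1:
logit_up   = v @ (2 * W1)       # bottom-up, through the bottom RBM
logit_down = h2 @ (2 * W2).T    # top-down, through the middle RBM

# Geometric average of the two Bernoulli predictions = average of the logits,
# which is exactly the DBM conditional for H1 given V and H2 with halved weights.
p_geometric_avg = sigmoid(0.5 * logit_up + 0.5 * logit_down)
p_dbm           = sigmoid(v @ W1 + h2 @ W2.T)

assert np.allclose(p_geometric_avg, p_dbm)
```

The equality is exact for binary units because renormalizing the geometric mean of two Bernoulli distributions amounts to averaging their logits, which is what the halved weights implement.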