In this video I'm going to describe how to use an RBM to model real-valued data. The idea is that we make the visible units, instead of being binary stochastic units, be linear units with Gaussian noise. When we do this we get problems with learning, and it turns out a good solution to those problems is to then make the hidden units be rectified linear units. With linear Gaussian units for the visibles and rectified linear units for the hiddens, it's quite easy to learn a restricted Boltzmann machine that makes a good model of real-valued data.

We first used restricted Boltzmann machines on images of handwritten digits. For those images, intermediate intensities caused by a pixel being only partially inked can be modelled quite well by probabilities, that is, numbers between zero and one that are the probability of a logistic unit being on. So we treat partially inked pixels as having a probability of being inked. This is incorrect, but it works quite well.

However, it won't work for real images. In a real image the intensity of a pixel is almost always almost exactly the average of its neighbours. So it has a very high probability of being very close to that average and a very small probability of being a little further away. You can't achieve that with a logistic unit: mean-field logistic units are unable to represent things like "the intensity is very probably 69, but very unlikely to be 71 or 67". So we need some other kind of unit.

The obvious thing to use is a linear unit with Gaussian noise, so we model pixels as Gaussian variables. We can still use alternating Gibbs sampling to run the Markov chain required for contrastive divergence learning, but we need to use a much smaller learning rate, otherwise the learning tends to blow up.

The energy function looks like this. The first term on the right-hand side is a kind of parabolic containment function; it stops things blowing up. The term in that sum contributed by the i-th visible unit is parabolic in shape: a parabola with its minimum at the bias of the i-th unit, and as v_i departs from that value we add energy quadratically. So that term tries to keep the i-th visible unit close to b_i. The interaction term between the visible and the hidden units looks like this.
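For reference, here is a reconstruction of the energy function being described, written in the standard Gaussian-binary RBM form (visible units v_i with biases b_i and standard deviations sigma_i, hidden units h_j with biases b_j, weights w_ij). It is pieced together from the description above rather than copied from the slide:

```latex
E(\mathbf{v},\mathbf{h}) =
    \sum_{i \in \mathrm{vis}} \frac{(v_i - b_i)^2}{2\sigma_i^2}
  \;-\; \sum_{j \in \mathrm{hid}} b_j h_j
  \;-\; \sum_{i,j} \frac{v_i}{\sigma_i}\, h_j\, w_{ij}
```

The first sum is the parabolic containment term and the last sum is the visible-hidden interaction term discussed next.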
If you differentiate that interaction term with respect to v_i, you get a constant: the sum over all j of h_j w_ij, divided by sigma_i. So that term, with its constant gradient, looks like this. When you add together that top-down contribution to the energy, which is linear, and the parabolic containment function, you get a parabolic function, but with the mean shifted away from b_i. How much it's shifted depends on the slope of that blue line. So the effect of the hidden units is just to push the mean to one side.

It's easy to write down an energy function like this, and it's easy to take derivatives of it. But when we try learning with it, we often get problems. There were a lot of reports in the literature that people could not get these Gaussian-binary RBMs to work, and it is indeed extremely hard to learn tight variances for the visible units. It took us a long time to figure out why it's so hard to learn those visible variances. This picture helps.

Consider the effect that visible unit i has on hidden unit j. When visible unit i has a small standard deviation sigma_i, that has the effect of exaggerating the bottom-up weights, because we need to measure the activity of i in units of its standard deviation: when the standard deviation is small, we multiply the weight by a lot. If you look at the top-down effect of j on i, that is multiplied by sigma_i. So when the standard deviation of visible unit i is very small, the bottom-up effects get exaggerated and the top-down effects get attenuated. The result is a conflict: either the bottom-up effects are much too big or the top-down effects are much too small. The hidden units then tend to saturate and be firmly on or off all the time, and this messes up learning.

The solution is to have many more hidden units than visible units. That allows small weights between the visible and hidden units to have big top-down effects, because there are so many hidden units. But of course, we really need the number of active hidden units to change as that standard deviation sigma_i gets smaller. On the next slide we'll see how we can achieve that.
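To make the scaling conflict concrete, here is a minimal NumPy sketch of one contrastive-divergence step (CD-1) for a Gaussian-binary RBM; the array names, shapes, and the single Gibbs step are my own illustrative choices, not code from the course:

```python
import numpy as np

rng = np.random.default_rng(0)

n_vis, n_hid = 16, 64           # many more hidden units than visible units
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_vis = np.zeros(n_vis)         # visible biases b_i
b_hid = np.zeros(n_hid)         # hidden biases b_j
sigma = np.full(n_vis, 0.1)     # small visible standard deviations

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v_data, lr=1e-4):  # note the much smaller learning rate
    # Bottom-up: visible activity is measured in units of sigma_i, so a small
    # sigma_i exaggerates the effective bottom-up input v_i * w_ij / sigma_i.
    p_h_data = sigmoid(b_hid + (v_data / sigma) @ W)
    h_sample = (rng.random(n_hid) < p_h_data).astype(float)

    # Top-down: the reconstruction mean is b_i shifted by sigma_i * sum_j h_j w_ij,
    # so the same small sigma_i attenuates the top-down effect.
    v_recon = b_vis + sigma * (W @ h_sample) + sigma * rng.standard_normal(n_vis)
    p_h_recon = sigmoid(b_hid + (v_recon / sigma) @ W)

    # CD-1 weight update: <(v/sigma) h>_data - <(v/sigma) h>_reconstruction
    return W + lr * (np.outer(v_data / sigma, p_h_data)
                     - np.outer(v_recon / sigma, p_h_recon))

W = cd1_step(rng.standard_normal(n_vis))   # one update on one real-valued "image"
```

With sigma at 0.1, the bottom-up input is scaled up tenfold while the top-down shift of the reconstruction mean is scaled down tenfold, which is exactly the conflict described above.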
I'm going to introduce stepped sigmoid units. The idea is that we make many copies of each stochastic binary hidden unit. All the copies have the same weights and the same learned bias b, but in addition to that adaptive bias they each have a different fixed offset to the bias. The first copy has an offset of -0.5, the second has an offset of -1.5, the third has an offset of -2.5, and so on. If you have a whole family of sigmoid units like that, with the bias differing by one between neighbouring members of the family, the response curve looks like this: if the total input is very low, none of them are turned on; as it increases, the number that get turned on increases linearly. This means that as the standard deviation on the previous slide gets smaller, the number of copies of each hidden unit that get turned on gets bigger, and we achieve just the effect we wanted: we get more top-down effect to drive the visible units that have small standard deviations.

Now, it's quite expensive to use a big population of binary stochastic units with offset biases, because for each one of them we need to put the total input through the logistic function. But we can make some fast approximations which work just as well. The sum of the activities of a whole bunch of sigmoid units with offset biases, shown in that summation, is approximately equal to log(1 + e^x), and that in turn is approximately equal to the maximum of zero and x; we can add some noise to x if we want. So the first term in the equation looks like this, the second term looks like that, and you can see that the sum of all those sigmoids in the first term gives a curve like that. We can approximate that curve by a linear threshold unit that has a value of zero unless its input is above threshold, in which case its value increases linearly with its input.

Contrastive divergence learning works well for the sum of a bunch of stochastic logistic units with offset biases; in that case you get a noise variance that's equal to the logistic function of the output of that sum. Alternatively, we can use that green curve and use rectified linear units. They're much faster to compute, because you don't need to go through the logistic function many times, and contrastive divergence works just fine with those.
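Here is a small NumPy check, purely my own illustration, that the sum of logistic units with bias offsets of -0.5, -1.5, -2.5, ... is close to log(1 + e^x), which in turn is close to max(0, x):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)                 # total input to the unit

# Sum over many copies of the unit, each with a fixed bias offset -0.5, -1.5, -2.5, ...
offsets = -(np.arange(100) + 0.5)
stepped = sigmoid(x[:, None] + offsets).sum(axis=1)

softplus = np.log1p(np.exp(x))                 # log(1 + e^x)
relu = np.maximum(0.0, x)                      # max(0, x)

print(np.max(np.abs(stepped - softplus)))      # tiny: the two curves nearly coincide
print(np.max(np.abs(softplus - relu)))         # about log(2), the gap at x = 0
```

The second approximation is what lets us replace the whole population of copies with a single rectified linear unit.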
One nice property of rectified linear units is that if they have a bias of zero, they exhibit scale equivariance. This is a very nice property to have for images. What scale equivariance means is that if you take an image x and multiply all the pixel intensities by a scalar a, then the representation of ax in the rectified linear units is just a times the representation of x. In other words, when we scale up all the intensities in the image, we scale up the activities of all the hidden units, but all the ratios stay the same. Rectified linear units aren't fully linear, because if you add together two images, the representation you get is not the sum of the representations of the two images taken separately.

This property of scale equivariance is quite similar to the property of translational equivariance that convolutional nets have. If we ignore pooling for now, then in a convolutional net, if we shift an image and look at the representation, the representation of the shifted image is just a shifted version of the representation of the unshifted image. So in a convolutional net without pooling, translations of the input just flow through the layers of the net without really affecting anything; the representation at every layer is just translated.
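A tiny NumPy check of the scale-equivariance claim, using made-up random weights rather than anything from the course: with zero biases, rectifying a scaled input gives exactly the scaled representation, but rectifying a sum of inputs does not give the sum of the representations.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 64))     # weights from 64 "pixels" to 8 rectified linear units
relu = lambda z: np.maximum(0.0, z)  # zero bias, so the representation is relu(W @ image)

x = rng.standard_normal(64)          # an image
y = rng.standard_normal(64)          # another image
a = 3.7                              # a positive scale factor

# Scale equivariance: the representation of a*x is a times the representation of x.
print(np.allclose(relu(W @ (a * x)), a * relu(W @ x)))              # True

# Not fully linear: the representation of x + y is not the sum of the representations.
print(np.allclose(relu(W @ (x + y)), relu(W @ x) + relu(W @ y)))    # almost always False
```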