In this video, I'm going to show a simple example of a restricted Boltzmann machine learning a model of images of handwritten twos. After it's learned the model, we'll look at how good it is at reconstructing twos, and we'll look at what happens if we give it a different kind of digit and ask it to reconstruct that. We'll also look at the weights we get if we train a considerably larger restricted Boltzmann machine on all of the digit classes. It learns a wide variety of features which, between them, are very good at reconstructing all the different classes of digits, and which are also quite a good model of those digit classes. That is, if you give it a binary vector that is an image of a handwritten digit, the model will be able to find low-energy states compatible with that image; and if you give it an image that's a long way from being an image of a handwritten digit, the model will not be able to find low-energy states compatible with that image.

I'm now going to show how a relatively simple RBM can learn to build a model of images of the digit two. The images are sixteen pixels by sixteen pixels, and it has 50 binary hidden units that are going to learn to become interesting feature detectors. When it's presented with a data case, the first thing it does is use the weights on the connections from pixels to features to activate the features. That is, for each of the binary hidden neurons, it makes a stochastic decision about whether it should adopt the state one or zero. It then uses that binary pattern of activation to reconstruct the data: for each pixel, it makes a binary decision about whether it should be a one or a zero. It then reactivates the binary feature detectors, using the reconstruction rather than the data to activate them.

The weights are changed by incrementing the weight between an active pixel and an active feature detector when the network is looking at data; that lowers the energy of the global configuration consisting of the data and whatever hidden pattern went with it. The weight between an active pixel and an active feature detector is decremented when the network is looking at a reconstruction; that raises the energy of the reconstruction.
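The update just described is one step of contrastive divergence with a single reconstruction (CD-1). Here is a minimal NumPy sketch of one such step for a single binary training vector, sized for the demo's 16x16 images and 50 hidden units. The bias terms, the learning rate, and all of the names (W, vis_bias, hid_bias, cd1_step) are my own illustrative assumptions, not anything specified in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Small random initial weights; biases are an added assumption.
n_pixels, n_hidden = 16 * 16, 50
W = 0.01 * rng.standard_normal((n_pixels, n_hidden))
vis_bias = np.zeros(n_pixels)
hid_bias = np.zeros(n_hidden)

def cd1_step(v_data, W, vis_bias, hid_bias, learning_rate=0.1):
    """One CD-1 update for a single binary data vector v_data of length n_pixels."""
    # Activate the hidden feature detectors from the data: each hidden unit
    # makes a stochastic decision to adopt the state 1 or 0.
    h_prob = sigmoid(v_data @ W + hid_bias)
    h_data = (rng.random(h_prob.shape) < h_prob).astype(float)

    # Reconstruct the pixels from the binary hidden pattern.
    v_prob = sigmoid(h_data @ W.T + vis_bias)
    v_recon = (rng.random(v_prob.shape) < v_prob).astype(float)

    # Reactivate the hidden units, now driven by the reconstruction.
    h_recon = sigmoid(v_recon @ W + hid_bias)

    # Increment weights for (active pixel, active feature) pairs seen on the data,
    # decrement them for pairs seen on the reconstruction.
    W += learning_rate * (np.outer(v_data, h_data) - np.outer(v_recon, h_recon))
    vis_bias += learning_rate * (v_data - v_recon)
    hid_bias += learning_rate * (h_data - h_recon)
    return v_recon
```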
Near the beginning of learning, when the weights are random, the reconstructions will almost certainly have lower energy than the data. That's because the reconstruction is what the network likes to reproduce on the visible units given the hidden pattern of activity, and obviously it likes to reproduce patterns that have low energy according to its energy function. You can think of what learning does as changing the weights so that the data has low energy and the reconstructions of the data generally have higher energy.

So let's start with some random weights for the 50 feature detectors. We'll use small random weights, and each of these squares shows you the weights to the pixels coming from a particular feature detector. The small random weights are used to break symmetry; actually, because the updates are stochastic, we don't really need that. After seeing a few hundred examples of digits and adjusting the weights a few times, the weights are beginning to form patterns. If we do it again, you can see that many of the feature detectors are detecting the pattern of a whole two; they're fairly global feature detectors. Those feature detectors are getting stronger and stronger, and now some of them begin to localize, getting more and more local. These are the final weights, and you can see that each neuron has become a different feature detector, and most of the feature detectors are fairly local. If you look at the feature detector in the red box, for example, it's detecting the top of a two: it's happy when the top of a two is where its white pixels are and there's nothing where its black pixels are. So it's representing where the top of the two is.

Once we've learned the model, we can look at how well it reconstructs digits, and we'll give it some test digits that it hasn't seen before. We'll start by giving it a test example of a two, and its reconstruction is pretty faithful to the test example. It's slightly blurry: the test example has a hook at the top, and that's been blurred in the reconstruction, but it's a pretty good reconstruction. The more interesting thing we can do is give it a test example from a different digit class. If we give it an example of a three to reconstruct, what it reconstructs actually looks more like a two than like a three.
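A sketch of this reconstruction test, under the same assumed names as the previous block: push a held-out image up through the learned features and back down to the pixels. The free_energy function is an addition of mine rather than something shown in the lecture; it is the standard free energy of a binary RBM, and lower values correspond to the model finding low-energy states compatible with the image, so a test two should score lower than a three under the twos model.

```python
def reconstruct(v_test, W, vis_bias, hid_bias):
    # Stochastic hidden activation from the test image, then pixel probabilities.
    h_prob = sigmoid(v_test @ W + hid_bias)
    h = (rng.random(h_prob.shape) < h_prob).astype(float)
    return sigmoid(h @ W.T + vis_bias)  # threshold or sample to get a binary image

def free_energy(v, W, vis_bias, hid_bias):
    # F(v) = -a'v - sum_j log(1 + exp(b_j + (W'v)_j)); lower means a better fit.
    return -(v @ vis_bias) - np.sum(np.logaddexp(0.0, v @ W + hid_bias))
```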
All of the feature detectors it's learned are good for representing twos, but it doesn't have feature detectors for things like the cusp in the middle of the three. So it ends up reconstructing something that obeys the regularities of a two better than it represents the regularities of a three. In fact, the network tries to see everything as a two.

Here are some feature detectors that were learned in the first hidden layer of a model that uses 500 hidden units to model all ten digit classes. This model has been trained for a long time with contrastive divergence, and it has a big variety of feature detectors. If you look at the one in the blue box, that's obviously going to be useful for detecting things like eights. If you look at the one in the red box, it's not what you'd expect to see: it likes to see pixels that are on very near the bottom, and it really doesn't like to see pixels that are on in a row 21 pixels above the bottom. What's going on here is that the data is normalized so that the digits can't have a height of more than twenty pixels, and that means that if you know there's a pixel on where those big positive weights are, there can't possibly be a pixel on where those negative weights are. So this feature is picking up on a long-range regularity that was introduced by the way we normalized the data. Here's another one that's doing the same thing for the fact that the data can't be wider than twenty pixels.

The feature detector in the green box is very interesting. It's for detecting where the bottom of a vertical stroke comes: it will detect it in a number of different positions, and then refuse to detect it in the intermediate positions. So it's very like one of the less significant digits in a binary number: as you increase the magnitude of the number, it goes on again and off again, and on again and off again. It shows that the network is developing quite complex ways of representing where things are.
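For reference, here is one hypothetical way to produce weight pictures like the ones in the video, where each small square shows the incoming weights of one hidden unit reshaped back into an image (matplotlib is my choice here; the lecture doesn't say how its figures were made).

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_feature_detectors(W, img_shape=(16, 16), n_cols=10):
    """Draw each column of W (one hidden unit's incoming weights) as a small image tile."""
    n_hidden = W.shape[1]
    n_rows = int(np.ceil(n_hidden / n_cols))
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols, n_rows))
    for j, ax in enumerate(np.ravel(axes)):
        ax.axis("off")
        if j < n_hidden:
            # Light pixels are strong positive weights, dark pixels strong negative ones.
            ax.imshow(W[:, j].reshape(img_shape), cmap="gray")
    plt.show()
```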