In this video, I'm going to talk about alternative pre-training methods for learning deep neural nets. I introduced pre-training using restricted Boltzmann machines trained with contrastive divergence, but after that, people discovered there are many other ways to pre-train layers of features. And indeed, if you initialize the weights correctly, you may not need pre-training at all, provided you have enough labeled data.

We've seen some of the neat things that can be done with the codes produced by deep auto-encoders. I now want to consider shallow auto-encoders that have just one hidden layer. Restricted Boltzmann machines can be viewed as a kind of shallow auto-encoder, particularly if they're trained with contrastive divergence, because then they're trying to make the reconstructions look like the data. Unlike an ordinary auto-encoder, though, a restricted Boltzmann machine has very strong regularization, because the hidden units are only allowed to have binary activities, and this restricts their capacity a lot. If we train restricted Boltzmann machines with maximum likelihood, they're not at all like auto-encoders. One way to see that is that if you had a pixel that was pure noise, an auto-encoder would try to reconstruct whatever noise value it had, whereas a restricted Boltzmann machine trained with maximum likelihood would completely ignore that pixel and model it just using the bias for that input.

So, since we can view a restricted Boltzmann machine as a kind of strongly regularized auto-encoder, maybe we can replace the RBMs that we use for pre-training with a stack of auto-encoders. It turns out that if you do that, pre-training is not as effective, at least if you use shallow auto-encoders that are regularized just by penalizing the squared weights. So stacking these auto-encoders doesn't work as well as stacking restricted Boltzmann machines.

However, there's a different kind of auto-encoder that does work as well, and that's the denoising auto-encoder, which has been studied extensively by the group in Montreal. Denoising auto-encoders work by adding noise to each input vector, setting many of its components to zero, but different components for different input vectors. This resembles dropout, but for the inputs rather than the hidden units. The denoising auto-encoder is still required to reconstruct the inputs that have been set to zero, so it can't just copy its input.
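As a concrete illustration of the idea just described, here is a minimal numpy sketch of one pass through a denoising auto-encoder. The layer sizes, the 30% corruption level, the tied weights, and the squared-error reconstruction objective are assumptions made for the example, not details taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 784 inputs (e.g. pixels), 256 hidden units, tied weights.
n_vis, n_hid = 784, 256
W = rng.normal(0.0, 0.01, size=(n_vis, n_hid))
b_hid, b_vis = np.zeros(n_hid), np.zeros(n_vis)

def denoising_pass(x, corruption=0.3):
    """Corrupt the input, encode it, and score the reconstruction
    against the *uncorrupted* input, so copying the input is impossible."""
    mask = rng.random(n_vis) > corruption   # a different subset for each call
    x_tilde = x * mask                      # many components set to zero
    h = sigmoid(x_tilde @ W + b_hid)        # hidden code
    x_hat = sigmoid(h @ W.T + b_vis)        # reconstruction
    return np.mean((x_hat - x) ** 2)        # objective we can print directly

print(denoising_pass(rng.random(n_vis)))
```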
The danger with a shallow auto-encoder is that if you give it enough hidden units, it might just copy each pixel to one hidden unit, and then reconstruct that pixel from that hidden unit. A denoising auto-encoder clearly can't do that, so it has to use hidden units that capture correlations between inputs, so that it can use the values of some inputs to help it reconstruct the inputs that have been zeroed out.

If we use a stack of denoising auto-encoders, pre-training is very effective. There are some cases in which RBMs still work better, but in most cases denoising auto-encoders are more effective. It's also much simpler to evaluate the pre-training using a denoising auto-encoder, because we can easily compute the value of the objective function. When we pre-train a restricted Boltzmann machine with contrastive divergence, we can't compute the value of the real objective function we're trying to minimize, so we often just use the squared reconstruction error, which is not actually what's being minimized. In a denoising auto-encoder, we can print out the value of the thing we're trying to minimize, and that's very helpful. One disadvantage of the denoising auto-encoder is that it lacks the nice variational bound we get with restricted Boltzmann machines, but that's only of theoretical interest, because the bound only applies if the restricted Boltzmann machine is trained with maximum likelihood.

Yet another kind of auto-encoder is the contractive auto-encoder, which was also developed by the group in Montreal. The way this works is that we try to make the hidden activities as insensitive as possible to the inputs. Of course, the hidden units can't just ignore the inputs altogether, because they have to be able to reconstruct them. The way we achieve this insensitivity is by penalizing the squared gradient of each hidden activity with respect to each input. So we try to make each hidden unit not change much if we change an input value.

Contractive auto-encoders also work very well for pre-training. Their codes tend to have the property that only a small subset of the hidden units are in their sensitive range; for different parts of the input space it's a different subset, so this active set acts like a sparse code. The other hidden units are saturated and insensitive. RBMs actually behave very similarly: after they've been trained, many of the hidden units are saturated, and the working set of unsaturated units is different for different training cases.
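For the contractive penalty just described, a small sketch may help. It computes the sum of squared partial derivatives of every hidden activity with respect to every input, assuming logistic hidden units; the weight shapes, the `lam` coefficient, and the function name are made up for this example, and in practice the term is added to the reconstruction error.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b, lam=0.1):
    """Sum over hidden units j and inputs i of (dh_j/dx_i)**2, scaled by lam.

    For logistic hidden units h = sigmoid(x @ W + b), the partial derivative
    is dh_j/dx_i = h_j * (1 - h_j) * W[i, j], so the penalty factorizes.
    """
    h = sigmoid(x @ W + b)
    return lam * np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=0))

# Toy example: 6 inputs, 4 hidden units, random weights.
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.5, size=(6, 4))
print(contractive_penalty(rng.random(6), W, np.zeros(4)))
```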
I want to finish by summarizing my current view of pre-training. There are now many different ways to do layer-by-layer pre-training that discovers good features. When our data set does not have a huge number of labels, this way of discovering features before you ever use the labels is very helpful for the subsequent discriminative fine-tuning. It discovers the features without using the information in the labels, and then the information in the labels is used for fine-tuning the decision boundaries between classes. It's especially useful if we have a lot of unlabeled data, so that the pre-training can do a very good job of discovering interesting features using a lot of data.

For very large labeled data sets, however, initializing the weights that are going to be used for supervised learning by using unsupervised pre-training is not necessary, even if the nets are deep. Pre-training was the first good way to initialize the weights for deep nets, but now we have lots of other ways. However, even if we have a lot of labels, if we make the nets much larger again, we'll need pre-training again.

So an argument I often have with people from Google is that they say, we've got lots and lots of labeled data, so we don't need regularization methods; our nets won't overfit anyway, because we've got so much data. The counter-argument is, that's only because you're using nets that are much too small. You should use much, much bigger nets on much, much more powerful computers, and then you'll start overfitting again and you'll need these regularization methods, like dropout and pre-training. If you ask which regime the brain is in, the brain is clearly in the regime where it has a huge number of parameters compared with the amount of data it gets. And so, for the brain at least, regularization methods are very important.
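To tie this summary back to the earlier sketches, here is a hedged outline of greedy layer-by-layer pre-training with denoising auto-encoders, trained by plain stochastic gradient descent on made-up data. Every size, learning rate, and epoch count is an assumption for illustration; the labels would only come in afterwards, during discriminative fine-tuning, which is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_denoising_layer(data, n_hid, epochs=5, lr=0.1, corruption=0.3):
    """Train one tied-weight denoising auto-encoder layer with plain SGD.

    `data` has shape (n_cases, n_vis). Returns the encoder weights and
    hidden biases, plus the hidden codes for the whole data set, which
    become the "data" for the next layer in the stack.
    """
    n_vis = data.shape[1]
    W = rng.normal(0.0, 0.01, size=(n_vis, n_hid))
    b = np.zeros(n_hid)          # hidden biases
    c = np.zeros(n_vis)          # reconstruction biases
    for _ in range(epochs):
        for x in data:
            mask = rng.random(n_vis) > corruption
            x_tilde = x * mask                       # zero out some inputs
            h = sigmoid(x_tilde @ W + b)             # code
            x_hat = sigmoid(h @ W.T + c)             # reconstruction
            # Backprop of the squared error through the tied weights.
            d_out = 2.0 * (x_hat - x) / n_vis * x_hat * (1.0 - x_hat)
            d_hid = (d_out @ W) * h * (1.0 - h)
            W -= lr * (np.outer(d_out, h) + np.outer(x_tilde, d_hid))
            c -= lr * d_out
            b -= lr * d_hid
    codes = sigmoid(data @ W + b)
    return W, b, codes

# Greedy layer-by-layer pre-training on made-up unlabeled data; the labels
# would only be used afterwards, for discriminative fine-tuning (not shown).
data = rng.random((100, 20))
layers = []
for n_hid in (16, 8):
    W, b, data = pretrain_denoising_layer(data, n_hid)
    layers.append((W, b))
```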