In this video I'm going to show how we can first learn a deep belief net by stacking up restricted Boltzmann machines, and then treat that as a deep neural net that we fine-tune discriminatively. So instead of fine-tuning it to be better at generation, as we did in the previous video, we're going to fine-tune it to be better at discriminating between classes. This works very well and led to a big renewal of interest in neural networks. In speech recognition it has had a major influence, and many leading groups are now switching to deep neural nets in order to reduce the error rate.

I now want to talk about fine-tuning these deep networks to be better at discrimination. We first learn one layer of features at a time, by stacking up restricted Boltzmann machines. Then we treat this as pre-training that finds a good initial set of weights in the deep network, and we fine-tune those weights using some local search procedure. In the previous video I showed you how to use contrastive wake-sleep to fine-tune a deep network so that it was better at generating its inputs. In this video we're going to use back propagation to fine-tune the model to be better at discrimination. Doing this overcomes many of the standard limitations of back propagation: it makes it much easier to learn deep nets, and it makes those nets generalize better.

We need to understand why back propagation works better when we pre-train the weights. There are really two effects: an effect on optimization and an effect on generalization. The pre-training scales really well to big networks, especially if each layer has locality. If we're doing vision, for example, and we have local receptive fields in each layer, then there's not much interaction between widely separated locations, and so it's fairly easy to learn a big layer more or less in parallel. When we do pre-training, we don't start back propagation until we've already learned sensible feature detectors, and these feature detectors should be very helpful for discrimination. So the initial gradients are much more sensible than they would be with random weights, and back propagation doesn't need to do a global search; it just needs to do a local search from a sensible starting point.
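As a concrete illustration of this pipeline, here is a minimal sketch in Python/NumPy (not code from the course): each restricted Boltzmann machine is trained with one step of contrastive divergence (CD-1) on the activities of the layer below, and the resulting weights then serve as the initial weights of a deep net. The toy data, layer sizes, learning rate, and number of epochs are all illustrative assumptions.

```python
# Minimal sketch: greedy layer-by-layer pre-training with RBMs trained by CD-1.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Train one RBM with CD-1; return (weights, hidden biases)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_visible)   # visible biases, only needed for generation
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W + b_h)                       # positive phase
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float) # sample hidden states
        p_v1 = sigmoid(h0 @ W.T + b_v)                     # reconstruction
        p_h1 = sigmoid(p_v1 @ W + b_h)
        # CD-1 update: positive statistics minus reconstruction statistics
        W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(data)
        b_h += lr * (p_h0 - p_h1).mean(axis=0)
        b_v += lr * (v0 - p_v1).mean(axis=0)
    return W, b_h

def pretrain_stack(data, layer_sizes):
    """Each RBM models the hidden activities of the RBM below it."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W, b = train_rbm(x, n_hidden)
        weights.append((W, b))
        x = sigmoid(x @ W + b)   # data for the next RBM in the stack
    return weights

# Toy binary "images" standing in for MNIST pixels; layer sizes are assumptions.
X = (rng.random((200, 784)) < 0.1).astype(float)
stack = pretrain_stack(X, [500, 500, 2000])
# `stack` now holds the initial weights of a deep net that gets fine-tuned
# discriminatively with back propagation (sketched further below).
```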
In addition to being easier to optimize, pre-trained nets exhibit much less overfitting. That's because most of the information in the final weights comes from modeling the distribution of the input vectors, and those input vectors, if you're dealing with something like images, generally contain a lot more information than the labels. A label typically contains only a few bits of information to constrain the mapping from input to output, whereas an image contains a lot of information, which will constrain any generative model of a set of images. The information in the labels is only used for the final fine-tuning, and because by that stage we've already decided on the feature detectors, we're not squandering that precious information designing feature detectors from scratch. The fine-tuning only makes slight changes to the feature detectors we learned in the generative pre-training phase, and those are the changes required to get the category boundaries in the right place. The important thing is that back propagation is not being required to discover new features, and so it doesn't need nearly as much labeled data. In fact, this type of learning works well when most of the data is unlabeled, because the generative pre-training can make use of the unlabeled data, which is still very useful for discovering good features.

There is an obvious objection to this type of learning, which is that when we do generative pre-training, we'll be learning lots of features that are useless for the particular discriminative task we want the net to do. Consider, for example, that you might want the net to discriminate between shapes, or you might want it to discriminate between different poses of one shape. Those tasks need very different features, and if you don't know the task in advance, you'll inevitably learn features that are never used. When computers were much smaller, that was a serious objection, but now that computers are big enough we can afford to learn features that are never used. We can afford it because, among all the features we learn, there will be some that are much more useful than the raw inputs, and that more than makes up for the fact that we have also learned some features that aren't helpful for the particular task we're interested in.
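To put rough numbers on the point above about labels carrying only a few bits (these are my own back-of-the-envelope figures, not the lecture's): a 10-way label supplies at most log2(10) bits per case, whereas even a crudely binarised 28x28 image could carry up to 784 bits.

```python
import math

label_bits = math.log2(10)         # at most ~3.3 bits from a 10-way label
image_bits_upper_bound = 28 * 28   # 1 bit per pixel for a binarised 28x28 image
print(f"label: {label_bits:.1f} bits; image (crude upper bound): {image_bits_upper_bound} bits")
```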
So let's apply this to modeling the MNIST digits. We'll now learn three hidden layers of features, entirely unsupervised. Once we've done that learning, when we generate from the model it will generate things that look like real digits, and it will generate them from all the different classes. It will typically take a while before it switches from one class to another, because it will tend to stay in the same ravine for a while before it jumps to another ravine. But the question is: are the features that we've learned that way useful for doing discrimination? All we need to do is add a final 10-way softmax at the top, fine-tune it with back propagation, and see if we do better than purely discriminative training.

So here are the results on the permutation-invariant MNIST task. What I mean by permutation-invariant is that if we were to apply a fixed random permutation to all the pixels, the same permutation to every test and training case, the results of our algorithm wouldn't change. That's clearly not true for something like a convolutional net; a convolutional net has been told something about the task. By applying this fixed permutation, we destroy all simple ways of telling the net something about the spatial nature of the task.

If you apply standard back propagation, it's hard to do better than 1.6% errors. John Platt and I have both tried quite hard, applying standard back propagation with various different architectures, and we're both quite good at doing it. You can actually beat 1.6% by using constraints on the incoming weight vectors of the hidden units: if you use an appropriate restriction on the length of an incoming weight vector, you can do a bit better than 1.6%. Support vector machines can get 1.4%, and this was one of the pieces of evidence that led to support vector machines supplanting back propagation. If you pre-train a network using a stack of Boltzmann machines and then fine-tune it to be a better generative model of the joint density of digit images and labels, you can get down to 1.25%. If you train a stack of Boltzmann machines, simply put a 10-way softmax on top, and fine-tune it, you can get to 1.15%. And with more fiddling around, you can get that down to about 1%.
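Here is a minimal sketch (again NumPy, again not the course's code) of the discriminative stage just described: one fixed random permutation is applied to every case to make the task permutation-invariant, a new 10-way softmax layer is placed on top of the pre-trained stack, and all the weights are fine-tuned with back propagation on the cross-entropy error. The random weights standing in for the pre-trained stack, the toy labels, and the learning rate are assumptions made so the snippet runs on its own.

```python
# Minimal sketch of discriminative fine-tuning on a permutation-invariant task.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy data and labels in place of MNIST.
X = (rng.random((200, 784)) < 0.1).astype(float)
y = rng.integers(0, 10, size=len(X))

# One fixed permutation applied identically to every training and test case;
# a method is permutation-invariant if its results are unchanged by this
# shuffle (a convolutional net's results would change: it exploits 2-D layout).
perm = rng.permutation(784)
X = X[:, perm]

# Stand-ins for pre-trained weights; in practice these come from the RBM stack.
sizes = [784, 500, 500, 2000]
Ws = [0.01 * rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]
W_out = 0.01 * rng.standard_normal((sizes[-1], 10))   # new 10-way softmax layer
b_out = np.zeros(10)

lr = 0.05
for epoch in range(10):
    # Forward pass: sigmoid hidden layers, softmax output.
    acts = [X]
    for W, b in zip(Ws, bs):
        acts.append(sigmoid(acts[-1] @ W + b))
    probs = softmax(acts[-1] @ W_out + b_out)

    # Cross-entropy gradient with respect to the output pre-activations.
    delta = probs.copy()
    delta[np.arange(len(y)), y] -= 1.0
    delta /= len(y)

    # Backpropagate and update; pre-trained weights only need slight changes.
    dW_out, db_out = acts[-1].T @ delta, delta.sum(axis=0)
    delta = (delta @ W_out.T) * acts[-1] * (1.0 - acts[-1])
    W_out -= lr * dW_out
    b_out -= lr * db_out
    for i in reversed(range(len(Ws))):
        dW, db = acts[i].T @ delta, delta.sum(axis=0)
        if i > 0:
            delta = (delta @ Ws[i].T) * acts[i] * (1.0 - acts[i])
        Ws[i] -= lr * dW
        bs[i] -= lr * db
```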
So you can do a lot better than standard back propagation, and also better than support vector machines, by using generative pre-training followed by discriminative fine-tuning. Marc'Aurelio Ranzato, working in Yann LeCun's group, also showed, using a slightly different pre-training method, that pre-training helps for models that have more data and better priors. They used an additional 60,000 distorted digit images, so they had a lot more training data, and they also used a convolutional multilayer network, and Yann's group is the best at tuning those. With back propagation they managed to get down to 0.49%. When they did the unsupervised layer-by-layer pre-training followed by back propagation, they got down to 0.39%, which at the time was a record.

You may remember this picture from the first lecture; it was one of the examples I gave of the successes of neural nets. It's the same picture. Back then, I said we could get down to 20.7% by pre-training and then fine-tuning with back propagation, and that the previous speaker-independent record on TIMIT was 24.4%, which actually required averaging several models. Li Deng at Microsoft Research picked up on this result immediately and collaborated on improving it, and this has led to a big change in speech recognition. If you look at this news story, it will refer you to a blog where the chief research officer for Microsoft talks about the big improvements in speech recognition caused by using deep neural nets.