In this video, I'll talk about another way of restricting the capacity of a neural network. We can do that by adding noise, either to the weights or to the activities. I'll start by showing that if we add noise to the inputs of a simple linear network that's trying to minimize the squared error, that's exactly equivalent to imposing an L2 penalty on the weights of the network. I'll then describe the use of noisy weights in more complicated networks, and I'll finish by describing a recent discovery that extreme noise in the activities can also be a very good regularizer.

So let's look at what happens if we add Gaussian noise to the inputs of a simple neural network. The variance of the noise gets amplified by the squared weights on the connections going into the next hidden layer.

If we have a very simple net, with just a linear output unit that's directly connected to the inputs, the amplified noise gets added to the output. So if you look at the diagram, we put in an input x_i with additive Gaussian noise sampled from a Gaussian with zero mean and variance σ_i^2. That additive noise has its variance multiplied by the squared weight. It then goes through the linear output unit j, and so what comes out of j is the y_j that would have come out before, plus Gaussian noise that has zero mean and variance w_i^2 σ_i^2.

This additional variance makes an additive contribution to the squared error. You can think of it like Pythagoras' theorem: the squared error is going to be the sum of the squared error caused by y_j and the squared error caused by this additional noise, because the noise is independent of y_j. So when we minimize the total squared error, we'll be minimizing the squared error that would come out of a noise-free system, and in addition we'll be minimizing the expected squared value of that second term. That expected squared value is just w_i^2 σ_i^2, so this corresponds to an L2 penalty on w_i with a penalty strength of σ_i^2.
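As a quick sanity check on that equivalence, here is a small NumPy sketch of my own (it is not from the lecture): it compares a Monte Carlo estimate of the expected squared error of a linear unit with noisy inputs against the noise-free squared error plus the predicted penalty term, the sum over i of w_i^2 σ_i^2. The weights, input, target, and noise levels are arbitrary made-up numbers.

```python
# Sketch: expected squared error with Gaussian input noise on a linear unit
# should equal the noise-free squared error plus sum_i w_i^2 * sigma_i^2.
import numpy as np

rng = np.random.default_rng(0)

w = np.array([0.5, -1.2, 2.0])        # weights of the linear unit (made up)
x = np.array([1.0, 0.3, -0.7])        # one input vector (made up)
t = 0.25                              # target value (made up)
sigma = np.array([0.1, 0.2, 0.05])    # per-input noise standard deviations

y_clean = w @ x
clean_error = (y_clean - t) ** 2
penalty = np.sum(w ** 2 * sigma ** 2)  # predicted extra error: sum_i w_i^2 sigma_i^2

# Monte Carlo estimate of the expected squared error with noisy inputs.
n_samples = 1_000_000
noise = rng.normal(0.0, sigma, size=(n_samples, 3))
y_noisy = (x + noise) @ w
noisy_error = np.mean((y_noisy - t) ** 2)

print(f"noise-free error + penalty: {clean_error + penalty:.5f}")
print(f"Monte Carlo noisy error:    {noisy_error:.5f}")  # should match closely
```

The two printed numbers should agree to within Monte Carlo error, which is just the derivation below checked numerically.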
For those of you who like math, I'm going to derive that on this slide. If you don't like math, you can just skip this slide.

The output y_noisy, when we add noise to all of the inputs, is just what the output would have been with a noise-free system, the sum over all inputs of w_i x_i, plus w_i times the noise ε_i that we added to each input. Those noises are sampled from a Gaussian with zero mean and variance σ_i^2. So we compute the expected squared difference between y_noisy and the target value t; that's the quantity shown on the left-hand side of the equation. I'm using an E followed by square brackets to mean an expectation (that E stands for an expectation, not the error), and what we're computing the expectation of is the thing inside the square brackets. So in this case, we're computing the expectation of the squared error that we'll get with the noisy system.

If we substitute the equation above for y_noisy, we need the expectation of ((y − t) + Σ_i w_i ε_i)^2. When we complete the square, the first term we get is (y − t)^2, and that doesn't need to sit inside the expectation brackets because it doesn't involve any noise. The second term is the cross product of the two terms above, and the third term is the square of the last term.

Now that equation simplifies a lot. In fact, it simplifies down to the normal squared error, plus the expectation of Σ_i w_i^2 ε_i^2. The reason it simplifies is that ε_i is independent of ε_j. So if you look at the last term, when we multiply out that square, all of the cross terms have an expected value of zero, because we're multiplying together two independent things that are zero-mean. If you look at the middle term, that also has an expectation of zero, because each of the ε_i is independent of the residual error (y − t). So we can rewrite the expectation of Σ_i w_i^2 ε_i^2 as simply Σ_i w_i^2 σ_i^2, because the expectation of ε_i^2 is just σ_i^2; that's how we generated ε_i. And so we see that the expected squared error is just the squared error we get in the noise-free system, plus an additional term. And that additional term looks just like an L2 penalty on the w_i, with σ_i^2 being the strength of the penalty.

In more complex nets, we can restrict the capacity by adding Gaussian noise to the weights. This isn't exactly equivalent to an L2 penalty, but it seems to work better in practice, especially in recurrent networks. Alex Graves recently took his recurrent net that recognizes handwriting and tried it with noise added to the weights, and it actually works better.
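The lecture doesn't give code for weight noise, so here is a toy sketch of my own showing only the mechanics, on a single logistic unit with made-up data and a made-up noise_std hyperparameter. Graves's handwriting system is a recurrent net, not this toy model; the point of the sketch is just where the noise enters the training loop.

```python
# Sketch of weight noise: on every forward pass the weights are perturbed with
# zero-mean Gaussian noise, the gradient is computed with the perturbed weights,
# and the update is applied to the clean weights.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary classification data (made up for illustration).
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
labels = (sigmoid(X @ true_w) > 0.5).astype(float)

w = np.zeros(5)
lr, noise_std = 0.1, 0.05   # assumed hyperparameters

for step in range(2000):
    w_noisy = w + rng.normal(0.0, noise_std, size=w.shape)  # perturb the weights
    p = sigmoid(X @ w_noisy)                                 # forward pass with noisy weights
    grad = X.T @ (p - labels) / len(labels)                  # cross-entropy gradient
    w -= lr * grad                                           # update the clean weights

print(np.round(w, 2))
```

In this single-unit case the noise mostly just jitters the gradient; the regularizing benefit the lecture refers to shows up in deeper and especially recurrent networks, where the noise discourages the net from relying on precisely tuned weights.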
We can also use noise in the activities as a regularizer. So suppose we use backpropagation to train a multilayer net with logistic hidden units. What's going to happen if we make the units binary and stochastic on the forward pass, but then do the backward pass as if we'd done the normal deterministic forward pass, using the real values?

So we treat a logistic unit, in the forward pass, as if it were a stochastic binary neuron. That is, we compute the output p of the logistic, and then we treat that p as the probability of outputting a one. In the forward pass, we make a random decision whether to output a one or a zero using that probability. But in the backward pass, we use the real value of p for backpropagating derivatives through the hidden unit. This isn't exactly correct, but it's close to being the correct thing to do for the stochastic system if all of the units make small contributions to each unit in the layer above.

When we do this, the performance on the training set is worse and training is considerably slower, maybe several times slower. But it does significantly better on the test set. This is currently an unpublished result.
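Here is a minimal sketch of my own of that forward/backward trick, for one hidden layer of logistic units feeding a linear output trained with squared error. The architecture, data, and learning rate are assumptions, and using the real-valued p everywhere in the backward pass is my reading of "as if we'd done the normal deterministic forward pass" above.

```python
# Sketch: stochastic binary hidden activities on the forward pass,
# backward pass computed as if the hidden units had output their real values p.
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny synthetic regression problem (made up for illustration).
X = rng.normal(size=(100, 4))
t = np.sin(X @ np.array([1.0, -1.0, 0.5, 2.0]))

W1 = 0.1 * rng.normal(size=(4, 8))   # input -> logistic hidden layer
W2 = 0.1 * rng.normal(size=(8, 1))   # hidden -> linear output
lr = 0.05

for step in range(1000):
    # Forward pass: sample binary hidden activities from the logistic probabilities.
    p = sigmoid(X @ W1)
    h = (rng.random(p.shape) < p).astype(float)   # stochastic binary activities
    y = h @ W2                                    # linear output uses the binary values

    # Backward pass: pretend the forward pass had used the real values p.
    dy = (y - t[:, None]) / len(X)                # gradient of mean squared error w.r.t. y
    dW2 = p.T @ dy                                # use p, not h, as the hidden activity
    dh = dy @ W2.T
    dp = dh * p * (1 - p)                         # logistic derivative at the real values
    dW1 = X.T @ dp

    W2 -= lr * dW2
    W1 -= lr * dW1
```

As the lecture notes, this trains more slowly and reaches a worse training error than the deterministic net, but the noise in the activities acts as a strong regularizer on the test set.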