In this video, I'll talk about another way of restricting the capacity of a neural network. We can do that by adding noise, either to the weights or to the activities. I'll start by showing that if we add noise to the inputs of a simple linear network that's trying to minimize the squared error, that's exactly equivalent to imposing an L2 penalty on the weights of the network. I'll then describe uses of noisy weights in more complicated networks, and I'll finish by describing a recent discovery that extreme noise in the activities can also be a very good regularizer.

So let's look at what happens if we add Gaussian noise to the inputs of a simple neural network. The variance of the noise gets amplified by the squared weights on the connections going into the next layer. If we have a very simple net, with just a linear output unit that's directly connected to the inputs, the amplified noise then gets added to the output. So if you look at the diagram, we put in an input x_i with additional Gaussian noise that's sampled from a Gaussian with zero mean and variance sigma_i^2. That additional noise has its variance multiplied by the squared weight as it goes through to the linear output unit j. So what comes out of j is the y_j that would have come out before, plus Gaussian noise that has zero mean and variance w_i^2 sigma_i^2.

This additional variance makes an additive contribution to the squared error. You can think of it like Pythagoras' theorem: the expected squared error is the sum of the squared error caused by y_j and the squared value of this additional noise, because the noise is independent of y_j. So when we minimize the total squared error, we'll be minimizing the squared error we would get with a noise-free system, and in addition we'll be minimizing that second term. That is, we'll be minimizing the expected squared value of that second term, and that expected squared value is just w_i^2 sigma_i^2, so it corresponds to an L2 penalty on w_i with a penalty strength of sigma_i^2.

For those of you who like math, I'm going to derive that on this slide. If you don't like math, you can just skip this slide. The output, y^noisy, when we add noise to all of the inputs, is just what the output would have been in the noise-free system, the sum over all i of w_i x_i, plus the sum over all i of w_i times the noise epsilon_i that we added to each input; each epsilon_i is sampled from a Gaussian with zero mean and variance sigma_i^2. So if we compute the expected squared difference between y^noisy and the target value t, that's the quantity shown on the left-hand side of the equation. I'm using an E followed by square brackets to mean an expectation (that's not an error, it's an expectation), and what we're computing the expectation of is the thing inside the square brackets. So in this case, we're computing the expectation of the squared error that we'll get with the noisy system. If we substitute the equation above for y^noisy, we need the expectation of the square of the quantity (y - t) plus the sum over all i of w_i epsilon_i. When we expand that square, the first term we get is (y - t)^2, and that doesn't need to be inside the expectation brackets because it doesn't involve any noise. The second term is the cross term, twice the product of those two pieces, and the third term is the square of the sum over i of w_i epsilon_i. Now that equation simplifies a lot. In fact, it simplifies down to the normal squared error plus the expectation of w_i^2 epsilon_i^2 summed over all i. The reason it simplifies is that epsilon_i is independent of epsilon_j.
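For reference, here is the algebra from this slide written out in one place, as a compact restatement in the lecture's own notation (the steps where the expectations drop out are the ones explained next):

```latex
% y^{noisy} = \sum_i w_i x_i + \sum_i w_i \epsilon_i,
% with y = \sum_i w_i x_i and \epsilon_i \sim \mathcal{N}(0, \sigma_i^2).
\begin{align*}
E\big[(y^{\text{noisy}} - t)^2\big]
  &= E\Big[\big((y - t) + \textstyle\sum_i w_i \epsilon_i\big)^2\Big] \\
  &= (y - t)^2
     + E\Big[2 (y - t) \textstyle\sum_i w_i \epsilon_i\Big]
     + E\Big[\big(\textstyle\sum_i w_i \epsilon_i\big)^2\Big] \\
  &= (y - t)^2 + E\Big[\textstyle\sum_i w_i^2 \epsilon_i^2\Big]
   \;=\; (y - t)^2 + \textstyle\sum_i w_i^2 \sigma_i^2 .
\end{align*}
```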
So if you look at the last term, when we multiply out that square, all of the cross terms have an expected value of zero, because we're multiplying together two independent things that each have zero mean. If you look at the middle term, that also has an expectation of zero, because each epsilon_i is independent of the residual error (y - t). So we can rewrite the expectation of the sum over all i of w_i^2 epsilon_i^2 as simply the sum over all i of w_i^2 sigma_i^2, because the expectation of epsilon_i^2 is just sigma_i^2; that's how we generated epsilon_i. And so we see that the expected squared error is just the squared error we get in the noise-free system, plus this additional term, and that additional term looks just like an L2 penalty on the w_i, with sigma_i^2 being the strength of the penalty.

In more complex nets, we can restrict the capacity by adding Gaussian noise to the weights. This isn't exactly equivalent to an L2 penalty, but it seems actually to work better, especially in recurrent networks. So Alex Graves recently took his recurrent net that recognizes handwriting and tried it with noise added to the weights, and it actually works better.

We can also use noise in the activities as a regularizer. So suppose we use backpropagation to train a multilayer net with logistic hidden units. What's going to happen if we make the units binary and stochastic on the forward pass, but then do the backward pass as if we'd done the normal deterministic forward pass, using the real values? So we treat a logistic unit, in the forward pass, as if it were a stochastic binary neuron. That is, we compute the output of the logistic, p, and then we treat that p as the probability of outputting a one. In the forward pass, we make a random decision whether to output a one or a zero using that probability. But in the backward pass, we use the real value of p for backpropagating derivatives through the hidden unit. This isn't exactly correct, but it's close to being the correct thing to do for the stochastic system, provided all of the units make small contributions to each unit in the layer above.

When we do this, the performance on the training set is worse and training is considerably slower, maybe several times slower. But it does significantly better on the test set. This is currently an unpublished result.
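To make the forward/backward trick just described concrete, here is a minimal sketch for a single hidden layer. The layer sizes, learning rate, and the choice to use p (rather than the sampled binary state) when computing the gradient for the output weights are all assumptions for illustration, not the lecture's actual recipe; the key point is that the forward pass samples binary activities while the backward pass uses the deterministic logistic values.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical layer sizes, just for illustration.
n_in, n_hid, n_out = 5, 8, 1
W1 = rng.normal(0.0, 0.1, size=(n_in, n_hid))
W2 = rng.normal(0.0, 0.1, size=(n_hid, n_out))

def forward_backward(x, t):
    """One pass with stochastic binary hidden units on the way forward,
    and the usual deterministic logistic values on the way back."""
    # Forward pass: compute logistic probabilities, then sample binary states.
    p = logistic(x @ W1)                           # real-valued probabilities
    h = (rng.random(p.shape) < p).astype(float)    # stochastic binary activities
    y = h @ W2                                     # linear output, squared error loss

    # Backward pass: pretend the forward pass used p instead of the samples h.
    dy = y - t                                     # d(0.5 * (y - t)^2) / dy
    dW2 = np.outer(p, dy)                          # use p, not the sampled h
    dz = (dy @ W2.T) * p * (1.0 - p)               # logistic derivative at p
    dW1 = np.outer(x, dz)
    return dW1, dW2

# Usage with made-up data:
x, t = rng.normal(size=n_in), np.array([1.0])
dW1, dW2 = forward_backward(x, t)
W1 -= 0.1 * dW1
W2 -= 0.1 * dW2
```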
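Going back to the weight-noise idea mentioned a moment ago, here is a similarly minimal sketch of one simple way to apply it, again with illustrative names and noise levels rather than the recipe Graves actually used: sample fresh zero-mean Gaussian noise for the weights on each training step, do the forward and backward pass with the noisy weights, and apply the resulting gradient to the clean copy of the weights.

```python
import numpy as np

rng = np.random.default_rng(1)

def weight_noise_step(W, x, t, sigma=0.05, lr=0.01):
    """One squared-error training step for a linear unit, with Gaussian
    noise added to the weights (sigma and lr are illustrative values)."""
    W_noisy = W + rng.normal(0.0, sigma, size=W.shape)  # fresh noise each step
    y = x @ W_noisy                                     # forward with noisy weights
    grad = np.outer(x, y - t)                           # d(0.5 * (y - t)^2) / dW
    return W - lr * grad                                # update the clean weights

# Usage with made-up data:
W = rng.normal(0.0, 0.1, size=(5, 1))
x, t = rng.normal(size=5), np.array([0.0])
W = weight_noise_step(W, x, t)
```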