In the last video of this week, let's discuss how we can apply Markov chain Monte Carlo to Bayesian neural networks.

So this is your usual neural network, and it has a weight on each edge, right? Each connection has some weight, which we train while fitting the neural network to our data. Bayesian neural networks, instead of fixed weights, have distributions over weights. So we treat the weights w as a latent variable, and then, to make predictions, we marginalize w out. This way, instead of a hard-set value for w11, like three, we have a posterior distribution over w, which we use to obtain the predictions.

So, to make a prediction for a new data object x, using the training data X_train and Y_train, we do the following. We say that this predictive distribution equals an integral where we marginalize over w: we consider all possible values of the weights w and average the predictions with respect to them. Here p(y | x, w) is the usual neural network output: you take your image x, for example, pass it through the neural network with parameters w, and record its predictions. You do that for all possible values of the parameters w (there are infinitely many of them), and for each one you pass your image through the corresponding network and write down the prediction. Then you average all these predictions with weights given by the posterior distribution over w, which basically tells us how probable each particular w is according to the training data.

So you have a kind of infinitely large ensemble of neural networks with all possible weights, where the importance of each network is proportional to the posterior probability of its weights. This is full Bayesian inference applied to neural networks, and it gives us some of the usual benefits of the Bayesian approach: we can estimate uncertainty, tune some hyperparameters naturally, and so on.
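To make that concrete, here is the integral written out in the notation used above (y is the prediction for the new object x, and w ranges over all possible weight configurations):

```latex
% Predictive distribution of a Bayesian neural network: the usual network
% output p(y | x, w) averaged over the posterior on the weights w.
p(y \mid x, X_{\text{train}}, Y_{\text{train}})
  = \int p(y \mid x, w)\, p(w \mid X_{\text{train}}, Y_{\text{train}})\, \mathrm{d}w
```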
And we may notice here that this prediction, this integral, equals the expected value of the output of your neural network with respect to the posterior distribution over w. So basically it's the expected output of your neural network, with the weights distributed according to the posterior.

To compute it, let's use your favorite Markov chain Monte Carlo procedure. Let's approximate this expected value with sampling, for example with Gibbs sampling. If we acquire a few samples from the posterior distribution over w, we can use each of these w's as the weights of a neural network. So if we have, say, 10 samples, each sample defines one network, and for a new image we can just pass it through all 10 networks and average their predictions to get an approximation of the full Bayesian inference with the integral.

And how can we sample from the posterior? Well, we know it up to the normalization constant, as usual. The posterior over w is proportional to the likelihood, basically the probability that the network with parameters w assigns to the training data, times the prior p(w), which you can define as you wish, for example a standard normal distribution. You would have to divide by a normalization constant, which you don't know, but that's okay because Gibbs sampling doesn't care, right?

So it's a valid approach, but I think the problem here is that Gibbs sampling, or Metropolis-Hastings sampling for that matter, depends on the whole data set to make its steps, right? We discussed at the end of the previous video that sometimes Gibbs sampling is okay with using mini-batches to make moves, but sometimes it's not. And as far as I know, in Bayesian neural networks it's not a good idea to use Gibbs sampling with mini-batches. So we'll have to do something else. When we run our Bayesian neural network on a large data set, we don't want to spend time proportional to the size of the whole data set on each iteration of training. We want to avoid that.
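As a rough sketch (not from the lecture; log_likelihood, log_prior, and predict are placeholder functions you would supply), the unnormalized log-posterior that such a sampler needs, and the Monte Carlo average over sampled weights, could look like this:

```python
import numpy as np

def log_unnormalized_posterior(w, X_train, Y_train, log_likelihood, log_prior):
    """log p(w | X_train, Y_train) up to an additive constant:
    log-likelihood of the training data plus log-prior on the weights.
    MCMC methods such as Gibbs or Metropolis-Hastings only need this
    unnormalized quantity."""
    return log_likelihood(w, X_train, Y_train) + log_prior(w)

def predict_bayesian(x_new, weight_samples, predict):
    """Approximate the predictive integral by averaging the outputs of the
    networks defined by each posterior sample of the weights."""
    return np.mean([predict(x_new, w) for w in weight_samples], axis=0)
```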
So let's see what else we can do. And here comes the really nice idea of something called Langevin Monte Carlo. It works as follows. Say we want to sample from the posterior distribution p(w | data), where the data is the training set X_train and Y_train. We start from some initial value for the weights w, and then iterate updates of the following form: the new w equals the previous w, plus epsilon, which plays the role of a learning rate, times the gradient of the logarithm of the posterior, plus some random noise.

The first part of this expression is just the usual gradient ascent used to train the weights of your neural network, and you can see that clearly here. If you look at the log-posterior, log p(w | data), it equals the logarithm of the prior plus the logarithm of the conditional distribution p(y | x, w), using the property that the logarithm of a product is the sum of logarithms. There should also be a normalization constant here, but it is constant with respect to our optimization problem, so we don't care about it, right? In practice the first term, the prior, if you take the logarithm of a standard normal distribution for example, just gives you a constant times the squared Euclidean norm of the weights w, so it's the usual weight decay that people often use in neural networks. And the second term is the usual cross-entropy, the usual objective people use to train neural networks.

So this particular update is just gradient descent, or rather ascent, with step size epsilon, applied to your neural network to find good values for the parameters. But on each iteration you add some Gaussian noise with variance epsilon, so proportional to your learning rate. If you do that, and if you choose the learning rate to be infinitely small, you can prove that this procedure will eventually generate samples from the desired distribution p(w | data). And if you omit the noise, you just get the usual gradient ascent.
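To make the update concrete, here is a minimal sketch of one Langevin step, assuming a placeholder function grad_log_posterior(w) that returns the gradient of log p(Y_train | X_train, w) + log p(w) with respect to w:

```python
import numpy as np

def langevin_step(w, grad_log_posterior, eps):
    """One Langevin Monte Carlo update: a gradient-ascent step on the
    log-posterior plus Gaussian noise with variance equal to the step size.
    (Many references put a factor of eps/2 in front of the gradient; the
    lecture's form with a plain eps is kept here.)"""
    noise = np.random.normal(loc=0.0, scale=np.sqrt(eps), size=w.shape)
    return w + eps * grad_log_posterior(w) + noise
```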
And if you use an infinitely small learning rate, you will just converge to the local maximum around the current point, right? But if you add noise on each iteration, theoretically you can end up at any point in the parameter space, any point at all. Of course, with higher probability you will end up somewhere around a local maximum. And if you do that, you will actually sample from the posterior distribution: you will end up at points with high probability more often than at points with low probability.

In practice, you will never use an infinitely small learning rate, of course. One thing you can do about it is to correct this scheme with Metropolis-Hastings. You can say: theoretically I should use an infinitely small learning rate, but I actually use, say, 0.1, so I'm sampling from the wrong distribution and have to correct for that. I can apply a Metropolis-Hastings correction to reject some of the moves, and that guarantees I will sample from the correct distribution. But since we want to do large-scale optimization here and work with mini-batches, we will not use this Metropolis-Hastings correction, because it's not scalable; we'll just use a small learning rate and hope for the best. This way we will not actually draw samples from the true posterior over w, but they will be close enough if the learning rate is small enough, close enough to infinitely small, right?

So the overall scheme is as follows. We initialize the weights of the neural network somehow, and then we do a few iterations or epochs of your favorite SGD, but on each iteration we add some Gaussian noise, with variance equal to the learning rate, to the update. Notice also that you can't change the learning rate at any stage of this procedure, or you will break the properties of the Langevin Monte Carlo idea. Then, after doing a number of iterations, say a hundred of them, you may say: okay, I believe the chain has converged by now, so let's collect the samples from the following iterations and use them as actual samples from the posterior distribution. That's the usual idea of Monte Carlo.
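A hedged sketch of this stochastic-gradient variant (often called SGLD); grad_log_prior and grad_log_lik are placeholder functions, n_total is the full training-set size, and the mini-batch likelihood gradient is rescaled so that it estimates the full-data gradient:

```python
import numpy as np

def sgld_samples(w0, batches, grad_log_prior, grad_log_lik,
                 n_total, eps, n_iters, burn_in):
    """SGD on the log-posterior with Gaussian noise of variance eps added to
    every update, and no Metropolis-Hastings correction. Weights collected
    after burn_in are treated as approximate posterior samples."""
    w = w0.copy()
    samples = []
    for t in range(n_iters):
        x_batch, y_batch = batches[t % len(batches)]
        # Rescale the mini-batch likelihood gradient to estimate the full-data gradient.
        grad = grad_log_prior(w) + (n_total / len(x_batch)) * grad_log_lik(w, x_batch, y_batch)
        noise = np.random.normal(0.0, np.sqrt(eps), size=w.shape)
        w = w + eps * grad + noise  # eps is kept fixed, as the lecture notes
        if t >= burn_in:
            samples.append(w.copy())
    return samples
```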
And then finally, for a new object, you can just average the predictions of your hundred slightly different neural networks on it to get the prediction. But this is really expensive, right? So there is a really nice and cool idea: you can use a separate neural network that approximates the behavior of this ensemble. So we train this Bayesian neural network, and simultaneously we use its behavior to train a student neural network, a usual one, that tries to mimic the behavior of the Bayesian neural network. There are quite a few details on how to do this efficiently, but it's really cool. So if you're interested in this kind of thing, check it out.
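As a rough illustration of that distillation idea (not the lecture's exact recipe; predict and the input arrays are placeholders), the student is trained against soft targets given by the ensemble-averaged predictions:

```python
import numpy as np

def ensemble_targets(x_batch, weight_samples, predict):
    """Soft targets for the student: the ensemble-averaged predictive
    distribution of the sampled Bayesian networks on this batch."""
    return np.mean([predict(x_batch, w) for w in weight_samples], axis=0)

def distillation_loss(student_probs, targets):
    """Cross-entropy between the student's predicted class probabilities and
    the soft targets; minimizing it makes one ordinary network mimic the
    ensemble."""
    return -np.mean(np.sum(targets * np.log(student_probs + 1e-12), axis=-1))
```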