In the previous lecture, we showed the equivalence of Gaussian dropout with a special kind of variational Bayesian inference. We proved that Gaussian dropout effectively optimizes the following ELBO, and in this ELBO the second term doesn't depend on theta, so it can be ignored if we optimize only with respect to theta. Now the question is: why not optimize with respect to both theta and alpha? Remember that our variational approximation, our q(W), depends on theta and also on alpha, and the more variational parameters we have, the better we can fit the true posterior distribution, so our approximation can only get better and better. So why not optimize the ELBO with respect to both theta and alpha?

It is important to note that this wasn't possible until we came up with the Bayesian interpretation of Gaussian dropout. Indeed, if we tried to optimize just the first term, the data term, with respect to both theta and alpha, we would quickly end up with zero values of alpha. Why so? Because the maximum value of the data term is achieved when our distribution is a delta function centered at the maximum likelihood weights, and a delta function means zero variance, and zero variance means zero alpha. So we may obtain non-zero values of alpha only if we optimize both terms: the data term and the regularizer.

So now our variational approximation looks as follows: a fully factorized Gaussian distribution over all weights w_ij with mean theta_ij and variance alpha * theta_ij^2. But we may go even further. Why not assign an individual dropout rate to each weight? That is, why not let the variational approximation be a fully factorized Gaussian distribution over w_ij with mean theta_ij and variance alpha_ij * theta_ij^2? In this way we assign an individual dropout rate, an individual alpha, to each of the weights. And again, this only makes our approximation tighter: we can only come closer to the true posterior distribution.

But before we proceed, let us examine the behavior of our regularizer as a function of alpha. Remember that we may approximate it with a smooth differentiable function, and we see that the maximum value of this regularizer is achieved as alpha goes to plus infinity.
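To keep the notation in one place, here is a short written summary of the objective and the per-weight variational family just described (a sketch in the lecture's notation; the smooth differentiable approximation to the KL term itself is not written out here):

```latex
% ELBO maximized in (sparse) variational dropout: data term minus regularizer.
\mathcal{L}(\theta, \alpha)
  = \underbrace{\mathbb{E}_{q(W \mid \theta, \alpha)}\bigl[\log p(Y \mid X, W)\bigr]}_{\text{data term}}
  \;-\;
  \underbrace{\mathrm{KL}\bigl(q(W \mid \theta, \alpha)\,\big\|\,p(W)\bigr)}_{\text{regularizer}}

% Fully factorized variational family with an individual alpha per weight,
% and the improper log-uniform prior from the previous lecture.
q(W \mid \theta, \alpha) = \prod_{i,j} \mathcal{N}\!\bigl(w_{ij} \mid \theta_{ij},\; \alpha_{ij}\,\theta_{ij}^{2}\bigr),
\qquad
p(W) \propto \prod_{i,j} \frac{1}{\lvert w_{ij} \rvert}.
```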
This means that the second term of our ELBO encourages larger values of alpha. And that's quite interesting, because we can easily prove that if alpha_ij goes to plus infinity, then the corresponding theta_ij, which is the mean in our variational approximation, converges to zero, in such a way that alpha_ij * theta_ij^2 also converges to zero. This means that our variational approximation, our q(w_ij), becomes a delta function as alpha_ij goes to plus infinity, and this delta function is centered at zero. A delta function at zero means that the corresponding w_ij is exactly zero, so we may simply skip this connection, that is, remove the corresponding weight from our neural network, thus effectively sparsifying it.

So the whole procedure, which is known as sparse variational dropout, looks as follows. First, we assign a log-uniform prior distribution over the weights, which is a fully factorized prior. Then we fix the variational family of distributions q(W | theta, alpha); again, this is a fully factorized distribution over each weight w_ij with mean theta_ij and variance given by alpha_ij * theta_ij^2. Finally, we perform stochastic variational inference, optimizing the ELBO both with respect to all thetas and with respect to all alphas. In the end, we remove all weights whose alphas exceed some predefined large threshold.

And surprisingly, this procedure works quite well. In this picture you can see the behavior of convolutional kernels from the convolutional layers and fragments of the weight matrix from the fully connected layers. You can see that as training progresses, more and more weights, and more and more coefficients of the convolutional kernels, converge to zero. The compression rate in fact exceeds 200, and note that the accuracy doesn't decrease: we keep the same accuracy as the baseline while effectively compressing the whole network by hundreds of times. This only became possible thanks to this Bayesian dropout.
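To make the procedure more concrete, here is a minimal sketch of a sparse-variational-dropout linear layer, assuming PyTorch. The class name SparseVDLinear, the initialization values, and the pruning threshold are illustrative choices, not the exact implementation behind the experiments above.

```python
# Minimal sketch of a sparse-variational-dropout linear layer (illustrative only).
# Each weight w_ij has mean theta_ij and variance alpha_ij * theta_ij^2; training
# maximizes (data term - KL), and weights with large alpha are pruned.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseVDLinear(nn.Module):
    def __init__(self, in_features, out_features, threshold=3.0):
        super().__init__()
        self.theta = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        # Parameterize log(sigma_ij^2) = log(alpha_ij * theta_ij^2) for stability.
        self.log_sigma2 = nn.Parameter(torch.full((out_features, in_features), -10.0))
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.threshold = threshold  # prune weights whose log(alpha) exceeds this

    @property
    def log_alpha(self):
        # log(alpha_ij) = log(sigma_ij^2) - log(theta_ij^2)
        return torch.clamp(self.log_sigma2 - torch.log(self.theta ** 2 + 1e-8), -10.0, 10.0)

    def forward(self, x):
        if self.training:
            # Local reparameterization: sample the pre-activations directly.
            mean = F.linear(x, self.theta, self.bias)
            var = F.linear(x ** 2, torch.exp(self.log_sigma2)) + 1e-8
            return mean + var.sqrt() * torch.randn_like(mean)
        # At test time, use the means and zero out the pruned weights.
        mask = (self.log_alpha < self.threshold).float()
        return F.linear(x, self.theta * mask, self.bias)

    def kl(self):
        # Smooth differentiable approximation of KL(q || log-uniform prior),
        # using the constants from the published approximation.
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        la = self.log_alpha
        neg_kl = k1 * torch.sigmoid(k2 + k3 * la) - 0.5 * F.softplus(-la) - k1
        return -neg_kl.sum()
```

During training one would minimize the task loss plus the sum of kl() over such layers (possibly annealing the KL weight), and after training the masked weights can be removed entirely, which is where the compression comes from.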
So to conclude, it is known that modern deep architectures are very redundant, but it is quite problematic to remove this redundancy, and one of the most successful ways to do it is Bayesian dropout, or sparse variational dropout. Variational Bayesian inference is a highly scalable procedure that allows us to optimize millions of variational parameters, and this is just one of many examples of a successful combination of Bayesian methods and deep learning. You may find more examples of successful applications of Bayesian methods to deep neural networks in the additional reading materials.