So let's see how we can improve the idea of variational inference so that it becomes applicable to our latent variable model. Again, the idea of variational inference is to maximize a lower bound on the thing we actually want to maximize, subject to the constraint that the variational distribution q for each object is factorized, that is, a product of one-dimensional distributions. And let's emphasize that each object has its own individual variational distribution q_i, and these distributions are not connected to each other in any way.

So, one idea we can use here is as follows. If saying that the variational distribution q of each object is factorized is not enough, let's approximate it even further and say that it's a Gaussian. So not only factorized, but a factorized Gaussian. This way everything should be easier, right? So every object has its own latent variable t_i, and this latent variable t_i will have a variational distribution q_i, which is a Gaussian with some parameters m_i and s_i, and these are parameters of our model which we want to train. Then we will maximize our lower bound with respect to these parameters.

So it's a nice idea, but the problem here is that we have just added a lot of parameters for each training object. For example, if your latent variable t_i is 50-dimensional, so it's a vector with 50 numbers, then you have just added 50 numbers for the vector m_i for each object, and 50 numbers for the vector s_i for each object. So 100 numbers, 100 parameters, for each training object. And if you have a million training objects, it's not a very good idea to add something like 100 million parameters to your model just because of some approximation, right? It will probably overfit, and it will probably be really hard to train because of this really high number of parameters. And also, it's not obvious how to find these parameters m and s for new objects to do inference, to do some predictions or generation, because for a new object you have to solve some optimization problem again to find these parameters, and that can be slow.
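To make the parameter blow-up concrete, here is a minimal sketch in Python of the "one variational Gaussian per object" idea, assuming a hypothetical dataset of a million objects and a 50-dimensional latent code (the array names are just for illustration):

```python
import numpy as np

# Hypothetical sizes: n_objects training objects, latent_dim-dimensional t_i.
n_objects = 1_000_000
latent_dim = 50

# Per-object variational parameters: a mean m_i and a (log) std s_i for each q_i(t_i).
m = np.zeros((n_objects, latent_dim))      # 50 extra numbers per object
log_s = np.zeros((n_objects, latent_dim))  # another 50 extra numbers per object

extra_params = m.size + log_s.size
print(extra_params)  # 100,000,000 extra parameters, just because of the approximation
```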
Okay, so we said that approximating the variational distribution with a factorized one is not enough. Approximating the factors of the variational distribution with Gaussians is nice, but then we have too many parameters for each object, because these Gaussians are not connected to each other; they have separate parameters. So let's try to connect the variational distributions q_i of the individual objects. One way to do that is to say that they are all the same, so all the q_i's are equal to each other. We can do that, but it would be too restrictive; we would not be able to train anything meaningful. Another approach is to say that all q_i's are the same distribution, but one that depends on x_i and on some weights. So let's say that each q_i is a normal distribution whose parameters somehow depend on x_i. It turns out that now each q_i is different, but they all share the same parameterization, so they all share the same form. And now, even for a new object, we can easily find its variational approximation q: we can pass this new object through the function m and through the function s, and obtain the parameters of its Gaussian. This way, we now need to maximize our lower bound with respect to our original parameters w and with respect to this parameter phi, which defines the parametric way in which we convert x_i into the parameters of the distribution.

And how can we define this function m of x_i with parameters phi? Well, as we have already discussed, convolutional neural networks are a really powerful tool for working with images, right? So let's use them here too. Now we will have a convolutional neural network with parameters phi that looks at your original input image, for example of a cat, and then transforms it into the parameters of your variational distribution. And this way we have defined how we approximate the variational distribution q in this form, right? Okay, so let's look closer at the objective we are trying to maximize.
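Before moving on, here is a minimal sketch of what such an encoder network could look like, assuming PyTorch and, for concreteness, 28x28 grayscale images and a 50-dimensional latent code; the layer sizes and names are hypothetical, not the exact architecture from the lecture:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an image x_i to the parameters (m, log s) of q_i(t_i) = N(m(x_i), s(x_i)^2)."""
    def __init__(self, latent_dim=50):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
            nn.ReLU(),
            nn.Flatten(),
        )
        self.to_mean = nn.Linear(64 * 7 * 7, latent_dim)     # m(x_i; phi)
        self.to_log_std = nn.Linear(64 * 7 * 7, latent_dim)  # log s(x_i; phi)

    def forward(self, x):
        h = self.conv(x)
        return self.to_mean(h), self.to_log_std(h)

# Usage: one network, one set of parameters phi, produces q_i's parameters for every object.
encoder = Encoder()
x = torch.randn(8, 1, 28, 28)   # a hypothetical batch of images
m, log_s = encoder(x)           # each row holds the parameters of that object's Gaussian q_i
```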
Recall that the lower bound is, by definition, equal to a sum over the objects in the dataset of the expected value of some logarithm, where the expectation is taken with respect to the variational distribution q_i, right? And recall that in the plain expectation maximization algorithm it was really hard to approximate this expected value by sampling, because the q inside this expected value used to be the true posterior distribution of the latent variable t_i. This true posterior is complicated, and we know it only up to a normalization constant, so we would have to use Markov chain Monte Carlo to sample from it, which is slow. But now we approximate q with a Gaussian with known parameters, which we know how to obtain. For any object, we can pass it through our convolutional neural network with parameters phi, obtain the parameters m and s, and then easily sample from this Gaussian, from this q, to approximate our expected value. So now, instead of an intractable expected value, we can easily approximate it with sampling, because sampling is now cheap: it's just sampling from Gaussians.

And if we recall how the model is defined, the p of x_i given t_i is actually defined by another convolutional neural network. So the overall workflow is as follows. We start with a training image x, and we pass it through the first neural network with parameters phi. We get the parameters m and s of the variational distribution q_i. We sample one data point from this distribution, which is something random; it can be different depending on our random seed or something. Then we pass this just-sampled vector of the latent variable t_i into the second part of our neural network, the convolutional neural network with parameters w. And this CNN, this second part, outputs a distribution over images, and we will actually try to make this whole structure return images that are as close to the input images as possible. So this thing looks really close to something called autoencoders in neural networks, which are just neural networks trying to output something as close as possible to their input.
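Here is a minimal sketch of this workflow, reusing the hypothetical Encoder from above and adding an equally hypothetical decoder; the sampling step just draws t_i from the Gaussian N(m, s^2):

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Maps a latent code t_i to the mean of p(x_i | t_i, w) -- a sketch, not the lecture's exact network."""
    def __init__(self, latent_dim=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 400), nn.ReLU(),
            nn.Linear(400, 28 * 28), nn.Sigmoid(),  # mean image, pixel values in [0, 1]
        )

    def forward(self, t):
        return self.net(t).view(-1, 1, 28, 28)

def forward_pass(encoder, decoder, x):
    # 1) Encoder (parameters phi): image -> parameters of q_i(t_i).
    m, log_s = encoder(x)
    # 2) Sample one point t_i ~ N(m, s^2); this step is random, so it depends on the random seed.
    t = m + torch.exp(log_s) * torch.randn_like(m)
    # 3) Decoder (parameters w): latent code -> reconstruction, i.e. the mean of p(x_i | t_i).
    return decoder(t), m, log_s
```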
And this model is called a variational autoencoder, because in contrast to the usual autoencoder it has some sampling inside and it uses some variational approximations. The first part of this network is called the encoder, because it encodes the image into a latent code, or rather into a distribution over latent codes. And the second part is called the decoder, because it decodes the latent code into an image.

Let's look at what happens if we forget about the variance in the variational distribution q. So let's say that we set s to be always zero, okay? So for any x, s of x is 0. Then the variational distribution q_i is actually deterministic: it always outputs the mean value, m of x_i. And in this case, we are directly passing this m of x into the second part of the network, into the decoder, so we are getting the usual autoencoder, with no stochastic elements inside. So this variance in the variational distribution q is actually what makes this model different from the usual autoencoder.

Okay, so let's look a little bit closer at the objective we are trying to maximize. This lower bound, the variational lower bound, can be decomposed into a sum of two terms, because the logarithm of a product is the sum of logarithms, right? The second term in this equation equals minus the Kullback-Leibler divergence between the variational distribution q_i and the prior distribution p of t_i, just by definition. KL divergence is something we discussed in weeks two and three; it measures a kind of difference between distributions. So when we maximize this minus KL, we are actually trying to minimize the KL, so we are trying to push the variational distribution q_i as close to the prior as possible. And the prior is just the standard normal, as we decided, okay? That is the second term. The first term can be interpreted as follows: if for simplicity we set all the output variances to be 1, then this log-likelihood of x_i given t_i is just minus the squared Euclidean distance between x_i and the predicted mu of t_i, up to constants. So this term is actually a reconstruction loss.
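To make the two terms concrete, here is a minimal sketch of this objective under the assumptions just stated: unit output variance, a standard normal prior, and a diagonal Gaussian q_i. It reuses the m and log_s produced by the hypothetical encoder above, and the closed-form KL divergence between a diagonal Gaussian and the standard normal:

```python
import torch

def negative_lower_bound(x, reconstruction, m, log_s):
    """-(reconstruction term) + KL(q_i || N(0, I)), summed over a batch -- a sketch."""
    # Reconstruction term: with unit output variance, log p(x_i | t_i) is minus the squared
    # Euclidean distance between x_i and the predicted mean, up to constants.
    reconstruction_loss = ((x - reconstruction) ** 2).sum()

    # KL( N(m, s^2) || N(0, 1) ) for a diagonal Gaussian, in closed form.
    s2 = torch.exp(2 * log_s)
    kl = 0.5 * (s2 + m ** 2 - 1 - 2 * log_s).sum()

    # We maximize the lower bound, i.e. minimize this quantity.
    return reconstruction_loss + kl
```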
This reconstruction term tries to push x_i as close to the reconstruction as possible, and mu of t_i is just the mean output of our neural network. So if we consider our whole variational autoencoder, it takes as input an image x_i and then outputs mu of t_i plus some noise. And if the noise is constant, then while training this model we are just trying to make x_i as close to mu of t_i as possible, which is basically the objective of the usual autoencoder. And note that we are also computing the expected value of this reconstruction loss with respect to q_i, and q_i is trying to approximate the posterior distribution of the latent variables. So we are saying that for the latent variables t_i that are likely to have caused x_i, according to our approximation q_i, we want the reconstruction loss to be low. So for these particular, sensible t_i's for this particular x_i, we want the reconstruction to be accurate. And this is kind of the same, well, not the same, but really close to the usual autoencoder.

But the second part is what makes the difference. This Kullback-Leibler divergence is something that pushes q_i to be non-deterministic, to be stochastic. So recall the idea that if we set the variance of q_i to zero, we get the usual autoencoder, right? But why, while training the model, would it not choose to do that? Because if you reduce the amount of noise inside, it will be easier to train. So why would it choose not to inject noise into itself? Well, because of this regularization. This KL divergence will not allow q_i to become deterministic, because if the variance of q_i is zero, then this KL term is just infinity, and we will not choose that point in parameter space. This regularization forces the overall structure to have some noise inside.
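As a quick numeric illustration of why the KL term forbids a deterministic q_i (a sketch, not part of the lecture): for a one-dimensional q_i = N(0, s^2), the closed-form KL to the standard normal grows without bound as s goes to zero.

```python
import math

def kl_to_standard_normal(m, s):
    """KL( N(m, s^2) || N(0, 1) ) for a one-dimensional Gaussian."""
    return 0.5 * (s ** 2 + m ** 2 - 1) - math.log(s)

for s in (1.0, 0.1, 0.01, 0.001):
    print(f"s = {s:>6}: KL = {kl_to_standard_normal(0.0, s):.2f}")
# The KL grows roughly like -log s, so a zero-variance (deterministic) q_i costs infinity.
```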
And also notice that because of this KL divergence, because we are forcing our q_i to be close to the standard Gaussian, we can now detect outliers. If we have a usual image from the training dataset, or something close to the training data, then when we pass this image through our encoder, it will output a distribution q_i which is close to the standard Gaussian, because we trained it this way: during training we tried to force all these distributions to lie close to the standard Gaussian. But for a new image that the network never saw, some suspicious behaviour or something else that the convolutional neural network of the encoder never saw, it can output a distribution over t_i that is as far away from the standard Gaussian as it wants, because it was not trained to make such distributions close to the Gaussian. So by looking at the distance between the variational distribution q_i and the standard Gaussian, you can understand how anomalous this point is, and you can detect outliers.

And also note that it's quite easy to generate new points, namely to hallucinate new data, in this kind of model. Because your model is defined this way, as an integral with respect to p of t, you can make a new point, a new image, in two steps. First of all, sample t_i from the prior, from the standard normal, and then just pass this sample from the standard Gaussian through your decoder network to decode your latent code into an image, and you will get some new sample, some new fake image or something.
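A minimal sketch of these two uses, assuming the hypothetical encoder and decoder from above: generation samples t from the standard normal prior and decodes it, and the outlier score is just the KL divergence between q_i and the standard normal.

```python
import torch

def generate(decoder, n_samples=16, latent_dim=50):
    """Hallucinate new images: sample t ~ N(0, I) from the prior, then decode."""
    t = torch.randn(n_samples, latent_dim)
    return decoder(t)

def outlier_score(encoder, x):
    """KL( q_i || N(0, I) ) per object: large values suggest x is unlike the training data."""
    m, log_s = encoder(x)
    s2 = torch.exp(2 * log_s)
    return 0.5 * (s2 + m ** 2 - 1 - 2 * log_s).sum(dim=1)
```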