In the previous video, we completely defined our model. Now everything that is left is to understand how to maximize it with respect to the weights of both neural networks, w and phi. So we have to maximize this kind of objective, and since it has an expected value inside, we have to approximate it with Monte Carlo somehow. So let's look closer at this objective.

First of all, one part is easy, because it is just the KL divergence between a Gaussian with known parameters and the standard Gaussian. So although it has an integral inside, we can compute this term analytically, and this expression will not cause us any trouble, either in evaluating it or in finding its gradients with respect to the parameters. We can simply write the KL divergence as its analytical formula and let TensorFlow take care of the gradients.

So let's look a little closer at the first term of this expression, which we call f of the parameters w and phi. This function is a sum over objects of expected values of the logarithm of a probability. And recall that we decided that each q_i for an individual object would be some distribution q of t_i given x_i and phi, which is defined by a convolutional neural network with parameters phi. So let's rewrite it this way, and let's start by looking at the gradient of this function with respect to w.

The gradient of this function with respect to w looks as follows: we have the gradient of a sum of expected values. We write the expected value out by definition. The latent variable t_i is continuous, and thus the expected value is just the integral of the probability times the function, the logarithm of p of x_i given t_i. Now we can move the gradient sign inside the summation, because summation and taking the gradient do not interfere with each other, so we can swap these signs. And also, for smooth and nice functions, we can usually swap the integration and gradient signs as well. Finally, since the first factor, q of t_i given x_i and phi, does not depend on w, we can push the gradient sign even further inside, because this q is just a constant with respect to w.
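Written out, the chain of manipulations just described looks roughly like this (my reconstruction from the narration; the exact notation on the slide may differ):

```latex
\nabla_w f(w, \phi)
  = \nabla_w \sum_i \int q(t_i \mid x_i, \phi)\, \log p(x_i \mid t_i, w)\, dt_i
  = \sum_i \int \nabla_w \bigl[\, q(t_i \mid x_i, \phi)\, \log p(x_i \mid t_i, w) \,\bigr]\, dt_i
  = \sum_i \int q(t_i \mid x_i, \phi)\, \nabla_w \log p(x_i \mid t_i, w)\, dt_i
  = \sum_i \mathbb{E}_{q(t_i \mid x_i, \phi)} \bigl[ \nabla_w \log p(x_i \mid t_i, w) \bigr]
```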
And since q doesn't affect the value of the gradient itself, we just have to multiply the gradient of the logarithm by this value. And now we can see that what we obtained is just an expected value of the gradient: a sum over the objects in the data set of the expected value of the gradient of the logarithm. And we can approximate this expected value by sampling. So we can sample one point, for example, from the variational distribution q of t_i, then put it inside the logarithm of p of x_i given t_i, and compute its gradient with respect to w.

So basically what we're doing here is passing our image through our first convolutional neural network to get the parameters of the variational distribution q of t_i. Then we sample one point from the variational distribution, and we feed this point as input to the second neural network with parameters w. Then we just compute the usual gradient of this second neural network with respect to its parameters, given that its input is this sample t_i hat. So this is just the usual gradient, and we can use TensorFlow to find it automatically.

And finally, this expression depends on the whole data set, but we can easily approximate it with a mini-batch. We can write it as some constant that normalizes things, times a sum over a mini-batch of random objects which we have chosen for this particular iteration. And this is the standard stochastic gradient for a neural network. So you don't have to think too much here: you just have to use TensorFlow to find the gradient of the second part of your neural network with respect to its parameters.

So the overall scheme here is as follows. We have our function, our objective. We pass our input image through the first convolutional neural network with parameters phi. We find the parameters m and s of the variational distribution q. We sample one point from this Gaussian with parameters m and s. We put this point t_i hat into the second convolutional neural network with parameters w, and we treat this t_i hat as input data, as a training object, for the second convolutional neural network. And then we compute the objective of this second CNN and just use TensorFlow to differentiate it with respect to the parameters.
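To make this concrete, here is a minimal TensorFlow sketch of the two pieces discussed so far: the analytically computed KL term and the single-sample estimate of the gradient with respect to w. The architectures, the Bernoulli likelihood, and names such as `encoder`, `decoder`, `grad_wrt_w`, and `latent_dim` are my own illustrative assumptions, not taken from the lecture, and only the gradient with respect to w is shown, since that is the part derived in this video.

```python
import tensorflow as tf

latent_dim = 8

encoder = tf.keras.Sequential([                      # first network, parameters phi
    tf.keras.layers.Conv2D(16, 3, strides=2, activation="relu",
                           input_shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2 * latent_dim),           # m and log(s), concatenated
])

decoder = tf.keras.Sequential([                      # second network, parameters w
    tf.keras.layers.Dense(128, activation="relu", input_shape=(latent_dim,)),
    tf.keras.layers.Dense(28 * 28, activation="sigmoid"),
    tf.keras.layers.Reshape((28, 28, 1)),
])


def kl_to_standard_normal(m, log_s):
    """Analytic KL(N(m, diag(s^2)) || N(0, I)) per object -- the term that
    needs no sampling at all."""
    return 0.5 * tf.reduce_sum(tf.exp(2.0 * log_s) + tf.square(m)
                               - 1.0 - 2.0 * log_s, axis=1)


def grad_wrt_w(x):
    """Single-sample Monte Carlo estimate of the gradient of
    sum_i E_q[log p(x_i | t_i, w)] with respect to the decoder weights w."""
    m, log_s = tf.split(encoder(x), 2, axis=1)
    # Sample one point t_i_hat from q(t_i | x_i, phi) for every object; it is
    # then treated as fixed input data for the second network.
    t_hat = m + tf.exp(log_s) * tf.random.normal(tf.shape(m))
    with tf.GradientTape() as tape:
        x_rec = decoder(t_hat)
        # log p(x | t_hat, w) under the assumed Bernoulli decoder.
        log_p = tf.reduce_sum(
            x * tf.math.log(x_rec + 1e-8)
            + (1.0 - x) * tf.math.log(1.0 - x_rec + 1e-8))
    return tape.gradient(log_p, decoder.trainable_variables)


# Usage on a random mini-batch of "images":
grads = grad_wrt_w(tf.random.uniform((32, 28, 28, 1)))
```

These gradients can then be passed to any standard optimizer, exactly as for an ordinary neural network.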
Note that here we always used unbiased estimates of the expected values: we always substitute expected values with sample averages, and not with more complicated expressions for which it would be unclear whether they are unbiased or not. So here everything is unbiased, and on average this stochastic approximation of the gradient will be correct. And if you do enough iterations, you will converge to some good point in your parameter space.
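As a tiny numerical illustration of this point (not from the lecture), a single-sample estimate of an expectation is noisy, but it is unbiased, so averaging many such one-sample estimates recovers the true expected value:

```python
import tensorflow as tf

# One-sample estimates of E_{t ~ N(0, 1)}[t^2]; each one is noisy, but their
# average converges to the true value 1, so the estimator is unbiased.
tf.random.set_seed(0)
one_sample_estimates = tf.square(tf.random.normal([100000]))
print(float(tf.reduce_mean(one_sample_estimates)))   # approximately 1.0
```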