Welcome to week five of our course. This week we're going to talk about how to scale Bayesian methods to large data sets. Even ten years ago, people used to think that Bayesian methods were mostly suited for small data sets, for two reasons. First of all, they're computationally expensive: if you want to do full Bayesian inference on, say, one million training examples, you're going to run into a lot of trouble. And second of all, they may not be beneficial anyway in the large-data case, because people used to think that the main benefit of Bayesian methods is to regularize your model and to extract as much information as possible from a small data set; if you have a large data set, you don't need that, and you can use any method you want and it will work just fine.

But then things changed: Bayesian methods met deep learning, and people started to build models that use neural networks inside a probabilistic model. And this is what this week will be about: how to combine neural networks with Bayesian methods. We'll discuss how to combine these two ideas, and we'll see a particular example, the variational autoencoder, which allows you to generate nice samples, nice images, by using a neural network that has a probabilistic interpretation. And then, in the second module, Professor Dmitry Vetrov will tell you about scalable methods for Bayesian neural networks, and about his cutting-edge research in this area, which allowed him to compress neural networks by a lot and to fight severe overfitting on some complicated data sets.

To start with, let's discuss the concept of an estimate being unbiased. We already touched on that in the previous week, week four, on Markov chain Monte Carlo, but let's make ourselves a little more precise here; we'll need it to build unbiased estimates of gradients for some neural networks. So, say you want to estimate an expected value E_p(x)[f(x)]. If you're using Monte Carlo estimation, you substitute it with an average of f over samples x_1, ..., x_n taken from that distribution p(x).
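As a minimal sketch of this idea in code (my own toy example, not from the lecture, assuming f(x) = x^2 and x ~ N(0, 1), so the true expectation is 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sample, n):
    """One Monte Carlo estimate of E[f(x)]: average f over n samples."""
    return f(sample(n)).mean()

# Toy setup: f(x) = x^2 with x ~ N(0, 1), so E[f(x)] = Var(x) = 1.
f = lambda x: x ** 2
sample = rng.standard_normal

# Each call gives one sample of the random variable R (one "red cross").
estimates = [mc_estimate(f, sample, n=100) for _ in range(1000)]
print(np.mean(estimates))  # close to 1.0, since R is unbiased: E[R] = E[f(x)]
```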
On the slide, the idea looks like this. The blue line is your distribution p(x), and you can generate samples from it, like this. Then you take the average of f(x) over this set of samples, and you get something like the red cross here. This average is itself a random variable: if you repeat the process, if you generate another set of samples and again write down their average, you get some other approximation of the expected value. By repeating this process more and more times, you get samples of this random variable R. And this random variable has its own distribution, and its average, its expected value, exactly equals the expected value of f(x) which we wanted to estimate. So you can see that all these samples of the random variable R are close to the expected value we want to estimate; they lie around it. Which basically means that if we use more samples, like a hundred samples for each estimate, we will get more accurate estimates: the distribution of R becomes more and more peaked around the true value.

To put it formally, this is the definition of an unbiased estimate: an estimate R is called unbiased if its expected value equals the thing we want to approximate, E[R] = E_p(x)[f(x)]. If this holds, then the samples of R lie around the expected value we want to approximate.

But how can it fail to hold? Well, if you look, for example, at the logarithm of an expected value, log E_p(x)[f(x)], and try to approximate it with Monte Carlo, it's kind of natural to approximate it as the logarithm of the sample average. But it turns out that this is not an unbiased estimate. If you look at the samples here, all the samples of this random variable G will lie to the left of the actual value, log E_p(x)[f(x)]. So you're underestimating the true value you want to approximate, even on average.
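A quick way to see this bias numerically (again my own toy sketch, not from the lecture, reusing f(x) = x^2 with x ~ N(0, 1), so log E[f(x)] = log 1 = 0; the underestimation follows from Jensen's inequality, E[log R] <= log E[R]):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2       # same toy setup as above: E[f(x)] = 1
true_value = np.log(1.0)   # log E[f(x)] = 0

# G = log of the sample average. By Jensen's inequality E[log R] <= log E[R],
# so G underestimates log E[f(x)] on average, noticeably for small n.
n = 5
g = [np.log(f(rng.standard_normal(n)).mean()) for _ in range(100_000)]
print(np.mean(g), "vs true value", true_value)  # mean of G is clearly below 0
```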
So all these red crosses are not around the true value but around some smaller value, and thus you're not doing the right job: you're computing a biased estimate of the logarithm of the expected value. To summarize, an estimate is called unbiased if its expected value equals the thing you want to approximate. And it's entirely non-trivial to tell whether your estimator is unbiased or not. For the simplest case, an expected value of a function can be estimated without bias as an average over samples. For anything more complicated than that, you have to think carefully and check that you're not wandering into biased territory. And if you don't want to check, or if you can't do it, then you're better off reducing your particular problem to the form of a plain expected value of some function, and then estimating that with the sample average. This is the way to go to be sure that your estimate is unbiased.
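To illustrate that last piece of advice with one more toy sketch of my own (not from the lecture): a tail probability doesn't look like an expectation at first, but it can be rewritten as the expectation of an indicator function, and then the plain sample average is automatically unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: P(x > 2) for x ~ N(0, 1). Rewrite it as a plain expectation,
# P(x > 2) = E[indicator(x > 2)], so the sample average is unbiased.
x = rng.standard_normal(1_000_000)
print((x > 2).mean())  # close to 1 - Phi(2) ≈ 0.0228
```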