So, let's summarize what we've learned so far. We discussed Markov chain Monte Carlo methods. The Monte Carlo part is about approximating expected values by sampling, and the main question here is how to sample. How can we generate samples from a complicated distribution which we may know only up to a normalization constant, which is what usually happens in probabilistic modeling?

To do that, we discussed two main methods. Both of them come from the Markov chain Monte Carlo family, and the idea of Markov chain Monte Carlo is to build a dynamical system, called a Markov chain, such that if you simulate it long enough, its states start to look like samples from the desired distribution. We discussed two ways to build a Markov chain that converges to the distribution you want to sample from.

The first method is Gibbs sampling, which reduces the problem of sampling from a multidimensional distribution to a sequence of one-dimensional samplings. One-dimensional sampling is usually much easier: first of all, we have efficient all-purpose methods for it, like rejection sampling; second, sometimes you can even compute the normalization constant, because integrating things in one dimension is much easier; and finally, sometimes you can even find an analytical solution to the one-dimensional sampling problem, if your original multidimensional distribution is structured enough. But we also discussed a downside of Gibbs sampling: it generates highly correlated samples. This means that it converges slowly, and that you have to use lots and lots of samples to approximate your expected value accurately enough.

The second method we discussed is Metropolis-Hastings. It gives you a whole family of Markov chains, which all converge to the desired distribution, and it gives you more freedom: you can choose. But with this freedom comes great responsibility, and you have to choose wisely.
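To make the Metropolis-Hastings side of this concrete, here is a minimal sketch in Python. It is not code from the course, just an illustration under simple assumptions: a random-walk Gaussian proposal, and a target density that we only know up to its normalization constant (an unnormalized mixture of two Gaussians).

import numpy as np

def metropolis_hastings(log_p_tilde, x0, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis-Hastings for an unnormalized log-density."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        proposal = x + step_size * rng.normal()             # symmetric Gaussian proposal
        log_alpha = log_p_tilde(proposal) - log_p_tilde(x)  # acceptance ratio (log scale)
        if np.log(rng.uniform()) < log_alpha:               # accept with probability min(1, alpha)
            x = proposal
        samples.append(x)                                   # rejected steps repeat the current state
    return np.array(samples)

# Target known only up to a constant: an unnormalized mixture of two Gaussians at -2 and +2.
log_p_tilde = lambda x: np.logaddexp(-0.5 * (x - 2) ** 2, -0.5 * (x + 2) ** 2)

samples = metropolis_hastings(log_p_tilde, x0=0.0, n_steps=50_000)
print(samples[1000:].mean())   # discard burn-in, then approximate E[x] (close to 0 for this target)

The step_size here is exactly the kind of design choice that freedom refers to: too small and the chain crawls, too large and almost every proposal gets rejected.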
If you choose a Markov chain that is not well suited for your purpose, for example one that converges slowly or, again, produces highly correlated samples, then you will waste lots of resources and the method will not be very efficient.

So, let's put these Markov chain Monte Carlo methods in context and compare them to the variational inference we covered in the previous week. Again, Monte Carlo is about approximating expected values, substituting them with averages over samples obtained from your distribution. The main property here is that this approximation is unbiased: the longer you wait, the more accuracy you get, and you can make the error as small as you want if you are willing to wait long enough.

If you want to express this property a little more mathematically, you can write it down like this. This average, this approximation, is a random variable itself, right? If you repeat the process of obtaining samples and then averaging them, you will get a different approximation: different samples give different approximations. An estimate like this is called unbiased if its expected value is equal to the thing you want to approximate. So, on average, your approximation is correct, and if you wait long enough, it will converge to the right answer.

This is something that is not true for variational inference. In variational inference, we try to approximate the distribution we want to sample from, or want to work with, by a distribution from some simple family, like, for example, the family of factorized distributions. Of course, over iterations variational inference becomes better and better, because it finds a better and better approximation of your distribution within this class of factorized distributions. But you can't drive the error to zero, because you can never approximate your complicated distribution arbitrarily well with a factorized one.
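Here is a small numerical illustration of the unbiasedness property, again my own example rather than anything from the lecture: a plain Monte Carlo estimate of E[x^2] for x drawn from a standard normal, whose true value is 1. The error shrinks roughly like one over the square root of the number of samples.

import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0                       # E[x^2] for x ~ N(0, 1)

for n in [10, 1_000, 100_000]:
    x = rng.normal(size=n)             # i.i.d. samples; an MCMC chain would give correlated ones
    estimate = np.mean(x ** 2)         # Monte Carlo average: unbiased estimate of E[x^2]
    print(n, estimate, abs(estimate - true_value))

With MCMC the samples are correlated, so you typically need more of them for the same accuracy, but the same averaging idea applies.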
A typical picture describing this looks like the following: you take your problem and try to solve it with Markov chain Monte Carlo, and it works, giving you a solution that gets better and better with time, with the error approaching zero. Then you use variational inference, and it usually gives you a nice solution much faster, but then it stagnates: it can't get past some critical value of the error and will never get you to zero.

To put Markov chain Monte Carlo even more in context, we can recall the table we had at the end of week three about variational inference, the table of approximate methods for solving probabilistic modeling problems that we have covered so far.

First of all, there is full Bayesian inference, where you model all your parameters and latent variables as latent variables. You don't train anything in the usual sense: you're not trying to find the best values for the parameters, but instead you try to marginalize out everything you don't know. This theoretically gives the best results, but for most practical models full Bayesian inference isn't feasible, and you have to make some approximations.

One of them is mean-field variational inference, where you assume that your posterior distribution is factorized, and you try to approximate the posterior within this factorized family as well as you can.

Now we have just added one more method here: we can approximate full Bayesian inference by sampling from the posterior distribution, and use these samples to estimate the values you care about. So we can use Gibbs sampling or Metropolis-Hastings to sample a few points from the posterior distribution over all the latent variables, and then use them for the approximation; we'll see a small sketch of this idea in a moment.

Then, if you can't do that, or if you don't want to because the process is slow, you can use the expectation maximization algorithm, which basically tries to find the maximum likelihood estimate of the parameters theta. It doesn't treat theta as a latent variable; instead it treats theta as parameters and tries to find the best value of theta with an iterative scheme. But this expectation maximization algorithm is also sometimes intractable.
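Before moving on, here is the toy sketch of posterior sampling promised above. The target is an artificial one of my own choosing, a correlated bivariate Gaussian, so that each one-dimensional conditional is a Gaussian we can sample exactly, which is precisely the situation Gibbs sampling is designed for.

import numpy as np

def gibbs_bivariate_gaussian(rho, n_steps, seed=0):
    """Gibbs sampling for a zero-mean bivariate Gaussian with unit variances and correlation rho."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    samples = np.empty((n_steps, 2))
    cond_std = np.sqrt(1.0 - rho ** 2)
    for t in range(n_steps):
        x1 = rng.normal(rho * x2, cond_std)   # x1 given x2 is a one-dimensional Gaussian
        x2 = rng.normal(rho * x1, cond_std)   # x2 given x1 is a one-dimensional Gaussian
        samples[t] = (x1, x2)
    return samples

samples = gibbs_bivariate_gaussian(rho=0.95, n_steps=50_000)[1000:]  # drop burn-in
print(samples.mean(axis=0))          # estimated mean, close to (0, 0)
print(np.corrcoef(samples.T)[0, 1])  # estimated correlation, close to 0.95

Note that with rho close to one, consecutive samples are highly correlated and the chain mixes slowly, which is exactly the downside of Gibbs sampling mentioned at the start of this summary.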
Coming back to expectation maximization: sometimes it's very hard to do the E-step or the M-step, and in these cases you can again apply mean-field variational inference, this time to expectation maximization. Basically, on the E-step you approximate your posterior, your variational distribution q, with a factorized one. Or, again, you can use Markov chain Monte Carlo for EM: on the E-step, instead of working with the distribution q directly and computing the expected value with respect to it, you can draw samples from the posterior over the latent variables, use them to approximate that expected value, and then, on the M-step, maximize this approximation instead of the original expected value.

As a final summary, Markov chain Monte Carlo is a family of methods that lets you do training and inference in probabilistic models, and it's really easy to implement. It's also really easy to parallelize, at least in the sense that if you have 100 computers, you can run 100 independent chains, one on each computer, and then combine the samples obtained from all these servers. And finally, it gives you unbiased estimates, so in principle, if you wait long enough, you can get the error as low as you want. On the negative side, these Markov chain Monte Carlo methods are sometimes slower than the alternatives, but because they are so widely applicable, sometimes they are just the only choice.

In the next videos we will cover a few practical examples of how you can apply Markov chain Monte Carlo to your problems.
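Before those examples, here is a minimal sketch of the Monte Carlo E-step idea described above. Everything in it is my own toy setup, not from the lecture: the model is x_i ~ N(z_i, 1) with z_i ~ N(theta, 1), chosen so that the posterior over each z_i is a known Gaussian and the M-step has a closed form (the average of the sampled z values).

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=np.sqrt(2.0), size=500)  # data from the model with true theta = 3

theta = 0.0                                            # initial guess
for _ in range(20):
    # Monte Carlo E-step: sample z_i from the posterior p(z_i | x_i, theta) = N((x_i + theta) / 2, 1/2).
    # In a model without a closed-form posterior, this is where a Gibbs or
    # Metropolis-Hastings chain would be run instead.
    z = rng.normal((x + theta) / 2.0, np.sqrt(0.5), size=(100, x.size))
    # M-step: maximize the sampled approximation of the expected log-joint over theta;
    # for this Gaussian model the maximizer is simply the average of the sampled z values.
    theta = z.mean()

print(theta)   # approaches the maximum likelihood estimate, close to 3 here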