But, as an alternative, we can perform Bayesian inference using a totally different algorithm called Gibbs sampling. In Gibbs sampling, we're going to iteratively provide hard assignments, just like we did in k-means, but these hard assignments are going to be drawn randomly from a specific distribution, whereas, remember, in k-means we just solved an equation and got a single solution. But the point here is that as we look across these random samples of our model parameters and assignment variables that are being produced by Gibbs sampling, they capture the uncertainty in the values that these variables can take. And the reason we're presenting Gibbs sampling, in contrast to variational methods, is that the updates of our Gibbs sampling steps tend to be very intuitive and very straightforward to implement.

Okay, so let's look at a couple of samples from a Gibbs sampler to get a little intuition about what's going on here. Remember that in our LDA model, these are all the assignment variables and model parameters that we want to do inference over, and this is iteration 10,000 of a Gibbs sampler. The reason I'm looking so far out, at iteration 10,000, is that in very large models like LDA, it can take quite a few iterations before we get out of a potentially bad initialization and start getting some reasonable assignments. So let's imagine this is iteration 10,000, and these are the values that were randomly assigned to each of these model parameters and assignment variables. Then maybe this is sample 10,001, and this is sample 10,002, and if I just flip back and forth through these three samples a few times, we see how the values change. This is the randomness inherent in this algorithm, and so we can talk about what we expect to happen as we keep sampling over iterations.

Well, one thing we can look at here is something called the joint model probability. This is like the likelihood that we monitored before, but in a Bayesian model, not only do we have to account for the likelihood of our data given a specific set of model parameters, we also have to account for the probability of those model parameters themselves. And together, these two terms form the joint model probability.
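To make the contrast with k-means concrete, here is a minimal sketch of one Gibbs sweep, plus the corresponding log joint probability, for a toy Gaussian mixture with known means `mu` and mixing weights `pi`. This is not the lecture's actual LDA sampler, and all of the names here are illustrative assumptions. The key point is that each hard assignment is drawn at random from its conditional distribution rather than set to the single best cluster, and the joint probability adds a prior term to the data likelihood.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def gibbs_sweep(x, z, mu, pi, sigma=1.0):
    """One pass that re-samples each hard assignment z[i] at random."""
    for i in range(len(x)):
        # Conditional probability of each cluster for point i (up to a constant)
        log_p = np.log(pi) + norm.logpdf(x[i], loc=mu, scale=sigma)
        p = np.exp(log_p - log_p.max())
        p /= p.sum()
        # Unlike k-means' argmax, draw the assignment from this distribution
        z[i] = rng.choice(len(mu), p=p)
    return z

def log_joint(x, z, mu, pi, sigma=1.0):
    """Joint model probability = log-likelihood of the data + log-prior of the assignments."""
    return (norm.logpdf(x, loc=mu[z], scale=sigma).sum()  # likelihood term
            + np.log(pi[z]).sum())                        # prior term
```

Running `gibbs_sweep` repeatedly and recording `log_joint` after each sweep would give exactly the kind of trace over iterations discussed next.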
And if we look at the joint model probability over Gibbs sampling iterations, we tend to get plots that look like the following, where there's an initial burn-in period in which we're just moving from our initialization to some reasonable part of the space, and then we wander around. Throughout this whole process, the joint model probability can go up and down, and the reason is that this is a randomized algorithm. There's no guarantee that we're going to increase our joint probability; this is not an optimization algorithm. And that's actually the point: we're intending to explore this space of possible solutions, rather than converge to just a single point. This is very similar to what we saw for stochastic gradient descent, or ascent, where there was randomness too, because we were computing a noisy estimate of the gradient, and we showed that our objective function wasn't going to simply monotonically increase, or at least not decrease; it might actually go up and down.

But the important thing here is that eventually, meaning after enough iterations, the random samples that are produced by our Gibbs sampling algorithm allow us to form what are called correct Bayesian estimates. To provide an example of what this means, if we're thinking of a predictive task, we said that Bayesians average over uncertainty in the parameters. Well, what we can do from the output of our Gibbs sampler is look at a given sample, with its hard assignments of the model parameters and variables, and form our prediction from that set of assignments. Then we can look at our next sample and form our prediction, and our next sample and form our prediction. Then we can average over these predictions across the different samples, and naturally, because of how the Gibbs sampler is exploring the uncertainty, spending more time in high-probability regions and less time in low-probability regions, this forms a correct prediction, a correct Bayesian average over the uncertainty in our parameters.
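As a rough sketch of that averaging step, assuming we have kept the post-burn-in samples and have some per-sample prediction function (all names here, such as `predict_fn`, are illustrative rather than from the course):

```python
import numpy as np

def bayesian_average_prediction(samples, predict_fn, x_new):
    """Average per-sample predictions to approximate the posterior predictive.

    `samples` is a list of (parameters, assignments) tuples kept after burn-in,
    and `predict_fn` maps one such sample plus a new input to a prediction
    (e.g., a vector of topic or class probabilities).
    """
    preds = [predict_fn(params, assignments, x_new)
             for params, assignments in samples]
    # Because the sampler visits high-probability configurations more often,
    # a plain average over the retained samples weights them appropriately.
    return np.mean(preds, axis=0)
```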
Another thing we can think about doing is grabbing out what I'll call a good solution. What I mean by a good solution is the set of model parameters and assignment variables that, out of the space of everything we've explored, maximizes this joint model probability. This is an estimate of what we call the maximum a posteriori (MAP) parameter estimate. So if we monitor the joint probability over iterations, we just grab out the hard assignments made at the iteration that maximizes the joint probability.
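A minimal sketch of grabbing out that good solution, assuming we stored each iteration's hard assignments together with its log joint probability (again, the names are illustrative):

```python
import numpy as np

def approximate_map_sample(samples, log_joint_values):
    """Return the stored sample whose joint model probability was highest.

    `samples` holds the hard assignments from each Gibbs iteration and
    `log_joint_values` the corresponding log joint probabilities; taking
    the argmax gives an approximate maximum a posteriori (MAP) estimate.
    """
    best_iteration = int(np.argmax(log_joint_values))
    return samples[best_iteration]
```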