Let's now describe the algorithm that produces these random samples, and to begin with, let's just present a standard implementation of Gibbs sampling.

Gibbs sampling treats both our assignment variables and our model parameters in exactly the same manner. Whereas, remember, when we were looking at the EM algorithm, there were different updates for the assignment variables than for the model parameters. What Gibbs sampling does, in its most standard implementation, is simply cycle through all of these assignment variables and model parameters and randomly sample each one from a conditional distribution, where we condition on the previously sampled values of all the other model parameters and assignment variables, and we also condition on our observations. So from iteration to iteration, the values of the assignment variables and model parameters we condition on are going to change, but the observations are always the same set of values: whatever we observed in our data set.

Well, let's look at this in pictures for our LDA model. Let's imagine that at some iteration of our Gibbs sampler, we have the following instantiation of all of our assignment variables and model parameters, which in this case are the topic vocabulary distributions, the document-specific assignments of words to topics, and the document-specific topic proportions.

At the next iteration of our Gibbs sampler, we might consider reassigning all of the assignment variables of words in one document to topics. So these are the z_iw's for some document i, and when we go to resample these variables, we form a conditional distribution in which we fix the values of all of the topic vocabulary distributions as well as the topic proportions in this document.
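In symbols (writing π_i for document i's topic proportions, φ_k for topic k's vocabulary distribution, x for the observed words, and α, γ for the prior hyperparameters; this shorthand is ours and is not spelled out in the lecture), one full sweep of this standard Gibbs sampler, including the parameter updates described later in this section, draws in turn

$$
z_{iw} \sim p(z_{iw} \mid \pi_i, \phi, x_{iw}), \qquad
\pi_i \sim p(\pi_i \mid z_i, \alpha), \qquad
\phi_k \sim p(\phi_k \mid z, x, \gamma),
$$

cycling over every word w in every document i, then every document's proportions, then every topic k, always conditioning on the most recently sampled values of everything else.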
So a question is: what's the form of this conditional distribution? Well, let's look at just one word in this document, say "EEG", and let's assume that this is word w, the wth word in document i. Now let's look at r_iw2, which will be our notation, just like when we were talking about responsibilities in the EM algorithm. This is going to indicate the probability of assigning z_iw = 2, meaning assigning word w in document i to the second topic.

And what is this probability? Well, what's the prior probability that I randomly choose a word in this document and it happens to be from topic two, before I actually look at the value of that word? That's just how prevalent topic two is in this document. So here we have π_i2 being the prior probability that z_iw = 2. And then we multiply by the likelihood of observing the word "EEG" under the second topic. So this will be the probability of "EEG", the actual value of this wth word in the document, given z_iw = 2, that is, given this probability vector right here, which we're going to grab out right here. And what we do to compute this likelihood is simply look at topic two, scroll all the way down until we find the word "EEG", and then read off the probability of that word within this topic; here that probability is probably pretty low.

Then the final thing we do to compute this probability is normalize over all possible assignments. So, summing over all j = 1 to capital K, the total number of topics, we look at π_ij times the probability of "EEG" under an assignment of the wth word in document i to cluster, or rather topic, j.

Okay, so this looks exactly like the responsibilities we saw in EM, except that here, when we look at the prior probability of a given assignment, we look specifically within this document, because we have these document-specific topic prevalences. And we now score our word under a given topic probability vector, whereas for mixtures of Gaussians we were scoring a whole data vector under a given Gaussian distribution.

Okay, so the idea is that, for this specific word, we would compute the responsibility for every possible topic, not just topic two: topic one, two, three, all the way to topic capital K. And then we normalize and look at these normalized numbers, so that this whole vector sums to one. So this represents what's called a probability mass function.
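Written out with the notation above, the conditional probability we just described in words is

$$
r_{iw2} \;=\; \frac{\pi_{i2}\, P(x_{iw} = \text{``EEG''} \mid \phi_2)}{\sum_{j=1}^{K} \pi_{ij}\, P(x_{iw} = \text{``EEG''} \mid \phi_j)},
$$

and likewise r_iwk for every topic k = 1, ..., K; the vector (r_iw1, ..., r_iwK) is exactly the probability mass function just mentioned.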
So it's a distribution over a set of integers, one to capital K, and we just draw a value randomly from this distribution. That value is going to be our assignment of the word "EEG" to a given topic. So perhaps, for example, for this word "EEG", we would assign it to topic one, since that's the topic about science; but of course, through this random set of assignments we're making here, we could have drawn any topic to assign to this specific word.

Okay, we repeat this procedure for every word in this document, and then we can think about resampling the topic proportions for this document, given the set of word assignments that we've just made. So what informs these topic proportions for this document? Do we care about the topic vocabulary distributions? No, we actually don't. All we need are the counts of how many times each topic was used in this document.

But these counts are going to be regularized by our Bayesian prior, because, remember, we discussed before that we can think of the Bayesian prior as introducing a set of what we called pseudo-observations. So we can think of every topic as having a fixed number of observations already in that topic, which bias the distribution away from relying only on the observed counts in this document. So we use both the observed counts in this document, given the sampled set of word assignment variables, and the parameters of our Bayesian prior to form a distribution over these topic proportions, and then we sample the topic proportions from that distribution. The specific form of this distribution, however, is beyond the scope of this course.

The point here is that we can sample these topic proportions, and then we repeat this process for every document in our corpus. So we repeat sampling the word assignment variables and the topic proportions for each document in our entire data set.
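To make these two per-document steps concrete, here is a minimal numpy sketch. It assumes a standard conjugate Dirichlet prior on the topic proportions, whose hyperparameter `alpha` plays the role of the pseudo-observations just described (the exact form of that update is the part the lecture leaves out of scope); the function names and all numbers are illustrative, not from the lecture.

```python
import numpy as np

def resample_word_assignments(doc, pi_i, phi, rng):
    """Draw a new topic assignment for every word in one document.

    doc  : 1-D int array of word ids for document i
    pi_i : length-K vector of this document's topic proportions
    phi  : (K, V) matrix of topic vocabulary distributions
    """
    K = phi.shape[0]
    z_i = np.empty(len(doc), dtype=int)
    for w, word_id in enumerate(doc):
        resp = pi_i * phi[:, word_id]      # prior prevalence times word likelihood
        resp /= resp.sum()                 # normalize: a PMF over the K topics
        z_i[w] = rng.choice(K, p=resp)     # draw the assignment at random
    return z_i

def resample_topic_proportions(z_i, K, alpha, rng):
    """Draw new topic proportions for one document from its topic-usage counts
    plus the prior's pseudo-counts (assuming a conjugate Dirichlet prior)."""
    counts = np.bincount(z_i, minlength=K)
    return rng.dirichlet(alpha + counts)

# Toy usage (all numbers hypothetical): 3 topics, a 10-word vocabulary.
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(10), size=3)
pi_i = np.array([0.5, 0.3, 0.2])
doc = np.array([0, 4, 4, 7, 2])
z_i = resample_word_assignments(doc, pi_i, phi, rng)
pi_i = resample_topic_proportions(z_i, K=3, alpha=np.full(3, 0.1), rng=rng)
```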
Then, having done this, we can turn to the corpus-wide topic vocabulary distributions and reassign those as well. And now, when we go to figure out how probable our words are within a given topic, what informs that? Well, we can simply look at our word assignment variables across the entire corpus. For example, for the word "EEG", we can count how many times the word "EEG" was assigned to topic one, and we can use that information to inform us of how probable "EEG" is under topic one. And we can do this for every word in our vocabulary, for each one of these different topics. But again, these counts of topic usage within the corpus are regularized by the priors placed over the topic vocabulary distributions in our Bayesian framework.

Okay, so in summary, we randomly resample our topic vocabulary distributions, and then the Gibbs sampling algorithm repeats these steps again, and again, and again, and again: resampling our word assignment variables, our document-specific topic proportions, and our corpus-wide topic vocabulary distributions, over and over, until we run out of our computational budget.
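The corpus-wide step can be sketched the same way: tally, for every topic, how many times each vocabulary word was assigned to it across all documents, add the prior's pseudo-counts, and draw each topic's vocabulary distribution. Again this assumes a standard conjugate Dirichlet prior, with an illustrative hyperparameter `gamma` that is not named in the lecture.

```python
import numpy as np

def resample_topic_vocab(docs, z, K, V, gamma, rng):
    """Draw new topic vocabulary distributions for the whole corpus.

    docs : list of 1-D int arrays of word ids, one array per document
    z    : list of 1-D int arrays of topic assignments, matching docs
    """
    counts = np.zeros((K, V))
    for doc, z_i in zip(docs, z):
        # counts[k, v] accumulates how many times word v was assigned to topic k
        np.add.at(counts, (z_i, doc), 1)
    # One Dirichlet draw per topic: observed counts plus pseudo-counts.
    return np.vstack([rng.dirichlet(gamma + counts[k]) for k in range(K)])
```

Here `gamma` can be a scalar (broadcast across the vocabulary) or a length-V vector of pseudo-counts, biasing the drawn distributions away from relying only on the raw corpus counts.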