Let's now describe a specific algorithm we can use for inference in our LDA model. The algorithm we're going to cover is called Gibbs sampling.

Recall that for the clustering task, we focused on two different algorithms. One was k-means, which iterated between steps where we make hard assignments of data points to clusters. And we said that we can view k-means as a coordinate descent algorithm, and as such, we can view it as optimizing a given objective function.

We then turned to EM for mixtures of Gaussians and showed that we iterate between an E-step and an M-step, and again, that this could be viewed as a coordinate descent algorithm. So in this case we're also optimizing an objective function, but here we're making soft assignments of data points to clusters.

Well, before we get to LDA, a natural question is what we can do for the alternative clustering model we presented, where we used a bag-of-words representation of our document instead of a tf-idf vector, and then used this alternative clustering model instead of a mixture of Gaussians.

Well, in this case, we can derive an EM algorithm just like we did for mixtures of Gaussians. But here, instead of having a Gaussian likelihood of the tf-idf vector, we have what's called a multinomial likelihood of our observed word counts in the document. So specifically, we can look at the document and say, well, maybe there are m_w counts of word w in this document. So we can think of this as m_w successes of word w, where the probability of a success is given by the topic-specific probability of that word. So we can use this to compute the likelihood of all of our word counts in this document under a given topic. And so we can think of this alternative clustering model as a mixture of multinomial distributions instead of a mixture of Gaussians, and in this case, we can derive an EM algorithm just like we did for a mixture of Gaussians.

But what can we do for our LDA model? Well, here we could derive an EM-style algorithm to perform maximum likelihood estimation of our model parameters, but this maximum likelihood estimation suffers from the same overfitting problem that we saw in the mixture of Gaussians case, where we talked about issues in high-dimensional spaces.
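As a rough sketch of the multinomial likelihood just described (this is an illustration, not code from the course, and the function and variable names are made up), the Python snippet below computes the log-likelihood of one document's word counts under a single topic's word-probability vector.

import numpy as np

def multinomial_log_likelihood(word_counts, topic_word_probs):
    # Log-likelihood of the observed word counts under one topic,
    # dropping the multinomial coefficient, which is the same for
    # every topic and so cancels when comparing topics.
    counts = np.asarray(word_counts, dtype=float)
    probs = np.asarray(topic_word_probs, dtype=float)
    # Each word w contributes m_w * log(p_w): m_w "successes" of word w,
    # each with the topic-specific success probability p_w.
    return float(np.sum(counts * np.log(probs)))

# Toy example: a 4-word vocabulary, one document's counts, one topic's probabilities.
counts = [3, 0, 5, 1]
topic_probs = [0.4, 0.1, 0.45, 0.05]
print(multinomial_log_likelihood(counts, topic_probs))

In a mixture of multinomials, an E-step would evaluate this quantity for every topic, weight it by each topic's mixture proportion, and normalize to get the soft assignments, just as in the Gaussian case.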
But the overfitting challenge is even harder here; the issue tends to be even more severe in LDA than in those simpler mixtures, just because of the sheer number of parameters that we have to learn.

So instead, LDA is actually typically specified as a Bayesian model. And when it's not, it's not actually called LDA; it's called probabilistic latent semantic analysis, or probabilistic latent semantic indexing.

But we've been talking about everything in the context of LDA, which has kind of swept under the rug some of the details of this Bayesian specification. It's a bit too advanced for the purposes of this course, but we're going to go through some of the intuition, some of what we get out of this Bayesian approach, as well as a Bayesian inference algorithm.

So, one thing a Bayesian approach does is it accounts for uncertainty in the parameters when we're making predictions. So, instead of just doing something like computing the maximum likelihood estimate of our parameters, fixing the value of those parameters, and then making predictions, we say, well, the parameter could have been this, or it could have been that, or it could have been that. And we think about averaging over that uncertainty when we're forming our predictions. So you can think of this as forming a prediction for each one of these values, and computing a weighted average of those predictions based on how much we think the parameter is each one of these values.

And the other thing that a Bayesian approach does is that it naturally regularizes our parameter estimates. So, if we think about maximizing our objective, instead of computing what's called a maximum likelihood estimate, you end up with something called a maximum a posteriori estimate, and this is exactly akin to ideas we've talked about many times in this specialization of regularizing the maximum likelihood estimate.

Okay, well, in the context of this Bayesian specification, the EM algorithm is no longer tractable, because the E-step in particular has an intractable expectation. So instead, what you can do is something that's called variational EM, where you introduce an approximation to handle that really complicated expectation.
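To make the regularization point above concrete, here is a minimal Python sketch (again an illustration under my own assumptions, not code from the course): it contrasts the maximum likelihood estimate of a topic's word probabilities with the maximum a posteriori estimate under a symmetric Dirichlet prior, which behaves like adding pseudo-counts and so smooths, or regularizes, the estimate.

import numpy as np

def mle_word_probs(word_counts):
    # Maximum likelihood estimate: raw relative frequencies,
    # which assign zero probability to any unseen word.
    counts = np.asarray(word_counts, dtype=float)
    return counts / counts.sum()

def map_word_probs(word_counts, alpha=2.0):
    # Maximum a posteriori estimate under a symmetric Dirichlet(alpha)
    # prior (alpha >= 1): equivalent to adding (alpha - 1) pseudo-counts
    # to every word before normalizing, i.e., a regularized MLE.
    counts = np.asarray(word_counts, dtype=float)
    smoothed = counts + (alpha - 1.0)
    return smoothed / smoothed.sum()

counts = [10, 0, 2, 0]         # two words never appear in the data
print(mle_word_probs(counts))  # unseen words get probability exactly 0
print(map_word_probs(counts))  # every word keeps some probability mass

A fully Bayesian treatment goes further than this single MAP point estimate: it keeps the whole posterior over the parameters and averages predictions over it, which is exactly where the intractable expectations come from and why we turn to approximations like variational EM or, as we'll see next, Gibbs sampling.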