Let's now describe a specific algorithm we can use for inference in our LDA model. The algorithm we're going to cover is called Gibbs sampling.

Recall that for the clustering task, we focused on two different algorithms. One was k-means, which iterated between steps where we make hard assignments of data points to clusters. And we said that we can view k-means as a coordinate descent algorithm, and as such, we can view it as optimizing a given objective function.

We then turned to EM for mixtures of Gaussians and showed that we iterate between an E-step and an M-step, and again, that this could be viewed as a coordinate descent algorithm. So in this case we're also optimizing an objective function, but here we're making soft assignments of data points to clusters.

Well, before we get to LDA, a natural question is what we can do for the alternative clustering model we presented, where we used a bag-of-words representation of our document instead of a tf-idf vector, and then used this alternative clustering model instead of a mixture of Gaussians.

Well, in this case, we can derive an EM algorithm just like we did for mixtures of Gaussians. But here, instead of having a Gaussian likelihood of the tf-idf vector, we have what's called a multinomial likelihood of our observed word counts in the document. So specifically, we can look at the document and say, well, maybe there are m_w counts of word w in this document. So we can think of this as m_w successes of word w, where the probability of a success is given by the topic-specific probability of that word. So we can use this to compute the likelihood of all of our word counts in this document under a given topic. And so we can think of this alternative clustering model as a mixture of multinomial distributions instead of a mixture of Gaussians, and in this case, we can derive an EM algorithm just like we did for a mixture of Gaussians.

But what can we do for our LDA model? Well, here we could derive an EM-style algorithm to perform maximum likelihood estimation of our model parameters, but this maximum likelihood estimation suffers from the same overfitting problem that we saw in the mixture of Gaussians case, where we talked about issues in high-dimensional spaces.
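As a rough sketch of the multinomial likelihood just described (this is an illustration, not code from the course, and the function and variable names are made up), the Python snippet below computes the log-likelihood of one document's word counts under a single topic's word-probability vector.

import numpy as np

def multinomial_log_likelihood(word_counts, topic_word_probs):
    # Log-likelihood of the observed word counts under one topic,
    # dropping the multinomial coefficient, which is the same for
    # every topic and so cancels when comparing topics.
    counts = np.asarray(word_counts, dtype=float)
    probs = np.asarray(topic_word_probs, dtype=float)
    # Each word w contributes m_w * log(p_w): m_w "successes" of word w,
    # each with the topic-specific success probability p_w.
    return float(np.sum(counts * np.log(probs)))

# Toy example: a 4-word vocabulary, one document's counts, one topic's probabilities.
counts = [3, 0, 5, 1]
topic_probs = [0.4, 0.1, 0.45, 0.05]
print(multinomial_log_likelihood(counts, topic_probs))

In a mixture of multinomials, an E-step would evaluate this quantity for every topic, weight it by each topic's mixture proportion, and normalize to get the soft assignments, just as in the Gaussian case.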
But the overfitting challenge is even harder here; the issue tends to be even more severe in LDA than in those simpler mixtures, just because of the sheer number of parameters that we have to learn.

So instead, LDA is actually typically specified as a Bayesian model. And when it's not, it's not actually called LDA; it's called probabilistic latent semantic analysis, or probabilistic latent semantic indexing.

But we've been talking about everything in the context of LDA, which has kind of swept under the rug some of the details of this Bayesian specification. It's a bit too advanced for the purposes of this course, but we're going to go through some of the intuition, some of what we get out of this Bayesian approach, as well as a Bayesian inference algorithm.

So, one thing a Bayesian approach does is it accounts for uncertainty in the parameters when we're making predictions. So, instead of just doing something like computing the maximum likelihood estimate of our parameters, fixing the value of those parameters, and then making predictions, we say, well, the parameter could have been this, or it could have been that, or it could have been that. And we think about averaging over that uncertainty when we're forming our predictions. So you can think of this as forming a prediction for each one of these values, and computing a weighted average of those predictions based on how much we think the parameter is each one of these values.

And the other thing that a Bayesian approach does is that it naturally regularizes our parameter estimates. So, if we think about maximizing our objective, instead of computing what's called a maximum likelihood estimate, you end up with something called a maximum a posteriori estimate, and this is exactly akin to ideas we've talked about many times in this specialization of regularizing the maximum likelihood estimate.

Okay, well, in the context of this Bayesian specification, the EM algorithm is no longer tractable, because the E-step in particular has an intractable expectation. So instead, what you can do is something that's called variational EM, where you introduce an approximation to handle that really complicated expectation.
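To make the regularization point above concrete, here is a minimal Python sketch (again an illustration under my own assumptions, not code from the course): it contrasts the maximum likelihood estimate of a topic's word probabilities with the maximum a posteriori estimate under a symmetric Dirichlet prior, which behaves like adding pseudo-counts and so smooths, or regularizes, the estimate.

import numpy as np

def mle_word_probs(word_counts):
    # Maximum likelihood estimate: raw relative frequencies,
    # which assign zero probability to any unseen word.
    counts = np.asarray(word_counts, dtype=float)
    return counts / counts.sum()

def map_word_probs(word_counts, alpha=2.0):
    # Maximum a posteriori estimate under a symmetric Dirichlet(alpha)
    # prior (alpha >= 1): equivalent to adding (alpha - 1) pseudo-counts
    # to every word before normalizing, i.e., a regularized MLE.
    counts = np.asarray(word_counts, dtype=float)
    smoothed = counts + (alpha - 1.0)
    return smoothed / smoothed.sum()

counts = [10, 0, 2, 0]         # two words never appear in the data
print(mle_word_probs(counts))  # unseen words get probability exactly 0
print(map_word_probs(counts))  # every word keeps some probability mass

A fully Bayesian treatment goes further than this single MAP point estimate: it keeps the whole posterior over the parameters and averages predictions over it, which is exactly where the intractable expectations come from and why we turn to approximations like variational EM or, as we'll see next, Gibbs sampling.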