To wrap up our discussion on mixture models, I want to discuss how we can think about applying this mixture model to our document clustering task. This is something else that you're going to explore in the assignment. And remember, our goal is to discover groups of related articles, but when introducing our mixture model, what we're aiming to do is account for uncertainty in the assignments of data points to these clusters, like the article that lies between the blue and orange clusters. For our document representation, we're going to use our standard tf-idf vector. So it's going to be a vector with the number of elements equal to the size of the vocabulary. Based on this document representation, we can think of embedding our documents in R^V, the real space with V dimensions, where V is the size of the vocabulary. In this picture here, we're just showing two dimensions, of course, so it's as if we had a vocabulary of only two words. Then, based on this representation, our goal is to use mixtures of Gaussians to discover clusters that might look something like the following. And again, our mixture of Gaussians is going to be able to provide soft assignments of data points to these clusters. In the next section, we're going to describe algorithmically how we provide these soft assignments.

Let's think a little bit more about applying just a vanilla mixture of Gaussians to this application, in particular to this high-dimensional document representation. In two dimensions, when we talked about the parameters of the Gaussian, we said there was a mean vector with two parameters that we're going to need to learn, which centers the distribution. But then there is the covariance matrix, and in two dimensions it has four entries: the variance terms, giving the spread along each of our two dimensions, and the off-diagonal terms, which capture the correlation structure. And we very quickly mentioned that it's a symmetric matrix, so it turns out the off-diagonal terms, sigma one-two and sigma two-one, are equal. So really there are three unique parameters that we need to specify for this Gaussian in two dimensions, or three parameters that we're going to need to learn from the data.
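As a rough companion to the representation just described, here is a minimal sketch of building tf-idf vectors and counting the parameters of a single full-covariance Gaussian. It assumes scikit-learn's TfidfVectorizer and a made-up toy corpus; neither is part of the lecture itself.

```python
# A minimal sketch of the tf-idf document representation described above.
# Assumes scikit-learn; the toy corpus is purely illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "soccer player scores late goal",
    "senate passes new budget bill",
    "team wins championship game",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # shape: (num_documents, V)

V = X.shape[1]                         # V = vocabulary size
print("Each document is a point in R^V with V =", V)

# For one full-covariance Gaussian in V dimensions (as discussed above):
#   mean vector: V parameters
#   symmetric covariance matrix: V * (V + 1) / 2 unique parameters
print("Unique covariance parameters per cluster:", V * (V + 1) // 2)
```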
But when we're looking at V different dimensions, and imagine V is really large, if we allowed for a full covariance matrix, we would have V(V+1)/2 unique parameters to learn. That's a lot, on the order of V squared parameters. The way we get V(V+1)/2 instead of V squared is from the fact that the matrix is symmetric, so it's not quite V squared parameters.

Okay, so instead of specifying these full covariance matrices, what we're going to do is assume a diagonal form for the covariance matrix. So there's going to be a variance term for each one of our dimensions, one to capital V, and all of the off-diagonal terms are going to be fixed to 0. We're just going to be learning the variances along the different dimensions. What this means is that, instead of having an arbitrarily oriented ellipse, we're going to constrain ourselves to axis-aligned ellipses.

This represents a restrictive assumption, but I want to mention a few things. The first is that we can actually learn the weights on the different dimensions, and these can be different between the different clusters. So, in one cluster, we might learn that one word is more important than another word in forming the score of an observation under that cluster, and that could be different in the other clusters.

So this represents a formulation that's still a lot more flexible than what we had in k-means, because remember, in k-means, the typical formulation would just assume spherically symmetric clusters. That's a special case of axis-aligned ellipses where the variance terms are all the same: they're all sigma squared along the diagonal. And we said we could also do a weighted version of Euclidean distance, but that would give us axis-aligned ellipses that are exactly the same between the different clusters. Furthermore, in the case of k-means, we had to specify those weights, whereas in the next section we're going to show that we can actually learn the weights of this mixture model, that is, the weights on these different dimensions, which are cluster specific.
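To make the diagonal-covariance restriction concrete, here is a hedged sketch that compares the parameter counts and fits a diagonal-covariance mixture with scikit-learn's GaussianMixture. The assignment may implement the algorithm directly, so treat this library call, the synthetic data, and the specific sizes as illustrative assumptions rather than the course's actual approach.

```python
# Sketch: parameter savings of a diagonal covariance, plus fitting a
# diagonal-covariance mixture with scikit-learn. Data and sizes are made up.
import numpy as np
from sklearn.mixture import GaussianMixture

V = 10000                                    # illustrative vocabulary size
full_per_cluster = V * (V + 1) // 2          # O(V^2): ~50 million parameters
diag_per_cluster = V                         # one variance per dimension
print(full_per_cluster, "vs", diag_per_cluster)

# Fit on small synthetic data standing in for dense tf-idf vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))

gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0)
gmm.fit(X)

# Each cluster has its own per-dimension variances (axis-aligned ellipses),
# and they can differ across clusters, unlike spherical k-means.
print(gmm.covariances_.shape)                # (3, 20): clusters x dimensions
```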
And in addition to learning the parameters of these different Gaussians from the data, we're also going to learn the soft assignments, so the probability that any observation is associated with any of the different clusters.
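As a preview of the soft assignments mentioned here, the sketch below computes responsibilities by hand for a tiny, hypothetical two-cluster, diagonal-covariance mixture. All of the weights, means, and variances are invented for illustration; the next section describes the algorithm that actually learns them from data.

```python
# Sketch: soft assignments (responsibilities) for a hypothetical 2-cluster,
# 2-dimensional diagonal-covariance mixture. All numbers are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

weights = np.array([0.6, 0.4])                   # mixture weights pi_k
means = np.array([[0.0, 0.0], [3.0, 3.0]])       # cluster means mu_k
variances = np.array([[1.0, 1.0], [0.5, 2.0]])   # per-dimension variances

x = np.array([1.5, 1.5])                         # a point between the clusters

# responsibility_k is proportional to pi_k * N(x | mu_k, diag(variances_k))
likelihoods = np.array([
    multivariate_normal.pdf(x, mean=means[k], cov=np.diag(variances[k]))
    for k in range(2)
])
responsibilities = weights * likelihoods
responsibilities /= responsibilities.sum()

print(responsibilities)   # probabilities over clusters for x; they sum to 1
```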