Let's now spend some time getting into the nitty-gritty details of EM, and also provide a comparison between EM for mixtures of Gaussians and k-means.

Well, to begin with, we talked about the converged solution of EM in the series of pie-chart plots that we just showed. The question is, do we know that the EM algorithm is going to converge? Well, you can actually recast the E and M steps of the EM algorithm as alternating maximization of an objective, and from this we see that EM is actually a coordinate-ascent algorithm. And under mild conditions, you can prove that EM indeed converges to a local mode of the objective.

And in your assignment, you're going to monitor convergence of the EM algorithm by computing the log likelihood of the data given the current estimates of our model parameters and responsibilities.

Well, based on the nonconvex landscape and convergence to a local mode, how we think of initializing the EM algorithm can be really critical to the quality of the local mode that we find, as well as to the convergence rate of our algorithm. And there are many ways we can think about initializing. One is we can think about just randomly selecting some data points as our initial cluster centers, doing hard assignments of observations to the nearest cluster center, and then using those assignments to form our initial parameter estimates; our EM algorithm would then iterate from there. Or, another alternative would be to pick the cluster centers sequentially, like how we described in our k-means++ algorithm, instead of just randomly choosing K different data points all at once. Or, we could think about actually running k-means and using the final solution of that as our cluster centers for this initialization scheme. And there are also other approaches. For example, in speech applications, it's really popular to gradually grow the mixture model, by splitting and sometimes merging clusters, until you have K clusters formed.
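Before moving on, here is a rough sketch of what monitoring the log likelihood for convergence might look like in code. This is a hand-rolled NumPy illustration, not the assignment's implementation; `e_step`, `m_step`, `tol`, and `max_iter` are hypothetical placeholders for pieces you would supply yourself.

```python
import numpy as np

def log_likelihood(X, weights, means, covs):
    """Mixture log likelihood: sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k)."""
    n, d = X.shape
    log_probs = np.zeros((n, len(weights)))
    for k, (pi_k, mu_k, cov_k) in enumerate(zip(weights, means, covs)):
        diff = X - mu_k
        _, logdet = np.linalg.slogdet(cov_k)
        quad = np.sum(diff @ np.linalg.inv(cov_k) * diff, axis=1)
        log_probs[:, k] = np.log(pi_k) - 0.5 * (d * np.log(2 * np.pi) + logdet + quad)
    # log-sum-exp over components for each data point, then sum over data points
    m = log_probs.max(axis=1, keepdims=True)
    return np.sum(m.squeeze() + np.log(np.sum(np.exp(log_probs - m), axis=1)))

# Convergence check inside the EM loop (e_step/m_step are placeholders):
# ll_old = -np.inf
# for it in range(max_iter):
#     resp = e_step(X, weights, means, covs)
#     weights, means, covs = m_step(X, resp)
#     ll_new = log_likelihood(X, weights, means, covs)
#     if ll_new - ll_old < tol:   # EM increases this quantity monotonically
#         break
#     ll_old = ll_new
```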
Having discussed convergence and initialization, let's now spend a little bit of time discussing a potential pitfall of the EM algorithm that we've described. And this all stems from the fact that maximum likelihood estimation can overfit the data.

So as an example of this, let's imagine that we just have two clusters: the green cluster and the blue cluster. And let's assume that at some iteration of our algorithm, only one observation is assigned to the green cluster and all the other observations are assigned to the blue cluster. When I'm speaking now, I'm talking in terms of hard assignments, but the same thing that I'm going to talk about holds if we're using soft assignments instead.

For this single observation in the green cluster, what's the thing that's going to maximize the likelihood of the data in that cluster? Well, for the green cluster, we're simply going to set the mean of the cluster exactly equal to that data point, and then we're going to shrink the variance all the way to zero, right around that data point. And for the blue cluster in this example that I've shown here, we're going to shift its mean over and increase its variance to capture all the data points that were shown.

Okay, so this doesn't seem like a great solution here, right? Because we have this one data point with this tiny, tiny, infinitesimally small cluster tied right to that data point's value, and then these two sets of observations, which to me looked pretty separable, all lumped together in one much more diffuse cluster. But the likelihood of this assignment is actually infinite, all driven by the fact that the likelihood of the green observation, under a Gaussian that's absolutely certain that the only value it can take is the value that was observed, is infinite. And so that's going to completely dominate the overall likelihood of this set of assignments, and we're going to see that this is actually the preferred solution.

Luckily, in practice you almost never get into assignments in this very simplistic case of having just that one green observation in that cluster, but this is something that really can happen, and you have to think carefully about it, especially when you're thinking about ways to initialize your algorithm.
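To see numerically how that single-point cluster drives the likelihood off to infinity, here is a small illustrative snippet in one dimension. The observation value and the variance grid are made up purely for illustration.

```python
import numpy as np

x_green = 2.7    # the lone observation in the green cluster (made-up value)
mu = x_green     # the maximum likelihood mean is the point itself

for sigma in [1.0, 0.1, 0.01, 1e-4, 1e-8]:
    # Gaussian log density of the point under its own degenerate cluster
    log_lik = -0.5 * np.log(2 * np.pi * sigma**2) - (x_green - mu)**2 / (2 * sigma**2)
    print(f"sigma = {sigma:>8}: log likelihood = {log_lik:.2f}")

# As sigma -> 0 the log likelihood grows without bound, so plain maximum
# likelihood "prefers" this degenerate, zero-variance cluster.
```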
But the problem can actually be much, much worse in high dimensions. To think about this, let's go back to our document clustering example. So here we're going to assume that we have some large vocabulary, and every document is represented by maybe just a word-count vector, or a tf-idf vector, over that vocabulary.

And let's imagine that for a given cluster, only one document in that cluster, again, I'm talking in terms of hard assignments, but things generalize, only one document in that cluster has the word w in it, or, more generally, all of the documents in that cluster happen to have the same count of a specific word w. Then what's the maximum likelihood estimate of the parameters in that cluster for that word w? Just looking at that dimension, well, we simply set the mean equal to that observed count, either from that one document or from all these documents that have the same count, and then we shrink the variance to zero about that value.

So this isn't actually that contrived of an example, because if we're in a very high-dimensional space, so a very large vocabulary, we really could imagine cases where, especially if we have enough clusters to partition up our data, all the documents in a given cluster might not have a word at all, or there might be just one document that has the word.

So this kind of setup really, really can happen in high dimensions, and what ends up happening as a result is that, in later iterations, we go to compute the likelihood of some other document that has word w with some other count than what this cluster has specified. Then it's going to have zero likelihood, or zero responsibility, under this given cluster, because this cluster says: nope, you can only have this value that I've observed in my documents so far. And so what we're going to see is that our responsibilities are going to shoot to either zero, if a document doesn't agree on just this one dimension, this one vocabulary word, or one, if we're in that cluster and we have agreement on that dimension.

And I want to emphasize that the way you really end up seeing this in practice is as an underflow issue. You're going to get covariance matrices whose inverses you can't compute, even when you're only shrinking the variances very, very small and not setting them exactly equal to zero. So this can be just a practical issue where your code is going to break if you don't correct for this potential issue.
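Here is a small sketch of how this can show up numerically, assuming a diagonal covariance over word counts. All the numbers, including the collapsed variance, are made up for illustration.

```python
import numpy as np

# Cluster parameters fit from documents that all had count 3 for word w
mu = np.array([3.0, 1.0, 4.0])       # means over three vocabulary words
var = np.array([1e-300, 0.5, 2.0])   # variance for word w has collapsed toward zero

x_new = np.array([2.0, 1.0, 4.0])    # a new document with a different count for word w

# Diagonal-Gaussian log density: the near-zero-variance dimension dominates
log_lik = np.sum(-0.5 * np.log(2 * np.pi * var) - (x_new - mu)**2 / (2 * var))
print(log_lik)          # hugely negative
print(np.exp(log_lik))  # underflows to 0.0: this cluster takes zero responsibility

# With a full covariance, the same collapse makes the matrix numerically singular:
cov = np.diag(var)
print(np.linalg.cond(cov))  # enormous condition number; inverting it is unstable
```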
And there is a very simple fix that we can consider, which we're going to call either smoothing the parameter estimates or regularizing our EM updates. This simple fix is just to not let our variance parameters go to zero. And to do this, every time we go to estimate our covariance, we're simply going to add a very small diagonal term to that covariance matrix to ensure that none of the variances are zero.

As an alternative, you can take a more formal Bayesian approach and place priors over the model parameters: the cluster proportions, the means, and the covariances of the mixture model. And what you can view this as is placing what are called pseudo-observations in each one of the clusters. So it's as if you've fixed some set of observations with some values in every cluster, and that's going to bias our maximum likelihood estimates to account for those observations. And because of that structure, it's going to avoid cases where you degenerate by not having any actual observations of a certain value.

So we can think of this as smoothing our parameter estimates, but in this Bayesian approach it would be smoothing the cluster proportions and the means and the covariances, rather than the simple approach that we've specified here, where we're just smoothing our covariance estimate.

And you're going to see this type of smoothing when you go to do EM for document clustering in the assignment.
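As a concrete sketch of the simple diagonal-regularization fix described above, here is one way the M-step covariance update for a single cluster might look. The epsilon value and the function shape are illustrative, not the assignment's exact update.

```python
import numpy as np

def m_step_covariance(X, resp_k, mu_k, eps=1e-6):
    """Responsibility-weighted covariance for one cluster, with a small
    diagonal term added so that no variance can collapse to exactly zero."""
    n_k = resp_k.sum()                        # effective number of points in cluster k
    diff = X - mu_k                           # (N, d) residuals from the cluster mean
    cov_k = (resp_k[:, None] * diff).T @ diff / n_k
    return cov_k + eps * np.eye(X.shape[1])   # smoothed / regularized estimate
```

The Bayesian pseudo-observation view mentioned above leads to a similar-looking correction, except that the cluster proportions and means get smoothed toward prior values as well, not just the covariances.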