In our third module, we continued this discussion of clustering, but within the context of probabilistic model-based approaches, specifically mixture models. We discussed how mixture models allow us to capture uncertainty in the assignment of data points to clusters, and we also motivated the use of mixture models as opposed to k-means by describing some of the failure modes of k-means. Because k-means assumes spherically symmetric clusters and computes assignments of data points to clusters based only on the cluster centers, there are a number of scenarios in which k-means just really can't capture the underlying cluster structure very well.

So in our discussion of mixture models, we used the very visually appealing application of trying to group images into related classes. For example, here we might have an unlabeled, jumbled set of images that we somehow want to group into, in this case, sunset images, forest images, and cloud images. But we talked about the fact that we just have this data that's all mixed together, and the question is how we're going to handle this. Well, in our probabilistic approach, we treat the observed quantity, which we represented just as an RGB vector, as a random quantity and look at distributions on that vector.

So the distribution can have a really complicated form, because it's mixing over these different classes. But to think about unjumbling this mess, we talked about introducing cluster-specific Gaussian components, or possibly other distributions, though in this application we're going to be looking at Gaussians. And then we said that our mixture model is just a weighted collection of these different components, with weights based on the relative proportions of each of these classes in the data set.

More formally, we said that our mixture of Gaussians model was defined in terms of a collection of Gaussian components, each defined with a specific mean and variance, as well as a mixture weight, which says how heavily that class is represented in the mixture. All of these pictures are just for one-dimensional data, but we talked about how we think of defining mixtures of Gaussians in higher dimensions.
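As a concrete reference for this definition, here is a minimal sketch, not taken from the course, of evaluating a one-dimensional mixture-of-Gaussians density with NumPy and SciPy; the component means, variances, and weights are made-up illustrative values.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D mixture of Gaussians: K components, each with a mean mu_k,
# a variance sigma2_k, and a mixture weight pi_k (the weights sum to 1).
weights = np.array([0.5, 0.3, 0.2])      # pi_k: relative proportion of each class
means = np.array([-2.0, 0.0, 3.0])       # mu_k
variances = np.array([0.5, 1.0, 2.0])    # sigma^2_k

def mixture_density(x):
    """p(x) = sum_k pi_k * N(x | mu_k, sigma^2_k)."""
    return sum(w * norm.pdf(x, loc=m, scale=np.sqrt(v))
               for w, m, v in zip(weights, means, variances))

print(mixture_density(0.0))  # density of the mixture at x = 0
```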
And then we turned back to our document clustering task, where, remember, our document representation, if we think of using tf-idf or just simple word-count vectors, has dimension equal to capital V, the number of words in our vocabulary. So we talked about some of the challenges of scaling our mixture models to these high-dimensional scenarios. And in particular, instead of allowing arbitrary Gaussian components in the mixture model, we said that in practice one would often want to specify just a diagonal form for the Gaussian covariance matrix to dramatically reduce the number of parameters. But we said that this form is still much more flexible than what you have in k-means, and in particular we're still going to be learning the weightings on these different dimensions in our mixture of Gaussians model.

And then, having specified our mixture of Gaussians model, so understanding the components of this probabilistic model, we returned to how to do inference in this model. Remember, we're just going to present the algorithm with a data set, so a set of points that are unlabeled, and we discussed an algorithm in particular called expectation maximization, or the EM algorithm, that allows us to jointly infer both the model parameters as well as these soft assignments that describe our uncertainty in the assignment of data points to specific clusters.

So in particular, the EM algorithm is an iterative algorithm where the first step, which is the E-step, forms these soft assignments based on our current estimate of the model parameters. Then the second step, which is called the M-step, computes the maximum likelihood estimate of our model parameters given the soft assignments at the current iteration, and then we iterate again and again between these steps until convergence. And just like k-means, we showed that EM can be described as a coordinate-ascent algorithm, so we get the same types of properties, where you get convergence to a local mode of the objective.

So just like k-means, initialization matters a lot, and typically people rerun the algorithm many times and choose the best solution. Another thing that we described was that for our mixture models we have a lot more parameters than we do for k-means, but we presented some ways to think about handling, or at least ameliorating, these issues of overfitting.
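As a rough illustration of these two steps, and not code from the course, here is a minimal sketch of the E-step and M-step for a mixture of Gaussians with diagonal covariances, written with NumPy; the function names and array layout are my own assumptions.

```python
import numpy as np

def e_step(X, weights, means, variances):
    """E-step: soft assignments r[i, k] proportional to pi_k * N(x_i | mu_k, diag(sigma2_k))."""
    N, D = X.shape
    K = len(weights)
    log_resp = np.zeros((N, K))
    for k in range(K):
        diff = X - means[k]                                       # (N, D)
        log_resp[:, k] = (np.log(weights[k])
                          - 0.5 * np.sum(np.log(2 * np.pi * variances[k]))
                          - 0.5 * np.sum(diff ** 2 / variances[k], axis=1))
    log_resp -= log_resp.max(axis=1, keepdims=True)               # numerical stability
    resp = np.exp(log_resp)
    return resp / resp.sum(axis=1, keepdims=True)                 # rows sum to 1

def m_step(X, resp):
    """M-step: maximum likelihood estimates of the parameters given the soft assignments."""
    Nk = resp.sum(axis=0)                                         # effective count per cluster
    weights = Nk / X.shape[0]                                     # mixture weights
    means = (resp.T @ X) / Nk[:, None]                            # weighted means
    variances = np.stack([
        (resp[:, k:k + 1] * (X - means[k]) ** 2).sum(axis=0) / Nk[k]
        for k in range(len(Nk))])                                 # diagonal covariances
    return weights, means, variances

# Starting from some initialization, iterate e_step and m_step until the
# parameters (or the log-likelihood) stop changing.
```

In practice one would also put a small floor on the variances and monitor the log-likelihood to decide when to stop iterating.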
And in pictures, we presented a set of illustrations showing, in practice, how over iterations of the EM steps we're able to learn soft assignments of observations to specific clusters as well as the cluster parameters. Even at convergence of the algorithm, we can still get uncertainty about whether a given data point is assigned to one cluster versus another cluster. And this is actually something we want; we want these uncertainties as part of the output of this algorithm.

And finally, at the end of the third module, we related very formally the expectation maximization algorithm for mixtures of Gaussians to k-means. In particular, we said that if we look at a mixture of Gaussians with spherically symmetric Gaussians, so diagonal covariances with the same elements along the diagonal, and then we shrink that variance to 0 and run our EM algorithm, the output is exactly the same as k-means, because what we're going to end up doing is making hard assignments of data points to specific clusters based just on the distance to the cluster center. So indeed, EM for mixtures of Gaussians is really a generalization of the k-means algorithm.
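To make that limiting argument tangible, here is a small sketch, again my own illustration rather than material from the lecture, showing how the soft assignments for equal-weight, spherical Gaussians collapse to hard nearest-center assignments as the shared variance shrinks toward zero.

```python
import numpy as np

def soft_assignments(X, centers, sigma2):
    """Responsibilities under equal-weight spherical Gaussians with shared variance sigma2."""
    # Squared distance from every point to every cluster center: shape (N, K).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    scores = -d2 / (2 * sigma2)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    r = np.exp(scores)
    return r / r.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [2.1, 0.0]])            # two toy data points
centers = np.array([[0.0, 0.0], [2.0, 0.0]])      # two cluster centers

for sigma2 in [10.0, 1.0, 0.01]:
    print(sigma2, np.round(soft_assignments(X, centers, sigma2), 3))
# As sigma2 -> 0, each row approaches a one-hot vector: every point is hard-assigned
# to its nearest center, which is exactly the k-means assignment step.
```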