[MUSIC] Continuing our story so far: we've shown that if we know the cluster parameters, then the soft assignments are very easy to compute. Now let's consider the alternative. Let's assume that we know the cluster assignments and we want to infer our cluster parameters. And for this part, we're going to assume that we have hard assignments of observations to clusters. In pictures, these hard assignments are shown by the colors of each data point. When we have these hard assignments, we just have three different colors in this plot, rather than the spectrum of colors that we showed earlier when we were talking about soft assignments.

But once we've fixed these cluster assignments, a question is: does this data point here, which is assigned to this green cluster, influence our parameter estimation problem for estimating the parameters of the fuchsia cluster or the blue cluster? And the answer is no. Only the observations that are assigned to a given cluster inform the parameters of that cluster.

We saw this in k-means when we updated the means using just the observations that were assigned to a given cluster. But now we're going to have a more general form of update, where we're not just updating the centers of these clusters, but also their shapes. Just to emphasize: fixing the cluster assignments means that our estimation problem decouples over our different clusters.

Let's go back to our image clustering task and assume that we store our data, the RGB values associated with each image, as well as these hard cluster assignments, in a table as shown here. The first thing we're going to do is split up this table based on these hard cluster assignments. So we're going to have one table of data points assigned to cluster three, one table of data points assigned to cluster two, and another table of data points assigned to cluster one. And then we're going to consider each one of these data tables completely independently when we're forming our parameter estimates for each one of these clusters. So let's look, for example, at the data points associated with cluster three.
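To make this table-splitting concrete, here is a minimal sketch (my own illustration, not the course's code; the toy RGB values and the use of NumPy are assumptions) of partitioning the data table by hard assignments:

```python
import numpy as np

# Hypothetical toy table: each row is an image's average RGB values,
# and z holds that row's hard cluster assignment.
X = np.array([[0.9, 0.1, 0.1],
              [0.8, 0.2, 0.1],
              [0.2, 0.9, 0.2],
              [0.1, 0.8, 0.1],
              [0.1, 0.2, 0.9]])
z = np.array([1, 1, 2, 2, 3])  # hard cluster label for each row

# Fixing the assignments decouples estimation: each cluster's
# parameters are computed from its own sub-table only.
tables = {k: X[z == k] for k in np.unique(z)}
for k, Xk in tables.items():
    print(f"cluster {k}: {Xk.shape[0]} rows")
```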
To form our parameter estimates, what we're going to do is something called maximum likelihood estimation, which we saw in the classification course: we search over all possible parameter settings and find the settings that maximize the likelihood of our observed data under our specified model.

So remember that each one of our clusters is specified by a Gaussian distribution that has two parameters: a mean and a covariance. Let's spend a little bit of time talking about the form of the maximum likelihood estimate of these two parameters. Well, for the mean, the maximum likelihood estimate is exactly the sample mean. So what we're going to do is simply average the data points that are in cluster k. Here, Nk represents the number of observations in cluster k, and we're summing over the indices of the data points in this table that had a hard assignment to cluster 3, for example. So we would literally sum these vectors and then divide by three, the total number of observations. Not to be confused with the fact that three is also the label of this cluster; that's not what I mean. I mean the fact that there are three rows in this table.

Okay, so that provides our estimate of the mean, and we denote our estimate with this little hat here. And if you look at this equation, you see that it's exactly the same equation that we used in k-means when we went to update the cluster centers: we looked at all data points assigned to a given cluster and just computed the average of those data points.

Okay. But now we also have this covariance term, which determines the spread and orientation of the ellipses of the Gaussian. And for this, the maximum likelihood estimate is given by what's called the sample covariance estimate, where first we subtract the estimated mean from each of our data points, and then we compute the outer products of these differences, again summing only over the data points in cluster k and dividing by the total number of observations in cluster k.
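Written out, the two estimates just described are as follows (using z_i to denote the hard assignment of observation x_i, notation introduced here for compactness):

$$\hat{\mu}_k = \frac{1}{N_k} \sum_{i:\, z_i = k} x_i, \qquad \hat{\Sigma}_k = \frac{1}{N_k} \sum_{i:\, z_i = k} \left(x_i - \hat{\mu}_k\right)\left(x_i - \hat{\mu}_k\right)^\top$$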
So in the scalar case, our estimate of the variance in the kth cluster would be equal to 1 over the number of observations in that cluster, summing over the observations in cluster k. And here, instead of this transpose (which we need because the data are vectors), we would just have scalars: xi minus mu hat k, which is a single number, squared.

And finally, for our cluster proportions, we simply count the number of observations in cluster k and divide by the total number of observations. That forms our estimate of the weight on the kth cluster. So remember, pi k was the weight on the kth cluster, and the hat represents the fact that this is our estimate, in particular our maximum likelihood estimate. And I want to emphasize here that the form of pi hat k is not specific to mixtures of Gaussians, which is what we've been focusing on in this module; it would also hold if you had, for example, mixtures of multinomials or mixtures of other distributions. But of course, when we talked about the mean and covariance estimates on the previous slide, that was specific to having Gaussians defining each one of our clusters.

So in summary, if we knew the cluster assignments, the assignments of each data point to a given cluster, then computing the estimates of the cluster parameters is very, very straightforward. But again, we don't know these hard assignments. So what are we going to do? [MUSIC]
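For reference, the scalar variance and cluster proportion estimates described above can be written as

$$\hat{\sigma}_k^2 = \frac{1}{N_k} \sum_{i:\, z_i = k} \left(x_i - \hat{\mu}_k\right)^2, \qquad \hat{\pi}_k = \frac{N_k}{N}$$

And here is a short Python sketch (again my own illustration, not the course's code; the function name and NumPy usage are assumptions) that puts all three maximum likelihood estimates together given hard assignments:

```python
import numpy as np

def mle_cluster_params(X, z, K):
    """Maximum likelihood estimates of mixture-of-Gaussians parameters
    given hard assignments.

    X : (N, d) array of observations
    z : (N,) array of hard cluster labels in {0, ..., K-1}
    Assumes every cluster has at least one assigned observation.
    """
    N = X.shape[0]
    mus, Sigmas, pis = [], [], []
    for k in range(K):
        Xk = X[z == k]                   # only cluster k's rows
        Nk = Xk.shape[0]
        mu_k = Xk.mean(axis=0)           # sample mean
        diffs = Xk - mu_k
        Sigma_k = diffs.T @ diffs / Nk   # sum of outer products over Nk
        mus.append(mu_k)
        Sigmas.append(Sigma_k)
        pis.append(Nk / N)               # cluster proportion
    return np.array(mus), np.array(Sigmas), np.array(pis)
```

Note that each cluster's mean and covariance are computed from its own sub-table only, which is exactly the decoupling discussed above; the proportions are the one place where the total count N across all clusters enters.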