Well, to move towards something that we might be able to do, let's talk about what would change if we only knew the soft assignments instead of hard assignments. That is, how would we estimate our cluster parameters from a set of soft assignments of observations to our set of clusters?

Well, now, instead of having an observation in a single cluster and no other cluster, each observation is really going to be in every cluster. It's going to have some split allocation across our different clusters based on the responsibility vector, the soft assignment of that observation. And so we're going to think about how to update our cluster parameters based on allocating a single observation to our entire set of clusters.

So when we go to do maximum likelihood estimation from soft assignments, what we're going to do is form a data table where, for every observation, every row of this table, we introduce a set of weights that corresponds to the responsibility vector. This looks very similar to what we did in boosting with weighted observations. But now, instead of having a single weight associated with each row, each data point, we have a set of weights, and this set of weights sums to one.
But just like in boosting with weighted observations, these weights are going to modify the row operations that we perform in doing our maximum likelihood estimation. When we have these split allocations of data points across our multiple clusters, though, it's hard to think about counting observations in a cluster. So in this case, what we can do is compute the total weight in each one of these clusters, which is just the sum of the responsibilities in that cluster, and think of this as the effective number of observations in each one of our clusters.

Then what we're going to do is pretty similar to what we did when we had hard assignments, where we form cluster-specific data tables. But now, instead of dividing up our data table into the observations in cluster one, the ones in cluster two, and cluster three, every observation is going to appear in each of these data tables, because each cluster takes some responsibility for that observation. Of course that responsibility could be zero, but generically, let's think of each cluster taking some responsibility for a given observation. And so we're going to form a table where we specify the cluster weights associated with each one of our observations.
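The "effective number of observations" described above is just a column sum of the responsibility matrix. A minimal sketch in NumPy, using a small made-up responsibility matrix for illustration:

```python
import numpy as np

# Responsibilities: one row per observation, one column per cluster.
# Each row sums to 1 (the soft assignment of that observation).
resp = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.4, 0.4, 0.2],
])

# Effective number of observations in each cluster:
# sum the responsibilities down each column.
N_soft = resp.sum(axis=0)
print(N_soft)  # approximately [1.2, 1.4, 0.4]

# The effective counts over all clusters add up to the
# total number of observations, since each row sums to 1.
print(N_soft.sum())  # approximately 3.0
```

Note that the effective counts need not be integers, which is exactly why we speak of an "effective" rather than an actual count.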
So here would be the table associated with cluster one, and here are the tables for cluster two and cluster three. Then, for each one of our clusters, we're going to compute a maximum likelihood estimate of the cluster parameters, but with these weights modifying the row operations that we're doing.

So here are the updated maximum likelihood estimates, where we're accounting for the weights on our observations, which we see here. In particular, we see that in every row operation, every time we touch a given data element x_i, it's going to be multiplied by r_ik.

The other thing I want to note is that when we're computing the total number of observations in the cluster, we're now going to call this N_k^soft, based on our set of soft assignments. This is going to be the effective number of observations in that cluster, which is just the sum of the responsibilities in that cluster.

We make a similar modification when we go to estimate our cluster proportions, where now, instead of calculating the total number of observations in a cluster, we compute the effective number of observations in that cluster. And then we simply divide by the total number of observations in the data set, which is equivalent to the total effective number of observations, just the sum of this vector here.
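The weighted row operations described above can be sketched in NumPy. This is a simplified illustration, not the course's reference implementation; the data and responsibilities are made up, and `mu_hat` and `cov_hat` stand for the weighted maximum likelihood estimates of the cluster means and covariances:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [5.0, 4.0]])      # N=3 observations, d=2 features
resp = np.array([[0.7, 0.3],
                 [0.2, 0.8],
                 [0.5, 0.5]])   # soft assignments to K=2 clusters

N_soft = resp.sum(axis=0)       # effective counts, shape (K,)

# Weighted means: each x_i contributes r_ik * x_i to cluster k's sum,
# and we divide by the effective count rather than a raw count.
mu_hat = (resp.T @ X) / N_soft[:, None]   # shape (K, d)

def cov_hat(k):
    # Weighted covariance for cluster k: outer products
    # (x_i - mu_k)(x_i - mu_k)^T, each weighted by r_ik.
    diff = X - mu_hat[k]
    return (resp[:, k, None] * diff).T @ diff / N_soft[k]
```

Every place the hard-assignment estimates summed over "observations in cluster k", the soft version sums over all observations with weight r_ik, which is exactly the modification to the row operations the lecture describes.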
So, in equations, our updated estimate of our cluster proportions is written as follows, where we simply replace N_k with this N_k^soft. Note that if our responsibilities just took values zero or one, that is, if we were making hard assignments of observations to a cluster, then things would default back to what we showed in the previous section. In particular, we can think of this responsibility vector, which now just has a single one in one of these clusters, as representing a one-hot encoding of the cluster assignment.

So to see the equivalence between what we have here and what we showed in the previous section, we can look at our set of equations for the maximum likelihood estimates based on soft assignments. And we can show that if we actually plug in hard assignments, then we get out the set of equations we presented in section 2A.

To begin with, we can look at our estimate of the cluster proportions and note that if r_ik just takes values in the set {0, 1}, then this sum is just going to count observation i in cluster k if r_ik equals one. So this sum here is just going to default to counting the number of observations in cluster k.
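In symbols, the proportion update described above can be written as follows (using the standard mixture-model notation, with $N$ observations and responsibilities $r_{ik}$):

$$
\hat{\pi}_k = \frac{N_k^{\text{soft}}}{N},
\qquad
N_k^{\text{soft}} = \sum_{i=1}^{N} r_{ik}.
$$

When each $r_{ik} \in \{0, 1\}$, the sum $N_k^{\text{soft}}$ counts exactly the observations hard-assigned to cluster $k$, so $\hat{\pi}_k$ reduces to $N_k / N$, the hard-assignment estimate from the previous section.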
And that's exactly what we had before. Now, if we go to our estimate of the mean and think about multiplying by r_ik, we're only going to be adding x_i to this sum (remember, our sum here is over all the data points) if x_i is in cluster k. So we only add x_i if i is in cluster k, that is, if r_ik equals one. And in this case, this equation reduces to 1/N_k times the sum over i in cluster k of x_i, which again is exactly what we had before.

And finally, for the estimate of the covariance, it's going to be the same as above, where we're only going to be adding these outer products if observation i is in cluster k. And things default to the same equation as in the hard assignment case. Just to make this explicit, it's 1/N_k times the sum over i in cluster k of (x_i minus mu hat k)(x_i minus mu hat k) transpose.

So this at least serves as a little sanity check that the equations we presented in this section for our maximum likelihood estimates based on soft assignments might at least not be wrong, because we showed that if we plug in a hard assignment, we get out the maximum likelihood estimates that we knew to be true from before.
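This sanity check can also be run numerically. The sketch below, on made-up data and hypothetical hard labels, plugs a one-hot responsibility matrix into the soft-assignment formulas and confirms they reproduce the hard-assignment estimates of the proportions, means, and covariances:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))              # 6 observations, 2 features
labels = np.array([0, 1, 1, 0, 2, 1])    # hard cluster assignments
K = 3

# One-hot responsibilities: r_ik = 1 if observation i is in cluster k.
resp = np.eye(K)[labels]

# Soft-assignment formulas.
N_soft = resp.sum(axis=0)
pi_soft = N_soft / len(X)
mu_soft = (resp.T @ X) / N_soft[:, None]

# Compare against the hard-assignment formulas computed directly.
for k in range(K):
    members = X[labels == k]
    assert np.isclose(pi_soft[k], len(members) / len(X))
    assert np.allclose(mu_soft[k], members.mean(axis=0))
    # Covariance: weighted outer products reduce to the in-cluster average.
    diff_soft = X - mu_soft[k]
    cov_soft = (resp[:, k, None] * diff_soft).T @ diff_soft / N_soft[k]
    diff_hard = members - members.mean(axis=0)
    assert np.allclose(cov_soft, diff_hard.T @ diff_hard / len(members))

print("soft-assignment formulas reduce to the hard-assignment estimates")
```

If the assertions pass, the one-hot special case behaves exactly as the derivation in this section claims.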
What we've seen in this section is that, based just on soft assignments, it's still straightforward to compute our estimates of the cluster parameters.