Well, to move towards something that we might be able to do, let's talk about what would change if we only knew the soft assignments instead of hard assignments. That is, how would we estimate our cluster parameters from a set of soft assignments of observations to our set of clusters?

Well, now, instead of having an observation in a single cluster and no other cluster, each observation is really going to be in every cluster. It's going to have some split allocation across our different clusters based on the responsibility vector, the soft assignment of that observation. And so we're going to think about how to update our cluster parameters based on allocating a single observation to our entire set of clusters.

So when we go to do maximum likelihood estimation from soft assignments, what we're going to do is form a data table where, for every observation, every row of this table, we introduce a set of weights that corresponds to the responsibility vector. This looks very similar to what we did in boosting with weighted observations. But now, instead of having a single weight associated with each row, each data point, we have a set of weights, and this set of weights sums to one.
But just like in boosting with weighted observations, these weights are going to modify the row operations that we perform in doing our maximum likelihood estimation. When we have these split allocations of data points across our multiple clusters, though, it's hard to think about counting observations in a cluster. So in this case, what we can do is compute the total weight in each one of these clusters, which is just the sum of the responsibilities in that cluster, and think of this as the effective number of observations in each one of our clusters.

Then what we're going to do is pretty similar to what we did when we had hard assignments, where we form cluster-specific data tables. But now, instead of dividing up our data table into the observations in cluster one, the ones in cluster two, and cluster three, every observation is going to appear in each of these data tables, because each cluster takes some responsibility for that observation. Of course that responsibility could be zero, but generically, let's think of each cluster taking some responsibility for a given observation. And so we're going to form a table where we specify the cluster weights associated with each one of our observations.
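The "effective number of observations" described above is just a column sum of the responsibility matrix. A minimal sketch in NumPy, using a small made-up responsibility matrix for illustration:

```python
import numpy as np

# Responsibilities: one row per observation, one column per cluster.
# Each row sums to 1 (the soft assignment of that observation).
resp = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.4, 0.4, 0.2],
])

# Effective number of observations in each cluster:
# sum the responsibilities down each column.
N_soft = resp.sum(axis=0)
print(N_soft)  # approximately [1.2, 1.4, 0.4]

# The effective counts over all clusters add up to the
# total number of observations, since each row sums to 1.
print(N_soft.sum())  # approximately 3.0
```

Note that the effective counts need not be integers, which is exactly why we speak of an "effective" rather than an actual count.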
So here would be the table associated with cluster one, and here are the tables for cluster two and cluster three. Then, for each one of our clusters, we're going to compute a maximum likelihood estimate of the cluster parameters, but with these weights modifying the row operations that we're doing.

So here are the updated maximum likelihood estimates, where we're accounting for the weights on our observations, which we see here. In particular, we see that in every row operation, every time we touch a given data element x_i, it's going to be multiplied by r_ik.

The other thing I want to note is that when we're computing the total number of observations in the cluster, we're now going to call this N_k^soft, based on our set of soft assignments. This is going to be the effective number of observations in that cluster, which is just the sum of the responsibilities in that cluster.

We make a similar modification when we go to estimate our cluster proportions, where now, instead of calculating the total number of observations in a cluster, we compute the effective number of observations in that cluster. And then we simply divide by the total number of observations in the data set, which is equivalent to the total effective number of observations, just the sum of this vector here.
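The weighted row operations described above can be sketched in NumPy. This is a simplified illustration, not the course's reference implementation; the data and responsibilities are made up, and `mu_hat` and `cov_hat` stand for the weighted maximum likelihood estimates of the cluster means and covariances:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [5.0, 4.0]])      # N=3 observations, d=2 features
resp = np.array([[0.7, 0.3],
                 [0.2, 0.8],
                 [0.5, 0.5]])   # soft assignments to K=2 clusters

N_soft = resp.sum(axis=0)       # effective counts, shape (K,)

# Weighted means: each x_i contributes r_ik * x_i to cluster k's sum,
# and we divide by the effective count rather than a raw count.
mu_hat = (resp.T @ X) / N_soft[:, None]   # shape (K, d)

def cov_hat(k):
    # Weighted covariance for cluster k: outer products
    # (x_i - mu_k)(x_i - mu_k)^T, each weighted by r_ik.
    diff = X - mu_hat[k]
    return (resp[:, k, None] * diff).T @ diff / N_soft[k]
```

Every place the hard-assignment estimates summed over "observations in cluster k", the soft version sums over all observations with weight r_ik, which is exactly the modification to the row operations the lecture describes.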
So, in equations, our updated estimate of our cluster proportions is written as follows, where we simply replace N_k with this N_k^soft. Note that if our responsibilities just took values zero or one, that is, if we were making hard assignments of observations to a cluster, then things would default back to what we showed in the previous section. In particular, we can think of this responsibility vector, which now just has a single one in one of these clusters, as representing a one-hot encoding of the cluster assignment.

So to see the equivalence between what we have here and what we showed in the previous section, we can look at our set of equations for the maximum likelihood estimates based on soft assignments. And we can show that if we actually plug in hard assignments, then we get out the set of equations we presented in section 2A.

To begin with, we can look at our estimate of the cluster proportions and note that if r_ik just takes values in the set {0, 1}, then this sum is just going to count observation i in cluster k if r_ik equals one. So this sum here is just going to default to counting the number of observations in cluster k.
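In symbols, the proportion update described above can be written as follows (using the standard mixture-model notation, with $N$ observations and responsibilities $r_{ik}$):

$$
\hat{\pi}_k = \frac{N_k^{\text{soft}}}{N},
\qquad
N_k^{\text{soft}} = \sum_{i=1}^{N} r_{ik}.
$$

When each $r_{ik} \in \{0, 1\}$, the sum $N_k^{\text{soft}}$ counts exactly the observations hard-assigned to cluster $k$, so $\hat{\pi}_k$ reduces to $N_k / N$, the hard-assignment estimate from the previous section.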
And that's exactly what we had before. Now, if we go to our estimate of the mean and think about multiplying by r_ik, we're only going to be adding x_i to this sum (remember, our sum here is over all the data points) if x_i is in cluster k. So we only add x_i if i is in cluster k, that is, if r_ik equals one. And in this case, this equation reduces to 1/N_k times the sum over i in cluster k of x_i, which again is exactly what we had before.

And finally, for the estimate of the covariance, it's going to be the same as above, where we're only going to be adding these outer products if observation i is in cluster k. And things default to the same equation as in the hard assignment case. Just to make this explicit, it's 1/N_k times the sum over i in cluster k of (x_i minus mu hat k)(x_i minus mu hat k) transpose.

So this at least serves as a little sanity check that the equations we presented in this section for our maximum likelihood estimates based on soft assignments might at least not be wrong, because we showed that if we plug in a hard assignment, we get out the maximum likelihood estimates that we knew to be true from before.
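This sanity check can also be run numerically. The sketch below, on made-up data and hypothetical hard labels, plugs a one-hot responsibility matrix into the soft-assignment formulas and confirms they reproduce the hard-assignment estimates of the proportions, means, and covariances:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))              # 6 observations, 2 features
labels = np.array([0, 1, 1, 0, 2, 1])    # hard cluster assignments
K = 3

# One-hot responsibilities: r_ik = 1 if observation i is in cluster k.
resp = np.eye(K)[labels]

# Soft-assignment formulas.
N_soft = resp.sum(axis=0)
pi_soft = N_soft / len(X)
mu_soft = (resp.T @ X) / N_soft[:, None]

# Compare against the hard-assignment formulas computed directly.
for k in range(K):
    members = X[labels == k]
    assert np.isclose(pi_soft[k], len(members) / len(X))
    assert np.allclose(mu_soft[k], members.mean(axis=0))
    # Covariance: weighted outer products reduce to the in-cluster average.
    diff_soft = X - mu_soft[k]
    cov_soft = (resp[:, k, None] * diff_soft).T @ diff_soft / N_soft[k]
    diff_hard = members - members.mean(axis=0)
    assert np.allclose(cov_soft, diff_hard.T @ diff_hard / len(members))

print("soft-assignment formulas reduce to the hard-assignment estimates")
```

If the assertions pass, the one-hot special case behaves exactly as the derivation in this section claims.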
What we've seen in this section is that, based just on soft assignments, it's still straightforward to compute our estimates of the cluster parameters.