[MUSIC] Continuing our story so far: we've shown that if we know the cluster parameters, then the soft assignments are very easy to compute. Now let's consider the alternative. Let's assume that we know the cluster assignments and we want to infer our cluster parameters. And for this part, we're going to assume that we have hard assignments of observations to clusters. In pictures, these hard assignments are shown by the colors of each data point. When we have these hard assignments, we just have three different colors in this plot, rather than the spectrum of colors that we showed earlier when we were talking about soft assignments.

But once we've fixed these cluster assignments, a question is: does this data point here, which is assigned to this green cluster, influence our parameter estimation problem for estimating the parameters of the fuchsia cluster or the blue cluster? And the answer is no. Only the observations that are assigned to a given cluster inform the parameters of that cluster.

We saw this in k-means when we updated the means using just the observations that were assigned to a given cluster. But now we're going to have a more general form of update, where we're not just updating the centers of these clusters, but also their shapes. Just to emphasize: fixing the cluster assignments means that our estimation problem decouples over our different clusters.

Let's go back to our image clustering task and assume that we store our data, the RGB values associated with each image, as well as these hard cluster assignments, in a table as shown here. The first thing we're going to do is split up this table based on these hard cluster assignments. So we're going to have one table of data points assigned to cluster three, one table of data points assigned to cluster two, and another table of data points assigned to cluster one. And then we're going to consider each one of these data tables completely independently when we're forming our parameter estimates for each one of these clusters. So let's look, for example, at the data points associated with cluster three.
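To make this table-splitting concrete, here is a minimal sketch (my own illustration, not the course's code; the toy RGB values and the use of NumPy are assumptions) of partitioning the data table by hard assignments:

```python
import numpy as np

# Hypothetical toy table: each row is an image's average RGB values,
# and z holds that row's hard cluster assignment.
X = np.array([[0.9, 0.1, 0.1],
              [0.8, 0.2, 0.1],
              [0.2, 0.9, 0.2],
              [0.1, 0.8, 0.1],
              [0.1, 0.2, 0.9]])
z = np.array([1, 1, 2, 2, 3])  # hard cluster label for each row

# Fixing the assignments decouples estimation: each cluster's
# parameters are computed from its own sub-table only.
tables = {k: X[z == k] for k in np.unique(z)}
for k, Xk in tables.items():
    print(f"cluster {k}: {Xk.shape[0]} rows")
```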
To form our parameter estimates, what we're going to do is something called maximum likelihood estimation, which we saw in the classification course: we search over all possible parameter settings and find the settings that maximize the likelihood of our observed data under our specified model.

So remember that each one of our clusters is specified by a Gaussian distribution that has two parameters: a mean and a covariance. Let's spend a little bit of time talking about the form of the maximum likelihood estimate of these two parameters. Well, for the mean, the maximum likelihood estimate is exactly the sample mean. So what we're going to do is simply average the data points that are in cluster k. Here, Nk represents the number of observations in cluster k, and we're summing over the indices of the data points in this table that had a hard assignment to cluster 3, for example. So we would literally sum these vectors and then divide by three, the total number of observations. Not to be confused with the fact that three is also the label of this cluster; that's not what I mean. I mean the fact that there are three rows in this table.

Okay, so that provides our estimate of the mean, and we denote our estimate with this little hat here. And if you look at this equation, you see that it's exactly the same equation that we used in k-means when we went to update the cluster centers: we looked at all data points assigned to a given cluster and just computed the average of those data points.

Okay. But now we also have this covariance term, which determines the spread and orientation of the ellipses of the Gaussian. And for this, the maximum likelihood estimate is given by what's called the sample covariance estimate, where first we subtract the estimated mean from each of our data points, and then we compute the outer products of these differences, again summing only over the data points in cluster k and dividing by the total number of observations in cluster k.
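Written out, the two estimates just described are as follows (using z_i to denote the hard assignment of observation x_i, notation introduced here for compactness):

$$\hat{\mu}_k = \frac{1}{N_k} \sum_{i:\, z_i = k} x_i, \qquad \hat{\Sigma}_k = \frac{1}{N_k} \sum_{i:\, z_i = k} \left(x_i - \hat{\mu}_k\right)\left(x_i - \hat{\mu}_k\right)^\top$$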
So in the scalar case, our estimate of the variance in the kth cluster would be equal to 1 over the number of observations in that cluster, summing over the observations in cluster k. And here, instead of this transpose (which we need because the data are vectors), we would just have scalars: xi minus mu hat k, which is a single number, squared.

And finally, for our cluster proportions, we simply count the number of observations in cluster k and divide by the total number of observations. That forms our estimate of the weight on the kth cluster. So remember, pi k was the weight on the kth cluster, and the hat represents the fact that this is our estimate, in particular our maximum likelihood estimate. And I want to emphasize here that the form of pi hat k is not specific to mixtures of Gaussians, which is what we've been focusing on in this module; it would also hold if you had, for example, mixtures of multinomials or mixtures of other distributions. But of course, when we talked about the mean and covariance estimates on the previous slide, that was specific to having Gaussians defining each one of our clusters.

So in summary, if we knew the cluster assignments, the assignments of each data point to a given cluster, then computing the estimates of the cluster parameters is very, very straightforward. But again, we don't know these hard assignments. So what are we going to do? [MUSIC]
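For reference, the scalar variance and cluster proportion estimates described above can be written as

$$\hat{\sigma}_k^2 = \frac{1}{N_k} \sum_{i:\, z_i = k} \left(x_i - \hat{\mu}_k\right)^2, \qquad \hat{\pi}_k = \frac{N_k}{N}$$

And here is a short Python sketch (again my own illustration, not the course's code; the function name and NumPy usage are assumptions) that puts all three maximum likelihood estimates together given hard assignments:

```python
import numpy as np

def mle_cluster_params(X, z, K):
    """Maximum likelihood estimates of mixture-of-Gaussians parameters
    given hard assignments.

    X : (N, d) array of observations
    z : (N,) array of hard cluster labels in {0, ..., K-1}
    Assumes every cluster has at least one assigned observation.
    """
    N = X.shape[0]
    mus, Sigmas, pis = [], [], []
    for k in range(K):
        Xk = X[z == k]                   # only cluster k's rows
        Nk = Xk.shape[0]
        mu_k = Xk.mean(axis=0)           # sample mean
        diffs = Xk - mu_k
        Sigma_k = diffs.T @ diffs / Nk   # sum of outer products over Nk
        mus.append(mu_k)
        Sigmas.append(Sigma_k)
        pis.append(Nk / N)               # cluster proportion
    return np.array(mus), np.array(Sigmas), np.array(pis)
```

Note that each cluster's mean and covariance are computed from its own sub-table only, which is exactly the decoupling discussed above; the proportions are the one place where the total count N across all clusters enters.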