In our third module, we continued this discussion of clustering, but within the context of probabilistic model-based approaches, specifically mixture models. We discussed how mixture models allow us to capture uncertainty in the assignment of data points to clusters, and we also motivated the use of mixture models as opposed to k-means by describing some of the failure modes of k-means. Because k-means assumes spherically symmetric clusters and computes assignments of data points to clusters based only on the cluster centers, there are a number of scenarios in which k-means just really can't capture the underlying cluster structure very well.

So in our discussion of mixture models, we used the very visually appealing application of trying to group images into related classes. For example, here we might have an unlabeled, jumbled set of images that we somehow want to group into, in this case, sunset images, forest images, and cloud images. But we talked about the fact that we just have this data that's all mixed together, and the question is how we're going to handle this. Well, in our probabilistic approach, we treat the observed quantity, which we represented just as an RGB vector, as a random quantity and look at distributions on that vector.

So the distribution can have a really complicated form, because it's mixing over these different classes. But to think about unjumbling this mess, we talked about introducing cluster-specific Gaussian components, or possibly other distributions, though in this application we're going to be looking at Gaussians. And then we said that our mixture model is just a weighted collection of these different components, with weights based on the relative proportions of each of these classes in the data set.

More formally, we said that our mixture of Gaussians model was defined in terms of a collection of Gaussian components, each defined with a specific mean and variance, as well as a mixture weight, which says how heavily that class is represented in the mixture. All of these pictures are just for one-dimensional data, but we talked about how we think of defining mixtures of Gaussians in higher dimensions.
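As a concrete reference for this definition, here is a minimal sketch, not taken from the course, of evaluating a one-dimensional mixture-of-Gaussians density with NumPy and SciPy; the component means, variances, and weights are made-up illustrative values.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D mixture of Gaussians: K components, each with a mean mu_k,
# a variance sigma2_k, and a mixture weight pi_k (the weights sum to 1).
weights = np.array([0.5, 0.3, 0.2])      # pi_k: relative proportion of each class
means = np.array([-2.0, 0.0, 3.0])       # mu_k
variances = np.array([0.5, 1.0, 2.0])    # sigma^2_k

def mixture_density(x):
    """p(x) = sum_k pi_k * N(x | mu_k, sigma^2_k)."""
    return sum(w * norm.pdf(x, loc=m, scale=np.sqrt(v))
               for w, m, v in zip(weights, means, variances))

print(mixture_density(0.0))  # density of the mixture at x = 0
```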
And then we turned back to our document clustering task, where, remember, our document representation, if we think of using tf-idf or just simple word-count vectors, has dimension equal to capital V, the number of words in our vocabulary. So we talked about some of the challenges of scaling our mixture models to these high-dimensional scenarios. And in particular, instead of allowing arbitrary Gaussian components in the mixture model, we said that in practice one would often want to specify just a diagonal form for the Gaussian covariance matrix to dramatically reduce the number of parameters. But we said that this form is still much more flexible than what you have in k-means, and in particular we're still going to be learning the weightings on these different dimensions in our mixture of Gaussians model.

And then, having specified our mixture of Gaussians model, so understanding the components of this probabilistic model, we returned to how to do inference in this model. Remember, we're just going to present the algorithm with a data set, so a set of points that are unlabeled, and we discussed an algorithm in particular called expectation maximization, or the EM algorithm, that allows us to jointly infer both the model parameters as well as these soft assignments that describe our uncertainty in the assignment of data points to specific clusters.

So in particular, the EM algorithm is an iterative algorithm where the first step, which is the E-step, forms these soft assignments based on our current estimate of the model parameters. Then the second step, which is called the M-step, computes the maximum likelihood estimate of our model parameters given the soft assignments at the current iteration, and then we iterate again and again between these steps until convergence. And just like k-means, we showed that EM can be described as a coordinate-ascent algorithm, so we get the same types of properties, where you get convergence to a local mode of the objective.

So just like k-means, initialization matters a lot, and typically people rerun the algorithm many times and choose the best solution. Another thing that we described was that for our mixture models we have a lot more parameters than we do for k-means, but we presented some ways to think about handling, or at least ameliorating, these issues of overfitting.
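As a rough illustration of these two steps, and not code from the course, here is a minimal sketch of the E-step and M-step for a mixture of Gaussians with diagonal covariances, written with NumPy; the function names and array layout are my own assumptions.

```python
import numpy as np

def e_step(X, weights, means, variances):
    """E-step: soft assignments r[i, k] proportional to pi_k * N(x_i | mu_k, diag(sigma2_k))."""
    N, D = X.shape
    K = len(weights)
    log_resp = np.zeros((N, K))
    for k in range(K):
        diff = X - means[k]                                       # (N, D)
        log_resp[:, k] = (np.log(weights[k])
                          - 0.5 * np.sum(np.log(2 * np.pi * variances[k]))
                          - 0.5 * np.sum(diff ** 2 / variances[k], axis=1))
    log_resp -= log_resp.max(axis=1, keepdims=True)               # numerical stability
    resp = np.exp(log_resp)
    return resp / resp.sum(axis=1, keepdims=True)                 # rows sum to 1

def m_step(X, resp):
    """M-step: maximum likelihood estimates of the parameters given the soft assignments."""
    Nk = resp.sum(axis=0)                                         # effective count per cluster
    weights = Nk / X.shape[0]                                     # mixture weights
    means = (resp.T @ X) / Nk[:, None]                            # weighted means
    variances = np.stack([
        (resp[:, k:k + 1] * (X - means[k]) ** 2).sum(axis=0) / Nk[k]
        for k in range(len(Nk))])                                 # diagonal covariances
    return weights, means, variances

# Starting from some initialization, iterate e_step and m_step until the
# parameters (or the log-likelihood) stop changing.
```

In practice one would also put a small floor on the variances and monitor the log-likelihood to decide when to stop iterating.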
And in pictures, we presented a set of illustrations showing, in practice, how over iterations of the EM steps we're able to learn soft assignments of observations to specific clusters as well as the cluster parameters. Even at convergence of the algorithm, we can still get uncertainty about whether a given data point is assigned to one cluster versus another cluster. And this is actually something we want; we want these uncertainties as part of the output of this algorithm.

And finally, at the end of the third module, we related very formally the expectation maximization algorithm for mixtures of Gaussians to k-means. In particular, we said that if we look at a mixture of Gaussians with spherically symmetric Gaussians, so diagonal covariances with the same elements along the diagonal, and then we shrink that variance to 0 and run our EM algorithm, the output is exactly the same as k-means, because what we're going to end up doing is making hard assignments of data points to specific clusters based just on the distance to the cluster center. So indeed, EM for mixtures of Gaussians is really a generalization of the k-means algorithm.
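To make that limiting argument tangible, here is a small sketch, again my own illustration rather than material from the lecture, showing how the soft assignments for equal-weight, spherical Gaussians collapse to hard nearest-center assignments as the shared variance shrinks toward zero.

```python
import numpy as np

def soft_assignments(X, centers, sigma2):
    """Responsibilities under equal-weight spherical Gaussians with shared variance sigma2."""
    # Squared distance from every point to every cluster center: shape (N, K).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    scores = -d2 / (2 * sigma2)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    r = np.exp(scores)
    return r / r.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [2.1, 0.0]])            # two toy data points
centers = np.array([[0.0, 0.0], [2.0, 0.0]])      # two cluster centers

for sigma2 in [10.0, 1.0, 0.01]:
    print(sigma2, np.round(soft_assignments(X, centers, sigma2), 3))
# As sigma2 -> 0, each row approaches a one-hot vector: every point is hard-assigned
# to its nearest center, which is exactly the k-means assignment step.
```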