In the previous video we decided to solve the clustering problem by building a probabilistic model of our data. But how? How can we model our data probabilistically? Well, we don't know that many distributions so far, but we know Gaussians, right? In week one we learned how to fit the parameters of a Gaussian distribution to a dataset, and we could do that here; it's a reasonable thing to do. But it turns out that a single Gaussian is not the best model for this kind of data.

Recall that we decided our data consist of several clusters, or groups, which may be far away from each other. Even in this simple example, we can see that a Gaussian has to model all the data points as one big circle, or maybe an ellipse. In that case it has to assign high probability to the center of the circle; that's just the way a Gaussian works. And you can see here that the center of the Gaussian falls into the region between the clusters, where there are not that many data points. This is the problem with modeling this kind of data with one Gaussian: we have to model everything with one big circle, and there are regions between the clusters with few data points, yet the Gaussian has to assign high probability to those regions.

So what else can we do? Are there any better probabilistic models we can use here? Well, if one Gaussian doesn't work, let's use several of them, say three. In this case we are assuming that each data point came from one of the three Gaussians, and each Gaussian explains one cluster of the data points. Or, to put it more formally, the density of each data point equals a weighted sum of three Gaussian densities, where the weights pi are non-negative numbers that sum up to 1, so that we get an actual probability distribution. The parameters theta here are the weights pi and three groups of Gaussian parameters: their locations, the mu vectors, and their shapes, the covariance matrices sigma. Notice that if we succeed in fitting this kind of model to our data, we will have solved the clustering problem, because for each data point we can now find which Gaussian it came from.
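To make this concrete, here is the mixture density just described, written out in standard GMM notation (this is my transcription of what the slide describes; the slide's symbols may differ slightly):

```latex
p(x \mid \theta) = \sum_{c=1}^{3} \pi_c \, \mathcal{N}\!\left(x \mid \mu_c, \Sigma_c\right),
\qquad \pi_c \ge 0, \qquad \sum_{c=1}^{3} \pi_c = 1,
\qquad \theta = \{\pi_{1:3},\; \mu_{1:3},\; \Sigma_{1:3}\}.
```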
And this is exactly an alternative to finding the cluster index: we may say that all the points that came from one Gaussian are the points of one particular cluster. This model is sometimes called the Gaussian Mixture Model, or GMM for short.

Great, so we have found a model we would like to use. What are the positive and negative features of this model compared to just a plain single Gaussian? Well, obviously a single Gaussian is much less flexible. The Gaussian Mixture Model allowed us to fit our complicated dataset, and it actually turns out that you can approximate almost any probability distribution with a Gaussian Mixture Model to arbitrarily high accuracy. Of course, as always happens with this kind of theorem, in the worst case you may have to use exponentially many Gaussians, so it's not a very practical theorem, but in any case the Gaussian Mixture Model is a very powerful and flexible model. The downside is obviously the number of parameters: as you increase the number of Gaussians you use for your data, you increase the number of parameters by the same factor, right?

Okay, great. We decided to use this kind of Gaussian Mixture Model. But how can we fit it? How can we find its parameters, the pi, mu, and sigma vectors and matrices? Well, the simplest way to fit a probability distribution is to use maximum likelihood estimation, right? We want to find the values of the parameters which maximize the likelihood, which is the density of our dataset given the parameters, and we want to maximize this with respect to the parameters. As is usual in machine learning, we will assume that the dataset consists of N data points which are independent given the parameters, which basically means that we can factorize the likelihood: the likelihood equals the product of the likelihoods of the individual objects. One more thing we can do to simplify this expression, and to understand how to maximize it, is to substitute the definition of the marginal likelihood of x_i given the parameters, using the definition from a few slides before. So each data point's density is a mixture of Gaussian densities, right? Okay, so now we have this kind of optimization problem. And just one more thing we forgot here: we have some constraints, right?
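Putting the last few statements together, the maximum likelihood problem being set up here looks like this (again my transcription; taking the logarithm of the factorized likelihood is a standard step that does not change the maximizer):

```latex
\max_{\theta}\; \log p(X \mid \theta)
= \sum_{i=1}^{N} \log p(x_i \mid \theta)
= \sum_{i=1}^{N} \log \sum_{c=1}^{3} \pi_c \, \mathcal{N}\!\left(x_i \mid \mu_c, \Sigma_c\right).
```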
We have to say that the weights pi are non-negative and that they sum up to 1, because otherwise it would not be an actual probability distribution. But now it seems like we're good to go: we can take your favorite stochastic optimization algorithm from TensorFlow, like Adam or whatever you'd like to use, and optimize this thing to find the optimal parameters, right?

Well, it turns out that we forgot one important set of constraints here. The covariance matrices sigma cannot be arbitrary. Imagine that your optimization algorithm proposed a covariance matrix of all zeros. It just doesn't work; it doesn't define a proper Gaussian distribution, because in the definition of the Gaussian distribution you have to invert this matrix, and you have to compute its determinant and divide by it. So if you have a matrix which is all zeros, you will run into problems like division by zero. It's not a good idea to assume that the covariance matrix can be anything. Actually, the set of valid covariance matrices is the set of positive semidefinite matrices. Don't worry if you don't know what that is; it's not important right now. The important part is that it's a really hard constraint to enforce, so it's hard to adapt your favorite stochastic gradient descent method to always satisfy this kind of constraint. I don't know how to efficiently maintain the property that the matrices are always positive semidefinite, so we have a problem here: we don't know how to optimize this thing, at least not with stochastic gradient descent.

Well, it turns out that even if you sidestep this constraint by considering a simpler model, for example by saying that all the covariance matrices are diagonal, which means that the ellipsoids corresponding to the Gaussians cannot be rotated and have to be aligned with the axes, then it's much easier to maintain the constraints, and you can actually use some stochastic optimization here. For example, in this experiment I used Adam and tuned its learning rate to optimize this thing, and you can see that it's doing a reasonable job here.
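As an illustration of the diagonal-covariance workaround just described, here is a minimal sketch, not the lecturer's actual code. The lecture mentions TensorFlow, but this sketch uses PyTorch for brevity, and the function and parameter names are hypothetical. The idea is to reparameterize: a softmax keeps the weights pi non-negative and summing to 1, and an exp keeps the per-dimension variances positive, so Adam can run unconstrained on the log-likelihood.

```python
# Minimal sketch: fitting a diagonal-covariance GMM by maximizing the
# log-likelihood with Adam, using unconstrained reparameterizations.
import math
import torch

def fit_diag_gmm(X, n_components=3, n_epochs=200, lr=0.05):
    n, d = X.shape
    # Unconstrained parameters that Adam can update freely.
    logits = torch.zeros(n_components, requires_grad=True)        # -> pi via softmax
    mu = torch.randn(n_components, d, requires_grad=True)         # component means
    log_var = torch.zeros(n_components, d, requires_grad=True)    # -> variances via exp

    opt = torch.optim.Adam([logits, mu, log_var], lr=lr)
    for _ in range(n_epochs):
        opt.zero_grad()
        log_pi = torch.log_softmax(logits, dim=0)                  # (C,), sums to 1 after exp
        var = log_var.exp()                                        # (C, d), always positive
        diff = X.unsqueeze(1) - mu.unsqueeze(0)                    # (n, C, d)
        # log N(x_i | mu_c, diag(var_c)) for every pair (i, c)
        log_norm = -0.5 * (diff ** 2 / var + log_var + math.log(2 * math.pi)).sum(-1)
        # log-likelihood: sum_i log sum_c pi_c N(x_i | mu_c, Sigma_c)
        log_lik = torch.logsumexp(log_pi + log_norm, dim=1).sum()
        (-log_lik).backward()                                      # Adam minimizes, so negate
        opt.step()
    return logits.softmax(dim=0).detach(), mu.detach(), log_var.exp().detach()

# Hypothetical usage: X is an (n, d) tensor of data points.
# pi, mu, var = fit_diag_gmm(torch.randn(500, 2))
```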
The blue curve here is the performance of Adam. On the x-axis we see epochs, and on the y-axis we see the log-likelihood, which we are trying to maximize. So Adam is doing a good job, right? In something like ten epochs it optimizes this thing to something reasonable. I think the red line here is the ground truth, which I know because I know from which probability distribution I generated this data, so I know the optimal value of the log-likelihood. But it turns out that even in this case, where we don't have those very complicated constraints, you can do so much better by exploiting the structure of your problem. And this is something we're going to discuss in the rest of the week: something called expectation maximization. If you apply it here, it just works so much better: in a few iterations it finds a value which is better than the ground truth, which probably means overfitting, but anyway it also works well on the test set.

So to summarize, we have two reasons not to use stochastic gradient descent here. First of all, it may be hard to enforce some constraints which you may care about, like positive semidefinite covariance matrices. And second of all, the expectation maximization algorithm, which can exploit the structure of your problem, is sometimes much faster and more efficient. As a general summary, we discussed that the Gaussian Mixture Model is a flexible probability distribution which can solve the clustering problem for you if you fit it to your data. Sometimes it's hard to optimize with stochastic gradient descent, but there is an alternative, which we'll talk about in the next video.