On the M-step of the Expectation Maximization algorithm for the restricted Gaussian mixture model, we have to maximize the following expression. We have to find the maximum of the sum, over the individual objects in our training data set, of the expected value with respect to the variational distribution q of t_i of the joint log-likelihood of x_i and t_i given the parameters, and here the parameters are just mu. So we maximize this expression with respect to mu: mu_1 up to mu_K, where K is the number of clusters. We could maximize this thing explicitly by taking the derivative and setting it to zero, but recall that in the previous video we maximized the same expression for the general Gaussian mixture model, so we can reuse the results from that video here.

Recall that the optimal parameter for mu number c, for example, was the following: the average over all the data points in the training set, weighted by the variational distribution q of t_i equals c, which is the weight assigned to the Gaussian number c, of the data points x_i, divided by the normalization, which is the sum of all the weights q of t_i equals c. Recall also that in our particular case, on the E-step, we are using delta functions instead of the full variational distribution q. Our q looks like this: q of t_i equals one if t_i equals c_i-star, which we have already pre-computed on the E-step, and zero otherwise.

We can now simplify the expression by plugging in this definition of q of t_i. Any object whose t_i is not equal to c_i-star, that is, any object that was not assigned to the Gaussian number c, has weight q equal to zero in both the numerator and the denominator, so it does not contribute to the sum. Thus, we can write it down like this: the expression becomes a sum over only the objects assigned to the Gaussian number c.
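In symbols, the quantities just described look roughly like this (a sketch in standard notation following the spoken description; N, the number of training points, and K, the number of clusters, are assumed labels):

\[
\max_{\mu_1,\dots,\mu_K}\ \sum_{i=1}^{N}\mathbb{E}_{q(t_i)}\bigl[\log p(x_i,\, t_i \mid \mu)\bigr]
\]
\[
\mu_c \;=\; \frac{\sum_{i=1}^{N} q(t_i = c)\, x_i}{\sum_{i=1}^{N} q(t_i = c)},
\qquad
q(t_i = c) \;=\;
\begin{cases}
1 & \text{if } c = c_i^{\ast},\\
0 & \text{otherwise.}
\end{cases}
\]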
That is, for all i such that c_i-star equals c, we sum up these data points and divide them by the normalization, which is just the number of such data points: the number of i such that c_i-star equals c. And this is exactly the formula from the second sub-step of the K-means algorithm, the centroid update.

To summarize, we have just shown that if we take the M-step of the expectation maximization algorithm applied to this restricted Gaussian mixture model, where some parameters are fixed and only the locations mu of the Gaussians are left free, and if we use the restricted q, the delta function from the restricted E-step, then on the M-step we obtain this kind of formula: we update each mu_c to be the center of all the data points assigned to cluster number c, exactly as we had in K-means.

So to summarize, K-means is actually a special case of applying the expectation maximization algorithm to a Gaussian mixture model. But first of all, this Gaussian mixture model has to be a restricted one, where we fix the covariance matrices to be the identity and the prior probability of each Gaussian c to be uniform. We also have to use a simplified E-step: instead of considering the full posterior distributions on the E-step, we approximate them with delta functions. This makes the E-step and the M-step faster, but it also restricts the flexibility of the model and the efficiency of the EM algorithm. But anyway, this kind of restriction still gives you a valid EM algorithm, which is still guaranteed to converge and has nice properties.

It's really cool: we have just taken a usual machine learning algorithm for clustering, which is not probabilistic in any sense, and shown that it is a special case of the EM algorithm applied to some restricted Gaussian mixture model. Using this knowledge we can, for example, change something in K-means and know how to adapt the training algorithm to these changes. For example, we can introduce missing data: this way we can build a missing-data K-means without heuristics like setting the missing values to zero, but by just extending this restricted EM algorithm and handling the missing data naturally.
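To make this equivalence concrete, here is a minimal NumPy sketch of the restricted EM updates. This is not code from the lecture; the function names, array shapes, and the random toy data are illustrative assumptions.

```python
import numpy as np

def restricted_e_step(X, mu):
    """Simplified E-step: q(t_i) is a delta function at the closest mean,
    c_i* = argmin_c ||x_i - mu_c||^2 (covariances fixed to the identity,
    uniform prior over the Gaussians)."""
    # Squared distance from every point to every mean, shape (N, K).
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def restricted_m_step(X, c_star, K):
    """M-step with the delta-function q: mu_c is the average of the points
    assigned to cluster c -- the K-means centroid update (assumes every
    cluster received at least one point)."""
    return np.stack([X[c_star == c].mean(axis=0) for c in range(K)])

# Illustrative usage with random data and random initial means.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
mu = rng.normal(size=(3, 2))
for _ in range(10):  # alternating restricted E- and M-steps
    c_star = restricted_e_step(X, mu)
    mu = restricted_m_step(X, c_star, K=3)
```

Running this loop is exactly Lloyd's K-means iteration: the restricted E-step plays the role of the assignment step, and the restricted M-step is the centroid update.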
Actually, this trick of simplifying the E-step by considering only a restricted set of distributions q is strongly connected to something called variational inference, which we will discuss in week four.