In the previous video we sketched an algorithm to fit a Gaussian Mixture Model to a data set. Now let's see how it can look on a particular set of data. So take this one-dimensional data set of points again, and let's try to find two Gaussians which explain this data well. We start with initialization, which here means choosing two random Gaussians and placing them somewhere. Let's say we put one Gaussian around 175 and the other around 180, and the variances are around, I don't know, 5. Something like that; it is a random initialization.

On the next step we have to estimate the sources, that is, the posterior probabilities of the latent variables: from which Gaussian each data point came. Let's do it here. You can see, for example, that the rightmost data points are certainly orange, because the orange Gaussian explains them much better than the blue one, right? That makes sense. The data points in between our randomly initialized Gaussians are kind of not settled: they are half blue, half orange, and we are not sure about their exact assignment to these Gaussians. The interesting story is about the leftmost point. On the one hand, both Gaussians think this data point is kind of weird, because it has really low density according to both of them, so neither of them really expects it to appear here. But on the other hand, the blue Gaussian's density is so much higher that this data point is almost completely blue. You can see that the blue density here is around 0.01, which is low, almost zero, while the orange density is around 0.00002, which is much lower. So the posterior probability of blue is around 1, which means this data point is almost certainly blue. Although neither Gaussian thinks this data point should exist, the blue Gaussian explains it much better, just because it is closer. Okay, so this is the end of the first sub-step of the first iteration.
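To make that first sub-step (the E-step) concrete, here is a minimal numerical sketch in Python of how those posterior probabilities, the responsibilities, can be computed for two one-dimensional Gaussians. The specific heights and parameter values below are illustrative assumptions, not the lecture's actual data.

```python
import numpy as np
from scipy.stats import norm

x = np.array([160.0, 172.0, 176.0, 178.0, 181.0, 185.0])  # hypothetical heights

mu = np.array([175.0, 180.0])  # hypothetical random initialization of the means
sigma = np.array([5.0, 5.0])   # initial spread of about 5 for each Gaussian
pi = np.array([0.5, 0.5])      # equal mixture weights to start

# Density of every point under each Gaussian: shape (n_points, 2)
dens = np.stack([norm.pdf(x, mu[k], sigma[k]) for k in range(2)], axis=1)

# Responsibilities: posterior probability that each point came from each Gaussian.
# Even when both densities are tiny (e.g. around 0.01 for blue vs 0.00002 for
# orange on the leftmost point), normalizing makes the closer Gaussian's
# posterior come out near 1.
resp = pi * dens
resp /= resp.sum(axis=1, keepdims=True)
print(resp.round(3))  # each row sums to 1: "how blue / how orange" a point is
```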
We now have this kind of soft assignment: for each data point we know its latent variable, or at least a distribution over it. So now let's re-fit our Gaussians, let's re-estimate the parameters. It will look something like this: we fit one Gaussian to all the blue points, we fit the other Gaussian to all the orange points, and we get two new Gaussians. Note that the blue Gaussian is now shifted considerably to the left, just because it was assigned lots of points on the left-hand side of this plot, and it has to explain them, so it shifts to sit around these blue data points. This is the end of iteration one, and it is already a reasonable solution. We have not converged yet, but it is already an approximation of the solution, and from it we can already get some reasonable clustering: we can call all the orange points one cluster, all the blue points the other cluster, and for the points in between we can say we are not certain, so we can't say anything definite about these half-blue, half-orange data points.

But anyway, let's proceed a few iterations more. On the next iteration we re-assign: we recompute the assignments of the latent variables, so the colors of the data points change. Look carefully here; they change just a little bit. You can see that some of the data points which were half blue, half orange shifted to be just a little bit blue and much more orange. That is because our blue Gaussian is now further to the left, so it explains the data points around height 175 a little bit worse, and those data points became more orange than blue. But overall the situation didn't change much, so we are really close to convergence. On the next sub-step of the second iteration we estimate the parameters: the Gaussians become a little bit more certain about their locations, and their variances shrink. And finally, one more iteration: look carefully at the colors of the data points, they shift just a little bit more, and again we re-estimate the Gaussians. We have converged to a kind of solution of our original Gaussian Mixture Model fitting problem.
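As a complement to the E-step snippet above, here is a hedged sketch of the parameter re-estimation (the M-step we just ran a few times by hand). It assumes the x and resp arrays from that snippet, and the function name m_step is my own choice.

```python
import numpy as np

def m_step(x, resp):
    """Re-fit the two Gaussians using the soft assignments (responsibilities)."""
    nk = resp.sum(axis=0)                      # effective number of points per Gaussian
    pi = nk / len(x)                           # new mixture weights
    mu = (resp * x[:, None]).sum(axis=0) / nk  # responsibility-weighted means
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk  # weighted variances
    return pi, mu, np.sqrt(var)

# pi, mu, sigma = m_step(x, resp)
# The blue mean shifts toward the points it was (softly) assigned, exactly as in the demo.
```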
So this was the Expectation Maximisation algorithm for a Gaussian Mixture Model on this one-dimensional data. Note, however, that this algorithm doesn't give you an optimal solution: it doesn't give you the global maximum of your likelihood. Actually, it turns out that the problem of finding the global maximum of this likelihood is NP-hard, so you can't expect anyone to give you an optimal solution in reasonable time. But, well, it gives you something reasonable, right? Sometimes, though, this Expectation Maximisation suffers a lot from local maxima. We can see an example here. Let's initialize our Gaussians somewhere else, not where we initialized them in the first demo. Say we initialize both Gaussians on the left-hand side of the plot. On the first sub-step of the first iteration we estimate the sources, the distribution over the latent variables, and it looks like this: all the points to the right of about height 165 will be certainly orange, because the orange Gaussian explains them better, and the others will be certainly blue. It's not a very good assignment, but it's not the final solution yet, right? It is just the first sub-step of the first iteration, so it shouldn't hurt us too much. But it turns out that because of this mistake, because we assigned so many points to be orange, on the next sub-step we have to estimate the variance of the orange Gaussian to be high, like this, just because it has so many points on both the right-hand side and the left-hand side. And now we have a problem. At the start of iteration two, look at the leftmost point: it becomes a little bit orange, because now it is explained by the orange Gaussian about as well as by the blue Gaussian, since the orange Gaussian is high in this region. This is a disaster, because when we re-fit the Gaussians, the orange Gaussian will think that it not only has lots of orange points on the left and on the right, but that it also owns this one leftmost point. Well, not fully, just a little chunk of it, but it still contributes to the variance estimate, and the variance estimate of the orange Gaussian will be really high.
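This failure mode is easy to reproduce once the two sub-steps are wrapped in a loop. Below is a rough end-to-end sketch that combines the E- and M-step snippets above and also returns the training log-likelihood, so different initializations can be compared. The function name run_em, the iteration cap, and the tolerance are my assumptions, and there are no safeguards against degenerate components.

```python
import numpy as np
from scipy.stats import norm

def run_em(x, mu, sigma, pi, n_iter=50, tol=1e-6):
    """Alternate E- and M-steps until the log-likelihood stops improving."""
    prev_ll = -np.inf
    ll = prev_ll
    for _ in range(n_iter):
        # E-step: densities and responsibilities under the current parameters
        dens = np.stack([norm.pdf(x, mu[k], sigma[k]) for k in range(2)], axis=1)
        weighted = pi * dens
        ll = np.log(weighted.sum(axis=1)).sum()  # training log-likelihood
        if ll - prev_ll < tol:                   # converged, possibly to a local maximum
            break
        prev_ll = ll
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted re-estimation of the parameters
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma, ll
```

Starting this loop with both means on the far left of the data, as in the second demo, should end with a noticeably lower log-likelihood than the initialization from the first demo.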
The orange Gaussian ends up thinking it owns all the points on this plot, which makes its variance estimate stay high, while the blue Gaussian's variance estimate shrinks a little bit because it becomes more certain in its estimate, and we are stuck in this kind of situation. We can do a few more iterations, and we converge to this local maximum, which is certainly not optimal. The bottom line here is that Expectation Maximisation suffers from local maxima. As usual in optimization, the way to deal with this is to start several times from different random initializations and choose the best run. The nice thing here is that you can track the log-likelihood, and you can pick the best run according to its training log-likelihood, so at least you can compare the different runs and choose the best one.

Summary: the Gaussian Mixture Model is a flexible probabilistic model which you can fit to your data. It allows you to solve the clustering problem, and it also gives you a probabilistic model of your data, so you can, for example, sample new data points from it. The Expectation Maximisation algorithm is a way to train this Gaussian Mixture Model, and other latent variable models too, as we will see in the next few videos; but for now it is just for the Gaussian Mixture Model. Sometimes it can be faster than Stochastic Gradient Descent, and it also helps you handle complicated constraints, like keeping the covariance matrix positive semi-definite on each iteration. And finally, Expectation Maximisation suffers from local maxima. But, you know, that is kind of expected, because the overall problem is NP-hard, and Stochastic Gradient Descent would suffer from it as well: finding the optimal solution in reasonable time is simply not possible.
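As a small code sketch of the restart strategy mentioned above: it reuses the hypothetical run_em function from the earlier snippet, and the number of restarts and the initialization ranges are assumptions.

```python
import numpy as np

def fit_best_gmm(x, n_restarts=10, seed=0):
    """Run EM from several random initializations; keep the best log-likelihood."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        mu = rng.uniform(x.min(), x.max(), size=2)  # random means inside the data range
        sigma = np.full(2, x.std())                 # broad initial spread
        pi = np.array([0.5, 0.5])
        result = run_em(x, mu, sigma, pi)           # (pi, mu, sigma, log_likelihood)
        if best is None or result[-1] > best[-1]:   # compare training log-likelihoods
            best = result
    return best
```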