Let's now spend some time getting into the nitty-gritty details of EM, and also provide a comparison between EM for mixtures of Gaussians and k-means.

Well, to begin with, we talked about the converged solution of EM in the series of pie-chart plots that we just showed. The question is, do we know that the EM algorithm is going to converge? Well, you can actually recast the E and M steps of the EM algorithm as alternating maximization of an objective, and from this we see that EM is actually a coordinate-ascent algorithm. And under mild conditions, you can prove that EM indeed converges to a local mode of the objective.

And in your assignment, you're going to monitor convergence of the EM algorithm by computing the log likelihood of the data given the current estimates of our model parameters and responsibilities.

Well, based on the nonconvex landscape and convergence to a local mode, how we think of initializing the EM algorithm can be really critical to the quality of the local mode that we find, as well as to the convergence rate of our algorithm. And there are many ways we can think about initializing. One is we can think about just randomly selecting some data points as our initial cluster centers, doing hard assignments of observations to the nearest cluster center, and then using those assignments to form our initial parameter estimates; our EM algorithm would then iterate from there. Or, another alternative would be to pick the cluster centers sequentially, like how we described in our k-means++ algorithm, instead of just randomly choosing K different data points all at once. Or, we could think about actually running k-means and using the final solution of that as our cluster centers for this initialization scheme. And there are also other approaches. For example, in speech applications, it's really popular to gradually grow the mixture model, by splitting and sometimes merging clusters, until you have K clusters formed.
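Before moving on, here is a rough sketch of what monitoring the log likelihood for convergence might look like in code. This is a hand-rolled NumPy illustration, not the assignment's implementation; `e_step`, `m_step`, `tol`, and `max_iter` are hypothetical placeholders for pieces you would supply yourself.

```python
import numpy as np

def log_likelihood(X, weights, means, covs):
    """Mixture log likelihood: sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k)."""
    n, d = X.shape
    log_probs = np.zeros((n, len(weights)))
    for k, (pi_k, mu_k, cov_k) in enumerate(zip(weights, means, covs)):
        diff = X - mu_k
        _, logdet = np.linalg.slogdet(cov_k)
        quad = np.sum(diff @ np.linalg.inv(cov_k) * diff, axis=1)
        log_probs[:, k] = np.log(pi_k) - 0.5 * (d * np.log(2 * np.pi) + logdet + quad)
    # log-sum-exp over components for each data point, then sum over data points
    m = log_probs.max(axis=1, keepdims=True)
    return np.sum(m.squeeze() + np.log(np.sum(np.exp(log_probs - m), axis=1)))

# Convergence check inside the EM loop (e_step/m_step are placeholders):
# ll_old = -np.inf
# for it in range(max_iter):
#     resp = e_step(X, weights, means, covs)
#     weights, means, covs = m_step(X, resp)
#     ll_new = log_likelihood(X, weights, means, covs)
#     if ll_new - ll_old < tol:   # EM increases this quantity monotonically
#         break
#     ll_old = ll_new
```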
Having discussed convergence and initialization, let's now spend a little bit of time discussing a potential pitfall of the EM algorithm that we've described. And this all stems from the fact that maximum likelihood estimation can overfit the data.

So as an example of this, let's imagine that we just have two clusters: the green cluster and the blue cluster. And let's assume that at some iteration of our algorithm, only one observation is assigned to the green cluster and all the other observations are assigned to the blue cluster. When I'm speaking now, I'm talking in terms of hard assignments, but the same thing that I'm going to talk about holds if we're using soft assignments instead.

For this single observation in the green cluster, what's the thing that's going to maximize the likelihood of the data in that cluster? Well, for the green cluster, we're simply going to set the mean of the cluster exactly equal to that data point, and then we're going to shrink the variance all the way to zero, right around that data point. And for the blue cluster in this example that I've shown here, we're going to shift its mean over and increase its variance to capture all the data points that were shown.

Okay, so this doesn't seem like a great solution here, right? Because we have this one data point with this tiny, tiny, infinitesimally small cluster tied right to that data point's value, and then these two sets of observations, which to me looked pretty separable, all lumped together in one much more diffuse cluster. But the likelihood of this assignment is actually infinite, all driven by the fact that the likelihood of the green observation, under a Gaussian that's absolutely certain that the only value it can take is the value that was observed, is infinite. And so that's going to completely dominate the overall likelihood of this set of assignments, and we're going to see that this is actually the preferred solution.

Luckily, in practice you almost never get into assignments in this very simplistic case of having just that one green observation in that cluster, but this is something that really can happen, and you have to think carefully about it, especially when you're thinking about ways to initialize your algorithm.
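To see numerically how that single-point cluster drives the likelihood off to infinity, here is a small illustrative snippet in one dimension. The observation value and the variance grid are made up purely for illustration.

```python
import numpy as np

x_green = 2.7    # the lone observation in the green cluster (made-up value)
mu = x_green     # the maximum likelihood mean is the point itself

for sigma in [1.0, 0.1, 0.01, 1e-4, 1e-8]:
    # Gaussian log density of the point under its own degenerate cluster
    log_lik = -0.5 * np.log(2 * np.pi * sigma**2) - (x_green - mu)**2 / (2 * sigma**2)
    print(f"sigma = {sigma:>8}: log likelihood = {log_lik:.2f}")

# As sigma -> 0 the log likelihood grows without bound, so plain maximum
# likelihood "prefers" this degenerate, zero-variance cluster.
```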
But the problem can actually be much, much worse in high dimensions. To think about this, let's go back to our document clustering example. So here we're going to assume that we have some large vocabulary, and every document is represented by maybe just a word-count vector, or a tf-idf vector, over that vocabulary.

And let's imagine that for a given cluster, only one document in that cluster, again, I'm talking in terms of hard assignments, but things generalize, only one document in that cluster has the word w in it, or, more generally, all of the documents in that cluster happen to have the same count of a specific word w. Then what's the maximum likelihood estimate of the parameters in that cluster for that word w? Just looking at that dimension, well, we simply set the mean equal to that observed count, either from that one document or from all these documents that have the same count, and then we shrink the variance to zero about that value.

So this isn't actually that contrived of an example, because if we're in a very high-dimensional space, so a very large vocabulary, we really could imagine cases where, especially if we have enough clusters to partition up our data, all the documents in a given cluster might not have a word at all, or there might be just one document that has the word.

So this kind of setup really, really can happen in high dimensions, and what ends up happening as a result is that, in later iterations, we go to compute the likelihood of some other document that has word w with some other count than what this cluster has specified. Then it's going to have zero likelihood, or zero responsibility, under this given cluster, because this cluster says: nope, you can only have this value that I've observed in my documents so far. And so what we're going to see is that our responsibilities are going to shoot to either zero, if a document doesn't agree on just this one dimension, this one vocabulary word, or one, if we're in that cluster and we have agreement on that dimension.

And I want to emphasize that the way you really end up seeing this in practice is as an underflow issue. You're going to get covariance matrices whose inverses you can't compute, even when you're only shrinking the variances very, very small and not setting them exactly equal to zero. So this can be just a practical issue where your code is going to break if you don't correct for this potential issue.
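Here is a small sketch of how this can show up numerically, assuming a diagonal covariance over word counts. All the numbers, including the collapsed variance, are made up for illustration.

```python
import numpy as np

# Cluster parameters fit from documents that all had count 3 for word w
mu = np.array([3.0, 1.0, 4.0])       # means over three vocabulary words
var = np.array([1e-300, 0.5, 2.0])   # variance for word w has collapsed toward zero

x_new = np.array([2.0, 1.0, 4.0])    # a new document with a different count for word w

# Diagonal-Gaussian log density: the near-zero-variance dimension dominates
log_lik = np.sum(-0.5 * np.log(2 * np.pi * var) - (x_new - mu)**2 / (2 * var))
print(log_lik)          # hugely negative
print(np.exp(log_lik))  # underflows to 0.0: this cluster takes zero responsibility

# With a full covariance, the same collapse makes the matrix numerically singular:
cov = np.diag(var)
print(np.linalg.cond(cov))  # enormous condition number; inverting it is unstable
```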
And there is a very simple fix that we can consider, which we're going to call either smoothing the parameter estimates or regularizing our EM updates. This simple fix is just to not let our variance parameters go to zero. And to do this, every time we go to estimate our covariance, we're simply going to add a very small diagonal term to that covariance matrix to ensure that none of the variances are zero.

As an alternative, you can take a more formal Bayesian approach and place priors over the model parameters: the cluster proportions, the means, and the covariances of the mixture model. And what you can view this as is placing what are called pseudo-observations in each one of the clusters. So it's as if you've fixed some set of observations with some values in every cluster, and that's going to bias our maximum likelihood estimates to account for those observations. And because of that structure, it's going to avoid cases where you degenerate by not having any actual observations of a certain value.

So we can think of this as smoothing our parameter estimates, but in this Bayesian approach it would be smoothing the cluster proportions and the means and the covariances, rather than the simple approach that we've specified here, where we're just smoothing our covariance estimate.

And you're going to see this type of smoothing when you go to do EM for document clustering in the assignment.
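As a concrete sketch of the simple diagonal-regularization fix described above, here is one way the M-step covariance update for a single cluster might look. The epsilon value and the function shape are illustrative, not the assignment's exact update.

```python
import numpy as np

def m_step_covariance(X, resp_k, mu_k, eps=1e-6):
    """Responsibility-weighted covariance for one cluster, with a small
    diagonal term added so that no variance can collapse to exactly zero."""
    n_k = resp_k.sum()                        # effective number of points in cluster k
    diff = X - mu_k                           # (N, d) residuals from the cluster mean
    cov_k = (resp_k[:, None] * diff).T @ diff / n_k
    return cov_k + eps * np.eye(X.shape[1])   # smoothed / regularized estimate
```

The Bayesian pseudo-observation view mentioned above leads to a similar-looking correction, except that the cluster proportions and means get smoothed toward prior values as well, not just the covariances.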