To wrap up our discussion on mixture models, I want to discuss how we can think about applying this mixture model to our document clustering task. This is something else that you're going to explore in the assignment. And remember, our goal is to discover groups of related articles, but when introducing our mixture model, what we're aiming to do is account for uncertainty in the assignments of data points to these clusters, like the article that lies between the blue and orange clusters. For our document representation, we're going to use our standard tf-idf vector. So it's going to be a vector with the number of elements equal to the size of the vocabulary. Based on this document representation, we can think of embedding our documents in R^V, the real space with V dimensions, where V is the size of the vocabulary. In this picture here, we're just showing two dimensions, of course, so it's as if we had a vocabulary of only two words. Then, based on this representation, our goal is to use mixtures of Gaussians to discover clusters that might look something like the following. And again, our mixture of Gaussians is going to be able to provide soft assignments of data points to these clusters. In the next section, we're going to describe algorithmically how we provide these soft assignments.

Let's think a little bit more about applying just a vanilla mixture of Gaussians to this application, in particular to this high-dimensional document representation. In two dimensions, when we talked about the parameters of the Gaussian, we said there was a mean vector with two parameters that we're going to need to learn, which centers the distribution. But then there is the covariance matrix, and in two dimensions it has four entries: the variance terms, giving the spread along each of our two dimensions, and the off-diagonal terms, which capture the correlation structure. And we very quickly mentioned that it's a symmetric matrix, so it turns out the off-diagonal terms, sigma one-two and sigma two-one, are equal. So really there are three unique parameters that we need to specify for this Gaussian in two dimensions, or three parameters that we're going to need to learn from the data.
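As a rough companion to the representation just described, here is a minimal sketch of building tf-idf vectors and counting the parameters of a single full-covariance Gaussian. It assumes scikit-learn's TfidfVectorizer and a made-up toy corpus; neither is part of the lecture itself.

```python
# A minimal sketch of the tf-idf document representation described above.
# Assumes scikit-learn; the toy corpus is purely illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "soccer player scores late goal",
    "senate passes new budget bill",
    "team wins championship game",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # shape: (num_documents, V)

V = X.shape[1]                         # V = vocabulary size
print("Each document is a point in R^V with V =", V)

# For one full-covariance Gaussian in V dimensions (as discussed above):
#   mean vector: V parameters
#   symmetric covariance matrix: V * (V + 1) / 2 unique parameters
print("Unique covariance parameters per cluster:", V * (V + 1) // 2)
```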
But when we're looking at V different dimensions, and imagine V is really large, if we allowed for a full covariance matrix, we would have V(V+1)/2 unique parameters to learn. That's a lot, on the order of V squared parameters. The way we get V(V+1)/2 instead of V squared is from the fact that the matrix is symmetric, so it's not quite V squared parameters.

Okay, so instead of specifying these full covariance matrices, what we're going to do is assume a diagonal form for the covariance matrix. So there's going to be a variance term for each one of our dimensions, one to capital V, and all of the off-diagonal terms are going to be fixed to 0. We're just going to be learning the variances along the different dimensions. What this means is that, instead of having an arbitrarily oriented ellipse, we're going to constrain ourselves to axis-aligned ellipses.

This represents a restrictive assumption, but I want to mention a few things. The first is that we can actually learn the weights on the different dimensions, and these can be different between the different clusters. So, in one cluster, we might learn that one word is more important than another word in forming the score of an observation under that cluster, and that could be different in the other clusters.

So this represents a formulation that's still a lot more flexible than what we had in k-means, because remember, in k-means, the typical formulation would just assume spherically symmetric clusters. That's a special case of axis-aligned ellipses where the variance terms are all the same: they're all sigma squared along the diagonal. And we said we could also do a weighted version of Euclidean distance, but that would give us axis-aligned ellipses that are exactly the same between the different clusters. Furthermore, in the case of k-means, we had to specify those weights, whereas in the next section we're going to show that we can actually learn the weights of this mixture model, that is, the weights on these different dimensions, which are cluster specific.
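To make the diagonal-covariance restriction concrete, here is a hedged sketch that compares the parameter counts and fits a diagonal-covariance mixture with scikit-learn's GaussianMixture. The assignment may implement the algorithm directly, so treat this library call, the synthetic data, and the specific sizes as illustrative assumptions rather than the course's actual approach.

```python
# Sketch: parameter savings of a diagonal covariance, plus fitting a
# diagonal-covariance mixture with scikit-learn. Data and sizes are made up.
import numpy as np
from sklearn.mixture import GaussianMixture

V = 10000                                    # illustrative vocabulary size
full_per_cluster = V * (V + 1) // 2          # O(V^2): ~50 million parameters
diag_per_cluster = V                         # one variance per dimension
print(full_per_cluster, "vs", diag_per_cluster)

# Fit on small synthetic data standing in for dense tf-idf vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))

gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0)
gmm.fit(X)

# Each cluster has its own per-dimension variances (axis-aligned ellipses),
# and they can differ across clusters, unlike spherical k-means.
print(gmm.covariances_.shape)                # (3, 20): clusters x dimensions
```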
And in addition to learning the parameters of these different Gaussians from the data, we're also going to learn the soft assignments, so the probability that any observation is associated with any of the different clusters.
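As a preview of the soft assignments mentioned here, the sketch below computes responsibilities by hand for a tiny, hypothetical two-cluster, diagonal-covariance mixture. All of the weights, means, and variances are invented for illustration; the next section describes the algorithm that actually learns them from data.

```python
# Sketch: soft assignments (responsibilities) for a hypothetical 2-cluster,
# 2-dimensional diagonal-covariance mixture. All numbers are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

weights = np.array([0.6, 0.4])                   # mixture weights pi_k
means = np.array([[0.0, 0.0], [3.0, 3.0]])       # cluster means mu_k
variances = np.array([[1.0, 1.0], [0.5, 2.0]])   # per-dimension variances

x = np.array([1.5, 1.5])                         # a point between the clusters

# responsibility_k is proportional to pi_k * N(x | mu_k, diag(variances_k))
likelihoods = np.array([
    multivariate_normal.pdf(x, mean=means[k], cov=np.diag(variances[k]))
    for k in range(2)
])
responsibilities = weights * likelihoods
responsibilities /= responsibilities.sum()

print(responsibilities)   # probabilities over clusters for x; they sum to 1
```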