[MUSIC] So we've talked about the k-means clustering algorithm, and then a smart initialization for the algorithm. But now let's turn to assessment: how do we measure the quality of a clustering, and what's a method for choosing the number of clusters?

As an example, here are two of the clusterings that we found based on two different initializations of our k-means algorithm. And the question is, which one do I prefer? Which one is better?

Well, to assess this, we can return to the k-means objective: what are we trying to minimize here? Well, k-means is trying to minimize the sum of the squared distances between our observations and the cluster centers. So we're minimizing over the set of all cluster indicators for our observations x_i, and the set of all cluster centers mu_j, that is, over all possible assignments of observations to clusters and all possible placements of those cluster centers, the sum of these squared distances. Here, the inner sum is the sum of squared distances within cluster j, and the outer sum is over all clusters.

So we can think about using this as a measure of quality for a given clustering. In particular, if the sum of the squared distances is small, that's going to represent a better clustering according to the k-means objective than one where it's large. To see this, here we have all these small distances in the pink, or fuchsia, cluster, and all these little distances in the green cluster. But when we look at the blue cluster, we have some small distances, but a bunch of really big distances. So when we sum the squares of all these distances, we get a very large number. Whereas here, in each cluster, we're summing over smaller numbers on average.

So we're going to say that this clustering here is more heterogeneous, meaning that there are more dissimilar objects within a given cluster. Whereas here, we end up with tighter clusters, where things are more homogeneous. So we're going to refer to this metric as cluster heterogeneity.
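To make the objective concrete: for each cluster j with center mu_j, we sum the squared distances ||x_i - mu_j||^2 over the observations x_i assigned to it, then sum across clusters. Here is a minimal sketch of that computation in Python with NumPy; the function and variable names are illustrative, not from the lecture:

```python
import numpy as np

def heterogeneity(data, centers, assignments):
    """K-means objective: sum over clusters of the squared distances
    from each observation to its assigned cluster center.

    data:        (n, d) array of observations x_i
    centers:     (k, d) array of cluster centers mu_j
    assignments: length-n array of cluster indices in {0, ..., k-1}
    """
    total = 0.0
    for j in range(len(centers)):
        members = data[assignments == j]   # observations assigned to cluster j
        if len(members) > 0:
            total += np.sum((members - centers[j]) ** 2)
    return total
```

A clustering with a smaller value of this sum is the better one according to the k-means objective.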
So, what's the sum of the heterogeneities over all of our different clusters? We want this sum to be small, as in examples like this one where we have these tight, nice clusters. And just to be clear, even though we're calling this a measure of quality, lower is better. We want smaller sums of squared distances.

Okay, so let's just spend some time thinking about what happens to this heterogeneity metric as k, the number of clusters, increases. Well, when there are more clusters available to the algorithm, we can refine the clusters more and more to our observed data. And so this can lead to our classic notion of overfitting, though not quite classic, because now we're in an unsupervised setting. But it's a similar kind of idea, where the model is very, very specific to the observed data in a way that might not be descriptive in general.

So as an extreme case of this, imagine that we set the number of clusters equal to the number of observed data points. Then what we can do is, for each one of our observations, define that observation to be a cluster center. So what's the heterogeneity of that clustering? It's 0. And why is it 0? Because we're looking at the sum, over all of our clusters, of the squared distance from each observation to its cluster center, and that distance is 0: each cluster center is the observation itself. A quick check of this extreme case appears in the snippet below.

So, in general, the lowest possible heterogeneity that we can achieve decreases as we increase k. But I want to emphasize that for any given run of our k-means algorithm, let's say we run it with k + 1 clusters, the heterogeneity of that converged solution could actually be larger than the heterogeneity of a converged solution where we ran the algorithm with just k clusters. But when we're thinking over the set of possible heterogeneities, the trend is that it decreases with k.

So a question is, how do we think about choosing the number of clusters? Well, we said that the lowest possible heterogeneity we can achieve decreases as we increase the number of clusters.
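In this hypothetical snippet (the data is made up purely for illustration), every observation is its own cluster center, so every point-to-center distance, and hence the whole objective, is exactly 0:

```python
import numpy as np

data = np.random.randn(10, 2)       # 10 toy observations in 2 dimensions
centers = data.copy()               # k = n: each observation is its own center
assignments = np.arange(len(data))  # observation i is assigned to cluster i

# Squared distance from each point to its assigned center: all zeros.
sq_dists = np.sum((data - centers[assignments]) ** 2, axis=1)
print(sq_dists.sum())               # 0.0
```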
And we said lower heterogeneity is better, but then we also said, well, we don't want too many clusters, because that won't really describe groupings that are very natural. We'd just over-partition the data into these little groups and get very low heterogeneity, but not a good description, or not a good structure extracted from the data. So somehow we need to trade off between these two things.

And so what we can do is think about a heuristic, where we're just going to look at something called the elbow of the curve. It's a little bit unclear whether the elbow is here or here, so I'll just say, choose one of these values of k. Maybe I would prefer this value of k, because the difference in heterogeneity relative to values much, much larger is not very big, but it should provide a more parsimonious description of my data than using that much, much larger k. But this is just a heuristic.

So, in general, there's really no right answer for what's the right value of k, and you're going to explore this in depth in the assignment associated with this module.

There's one last thing I wanted to mention here, which is that so far this plot is a plot of the lowest possible heterogeneity. Of course, we're not going to compute all possible clusterings for every value of k to define this curve. So in practice, what people do is multiple random restarts of the algorithm, whether we're using k-means++ or random initializations, or a mix of those. Do a bunch of different restarts for every value of k, look at the clustering that provides the lowest measure of heterogeneity, plot that as a function of k, and then choose k based on that.

But again, that itself is very computationally intensive, so in practice people do a wide range of things. But it's worth mentioning that how you choose k really, really matters for what the resulting clusters represent. If you choose k = 2, you're just dividing your data into two groups. If there really are inherently, very clearly separated clusters beyond that, you're not going to capture them.
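As a sketch of the restarts-plus-elbow procedure just described, assuming scikit-learn and matplotlib are available (the transcript doesn't name any library), one could run several k-means++ restarts per value of k and plot the best heterogeneity found; scikit-learn's inertia_ attribute is exactly this within-cluster sum of squared distances:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = np.random.randn(200, 2)  # stand-in for your dataset

ks = range(1, 11)
best_heterogeneity = []
for k in ks:
    # n_init=10 restarts with k-means++ initialization; .inertia_ holds
    # the heterogeneity of the best converged solution among the restarts.
    km = KMeans(n_clusters=k, n_init=10).fit(data)
    best_heterogeneity.append(km.inertia_)

plt.plot(list(ks), best_heterogeneity, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('lowest heterogeneity found')
plt.show()  # choose k near the elbow, where the curve starts to flatten
```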
But if you choose a million clusters, you're going to get very, very finely tuned clusters (obviously depending on the size of your data; I'm trying to give an example that's very large, so bear with me). But the point is that you really need to look at the data and think about what these things mean in your application. And when it's hard to look at the data, you can start doing things like the heuristic we're talking about here, where you select the elbow of this curve. But again, you'll get a lot of hands-on practice with this in the assignment. [MUSIC]