K-means might seem like a heuristic algorithm, but you can actually view it as an example of a coordinate descent algorithm. In particular, k-means iterates between two steps: one is assigning observations to the nearest cluster center, and the other is updating the cluster centers.

For the first step, where we assign observations to the nearest cluster center, we can clearly see from the equation that it's equivalent to minimizing the given objective. For the second step, it's a little less clear, but we can actually rewrite it, too, as the minimization of an objective: in particular, minimizing the sum of squared distances. Remember that computing the mean of a set of observations is equivalent to computing its center of mass, and that is exactly what minimizing the sum of squared distances gives us.

So now what we see is that we're iterating between two different steps, where each step represents a minimization of some objective. And from this we can see that k-means is an example of a coordinate descent algorithm. If the notion of a coordinate descent algorithm doesn't sound familiar, that's probably because you didn't take the regression course. That's really, really sad. Go back, take the regression course. No, you don't have to take the regression course, but if you're not familiar with coordinate descent, you can refer to the videos in that course for a more detailed explanation of these types of algorithms. The idea here, in particular, is that we're iterating between minimizing over our z variables given our mu's, that is, minimizing the objective over the cluster indicators given the cluster centers, and then minimizing to find our cluster centers given the assigned cluster indicators.
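To make the coordinate descent view concrete, here is a minimal sketch of k-means in Python with NumPy. Names such as `kmeans` and `max_iters` are our own for illustration; this is a sketch of the two alternating minimizations, not the course's reference implementation.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means sketch, written as coordinate descent:
    alternate exact minimizations over z (assignments) and mu (centers)."""
    rng = np.random.default_rng(seed)
    # Initialize the centers mu by picking k distinct data points at random.
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    z = np.full(len(X), -1)
    for _ in range(max_iters):
        # Step 1: minimize over z given mu -- assign each observation to
        # the nearest center under squared Euclidean distance.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z_new = dists.argmin(axis=1)
        if np.array_equal(z_new, z):
            break  # assignments stopped changing, so we have converged
        z = z_new
        # Step 2: minimize over mu given z -- the mean (center of mass) of
        # each cluster minimizes that cluster's sum of squared distances.
        for j in range(k):
            members = X[z == j]
            if len(members) > 0:
                mu[j] = members.mean(axis=0)
    return z, mu
```

Each step can only decrease the same objective or leave it unchanged, and there are only finitely many possible assignments, which is why the alternation terminates, though, as discussed next, only at a local optimum.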
Okay, so now let's get back to this idea of convergence of k-means. First, does this algorithm converge, and if it does, what does it converge to? Well, the first thing to note is that the k-means objective landscape is this really complex, non-convex thing. So because of that, we can rule out converging to a global optimum, or at least being guaranteed to converge to a global optimum. But the question now is: do we converge at least to a local optimum? And because we can cast k-means as a coordinate descent algorithm, we know that we are indeed converging to a local optimum. But in a lot of problems, like I said, there's a really complicated landscape to this k-means objective, so there are lots and lots of local modes, and the quality of the solutions at these local modes can vary quite dramatically.

So let's look at an example of this. Here's a data set where the black points are data points, and the colored crosses are our initialized centers. So here is one initialization of our k-means algorithm. And what we're showing in the other plot with the colored circles are the cluster labels after convergence of the algorithm, where one way to assess convergence is just to check when our assignments stop changing. What we see is that we've taken the little group of observations in the bottom left-hand corner and divided it into two clusters, and then we have one big blue cluster capturing everything else in the data set.

But if we initialize our algorithm with different cluster centers, then we converge to a completely different solution. This one might look a bit more reasonable, depending on what the structure of the data might be: we end up with three clusters, each of which forms a tighter group. But let's look at another initialization where the groupings actually look similar. If we flip back and forth between these two initializations, the groupings look similar, but the cluster labels associated with each of them have swapped around: the pink cluster has become the blue, and the blue has become the pink. Well, there's nothing special about any particular cluster label. Across different initializations, which observations get assigned to cluster one, cluster two, or cluster three doesn't matter; what is actually significant is which observations are grouped together. So what matters is whether observations x_i and x_j are assigned to the same group, not whether that group is labeled group one, group two, or group three.
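One way to make this label-invariance concrete in code is to compare two labelings through their pairwise co-assignment matrices rather than through the raw labels. This small helper is ours for illustration; the name `same_partition` does not come from the course.

```python
import numpy as np

def same_partition(z_a, z_b):
    # Two labelings describe the same clustering exactly when every pair of
    # observations is either grouped together in both or separated in both,
    # regardless of which integer label each group happens to receive.
    return np.array_equal(z_a[:, None] == z_a[None, :],
                          z_b[:, None] == z_b[None, :])

z1 = np.array([0, 0, 1, 1, 2])
z2 = np.array([1, 1, 0, 0, 2])  # same groups, labels 0 and 1 swapped
z3 = np.array([0, 0, 0, 1, 2])  # a genuinely different grouping
print(same_partition(z1, z2))   # True
print(same_partition(z1, z3))   # False
```

The co-assignment matrix costs quadratic memory in the number of observations, so this is fine for checking small examples but not for very large data sets.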
Okay, but the other thing to note here are these observations highlighted with the orange circle, where there's really a lot of uncertainty about whether they should be in the blue cluster or the pink cluster. And if you look back at our other solution, flipping back and forth between the two, we see that the assignment of those observations has changed: there's no longer consistency in which grouping those observations appear. In one case they appear in the second cluster, the middle cluster, and in the other case they appear in that top right-hand cluster. And this notion of uncertainty in our clustering is really important, and it's something we're going to come back to in the next module.

But the point that we want to emphasize here is that k-means is very sensitive to initialization, and you can actually get very crazy solutions, even crazier than what we've shown in these few examples, just by converging to a local mode.
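A common practical response to this sensitivity, sketched below, is to run k-means from many random initializations and keep the solution with the lowest objective. This reuses the hypothetical `kmeans` function from the earlier sketch, and the data set here is made up purely for illustration.

```python
import numpy as np

def kmeans_objective(X, z, mu):
    # Sum of squared distances from each observation to its assigned
    # center: the quantity both coordinate descent steps drive down.
    return ((X - mu[z]) ** 2).sum()

# Three well-separated blobs (synthetic, for illustration only).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [5, 0])])

# Restart from several seeds; different local modes can give
# noticeably different objective values.
results = [kmeans(X, k=3, seed=s) for s in range(10)]
scores = [kmeans_objective(X, z, mu) for z, mu in results]
print(sorted(scores))  # the spread across restarts reveals local modes
best_z, best_mu = results[int(np.argmin(scores))]
```

Multiple restarts make bad local modes less likely but still give no global guarantee; smarter initialization schemes such as k-means++ push further in the same direction.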