So, we've discussed sensitivity to initialization. Now let's turn to a method that provides a smarter and more reliable way of initializing our algorithm, and this is referred to as k-means++.

How we initialize our k-means algorithm is really, really critical to the quality of the solution found. One way to think about a smart initialization is as follows. The first step is to select the first cluster center just randomly amongst all our data points, defining it to be one of those data points chosen at random. The second step is, for each one of our observations, to compute the distance to the nearest cluster center. On the first pass there's only one cluster center, so it's just the distance to that cluster center. But if there are multiple cluster centers as we go through this initialization scheme, then we maintain the distance to the closest cluster center. Then, when we go to sample our next cluster center, we choose it with probability proportional to those distances squared. So we bias toward selecting an observation that's far from the existing cluster centers. And then we repeat these steps until we've sampled k different cluster centers.

So let's visualize what this means. Again, here are the data points that we wish to cluster, the same example that we've been using. The first step of our k-means++ algorithm is just to select one of these data points at random and call it the first cluster center. The next step of the algorithm is to compute distances to this cluster center. Then, based on the square of these distances, we sample a new cluster center. So remember that we're going to be more likely to select a data point as a cluster center if that data point is far away, and the fact that the algorithm looks at the squared distances makes that even more exaggerated. So we're very likely to sample one of the data points that's far from this cluster center as the next cluster center. And so maybe we select this observation as the next cluster center; that's going to be our green cluster.
Now we repeat this process, where we compute the distance to the nearest cluster center. So maybe these are the distances that we compute. It gets a little bit hard to tell, I think, whether this is a smaller distance than that one, perhaps not, but you get the point. Then, based on this, we again compute the square of each of these distances and sample a point. And this one here is really far away from its nearest cluster center, so it's very likely to get selected, and indeed, in this example, we select it. So the initialization of the three cluster centers for our k-means algorithm is going to be at these locations.

Running k-means++ to initialize our k-means algorithm is definitely more computationally costly than just randomly selecting a set of cluster centers. But the subsequent k-means algorithm that we run often converges more rapidly. And so k-means++ tends to improve the quality of the local optimum found and, in addition, to reduce the runtime, because it converges more rapidly to that high-quality solution. So in practice, running k-means++ tends to be a really, really helpful thing to do.
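To make the sampling step concrete, here is a minimal sketch of the distance-squared sampling initialization described above. It is only an illustration under simple assumptions (NumPy arrays and Euclidean distance), not the course's reference implementation; the function name kmeans_plus_plus_init and its arguments are hypothetical.

    import numpy as np

    def kmeans_plus_plus_init(X, k, seed=None):
        # Illustrative sketch of k-means++ initialization (D^2 sampling).
        rng = np.random.default_rng(seed)
        n = X.shape[0]

        # Step 1: pick the first center uniformly at random among the data points.
        centers = [X[rng.integers(n)]]

        while len(centers) < k:
            # Step 2: for each observation, compute the distance to its
            # nearest existing cluster center.
            diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
            dists = np.min(np.linalg.norm(diffs, axis=2), axis=1)

            # Step 3: sample the next center with probability proportional to
            # the squared distance, biasing toward points far from all centers.
            probs = dists**2 / np.sum(dists**2)
            centers.append(X[rng.choice(n, p=probs)])

        # Step 4: the loop repeats until k centers have been sampled.
        return np.asarray(centers)

The returned centers would then seed the usual k-means iterations. In scikit-learn, for example, KMeans(n_clusters=k, init="k-means++") applies this initialization scheme, and it is the default setting.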