1 00:00:00,170 --> 00:00:01,340 In this video, I'd like 2 00:00:01,450 --> 00:00:03,230 to talk about how to initialize 3 00:00:04,580 --> 00:00:05,970 K-means and more importantly, this will 4 00:00:06,170 --> 00:00:07,240 lead into a discussion of 5 00:00:07,550 --> 00:00:10,210 how to make K-means avoid local optima as well. 6 00:00:10,740 --> 00:00:12,390 Here's the K-means clustering algorithm 7 00:00:12,950 --> 00:00:14,420 that we talked about earlier. 8 00:00:15,760 --> 00:00:16,760 One step that we never really 9 00:00:17,260 --> 00:00:18,350 talked much about was this step 10 00:00:18,820 --> 00:00:21,560 of how you randomly initialize the cluster centroids. 11 00:00:22,390 --> 00:00:23,490 There are few different ways that 12 00:00:23,710 --> 00:00:25,350 one can imagine using to randomly 13 00:00:25,960 --> 00:00:26,860 initialize the cluster centroids. 14 00:00:27,510 --> 00:00:28,580 But, it turns out that 15 00:00:28,720 --> 00:00:29,820 there is one method that is 16 00:00:30,050 --> 00:00:31,700 much more recommended than most 17 00:00:32,080 --> 00:00:33,830 of the other options one might think about. 18 00:00:34,400 --> 00:00:35,250 So, let me tell you about 19 00:00:35,590 --> 00:00:38,160 that option since it's what often seems to work best. 20 00:00:39,550 --> 00:00:42,210 Here's how I usually initialize my cluster centroids. 21 00:00:43,300 --> 00:00:44,710 When running K-means, you should have 22 00:00:45,140 --> 00:00:47,160 the number of cluster centroids, K, 23 00:00:47,430 --> 00:00:48,520 set to be less than the 24 00:00:48,590 --> 00:00:50,090 number of training examples M. It 25 00:00:50,170 --> 00:00:51,210 would be really weird to run 26 00:00:51,430 --> 00:00:52,600 K-means with a number 27 00:00:52,870 --> 00:00:54,270 of cluster centroids that's, you know, 28 00:00:54,520 --> 00:00:55,790 equal or greater than the number of examples you have, right? 29 00:00:58,080 --> 00:00:59,010 So the way I 30 00:00:59,150 --> 00:01:00,510 usually initialize K-means is, 31 00:01:00,770 --> 00:01:02,510 I would randomly pick k training 32 00:01:02,990 --> 00:01:05,170 examples. So, and, what 33 00:01:05,610 --> 00:01:06,730 I do is then set Mu1 34 00:01:06,850 --> 00:01:09,320 of MuK equal to these k examples. 35 00:01:10,610 --> 00:01:11,470 Let me show you a concrete example. 36 00:01:12,560 --> 00:01:14,190 Lets say that k is 37 00:01:14,470 --> 00:01:16,600 equal to 2 and so 38 00:01:17,070 --> 00:01:19,520 on this example on the right let's say I want to find two clusters. 39 00:01:21,170 --> 00:01:22,060 So, what I'm going to 40 00:01:22,200 --> 00:01:23,350 do in order to initialize 41 00:01:23,770 --> 00:01:25,340 my cluster centroids is, I'm 42 00:01:25,470 --> 00:01:27,320 going to randomly pick a couple examples. 43 00:01:27,760 --> 00:01:28,960 And let's say, I pick 44 00:01:29,230 --> 00:01:31,060 this one and I pick that one. 45 00:01:31,230 --> 00:01:32,320 And the way I'm going 46 00:01:32,380 --> 00:01:34,100 to initialize my cluster centroids 47 00:01:34,310 --> 00:01:35,190 is, I'm just going to initialize 48 00:01:36,200 --> 00:01:38,930 my cluster centroids to be right on top of those examples. 49 00:01:39,530 --> 00:01:40,430 So that's my first cluster centroid 50 00:01:41,410 --> 00:01:43,230 and that's my second cluster centroid, and 51 00:01:43,390 --> 00:01:45,770 that's one random initialization of K-means. 52 00:01:48,540 --> 00:01:50,480 The one I drew looks like a particularly good one. 53 00:01:50,890 --> 00:01:51,810 And sometimes I might get less 54 00:01:52,040 --> 00:01:53,370 lucky and maybe I'll end 55 00:01:53,510 --> 00:01:54,900 up picking that as my first 56 00:01:55,330 --> 00:01:58,420 random initial example, and that as my second one. 57 00:01:59,050 --> 00:02:01,380 And here I'm picking two examples because k equals 2. 58 00:02:01,590 --> 00:02:03,590 Some we have randomly picked two 59 00:02:03,890 --> 00:02:05,030 training examples and if 60 00:02:05,100 --> 00:02:06,660 I chose those two then I'll 61 00:02:06,830 --> 00:02:08,040 end up with, may be 62 00:02:08,250 --> 00:02:09,200 this as my first cluster 63 00:02:09,510 --> 00:02:10,980 centroid and that as 64 00:02:11,140 --> 00:02:13,560 my second initial location of the cluster centroid. 65 00:02:14,150 --> 00:02:15,690 So, that's how you can randomly 66 00:02:16,070 --> 00:02:17,560 initialize the cluster centroids. 67 00:02:17,810 --> 00:02:19,670 And so at initialization, your 68 00:02:19,860 --> 00:02:21,110 first cluster centroid Mu1 will 69 00:02:21,270 --> 00:02:23,350 be equal to x(i) for 70 00:02:23,520 --> 00:02:25,870 some randomly value of i and 71 00:02:26,980 --> 00:02:27,660 Mu2 will be equal to x(j) 72 00:02:29,240 --> 00:02:30,980 for some different randomly chosen value 73 00:02:31,380 --> 00:02:32,830 of j and so on, 74 00:02:32,910 --> 00:02:34,440 if you have more clusters and more cluster centroid. 75 00:02:35,680 --> 00:02:37,540 And sort of the side common. 76 00:02:38,110 --> 00:02:39,240 I should say that in the 77 00:02:39,320 --> 00:02:40,840 earlier video where I first 78 00:02:41,150 --> 00:02:43,030 illustrated K-means with the animation. 79 00:02:44,310 --> 00:02:45,070 In that set of slides. 80 00:02:45,900 --> 00:02:46,890 Only for the purpose of illustration. 81 00:02:47,590 --> 00:02:48,690 I actually used a different 82 00:02:49,240 --> 00:02:51,750 method of initialization for my cluster centroids. 83 00:02:52,460 --> 00:02:53,790 But the method described on 84 00:02:53,900 --> 00:02:55,940 this slide, this is really the recommended way. 85 00:02:56,430 --> 00:02:58,850 And the way that you should probably use, when you implement K-means. 86 00:03:00,090 --> 00:03:01,560 So, as they suggested perhaps 87 00:03:02,070 --> 00:03:04,090 by these two illustrations on the right. 88 00:03:04,930 --> 00:03:06,050 You might really guess that K-means 89 00:03:06,530 --> 00:03:08,130 can end up converging to 90 00:03:08,260 --> 00:03:10,150 different solutions depending on 91 00:03:10,860 --> 00:03:12,470 exactly how the clusters 92 00:03:12,990 --> 00:03:15,170 were initialized, and so, depending on the random initialization. 93 00:03:16,280 --> 00:03:18,180 K-means can end up at different solutions. 94 00:03:18,930 --> 00:03:22,560 And, in particular, K-means can actually end up at local optima. 95 00:03:23,650 --> 00:03:24,920 If you're given the data sale like this. 96 00:03:25,400 --> 00:03:26,370 Well, it looks like, you know, there 97 00:03:26,660 --> 00:03:28,340 are three clusters, and so, 98 00:03:28,780 --> 00:03:30,090 if you run K-means and if 99 00:03:30,150 --> 00:03:31,380 it ends up at a good 100 00:03:31,820 --> 00:03:32,910 local optima this might be 101 00:03:33,040 --> 00:03:35,830 really the global optima, you might end up with that cluster ring. 102 00:03:36,820 --> 00:03:38,440 But if you had a particularly 103 00:03:39,110 --> 00:03:41,630 unlucky, random initialization, K-means 104 00:03:42,100 --> 00:03:43,660 can also get stuck at different 105 00:03:44,180 --> 00:03:45,740 local optima. So, in 106 00:03:45,850 --> 00:03:47,330 this example on the left 107 00:03:47,620 --> 00:03:48,700 it looks like this blue cluster has captured 108 00:03:49,470 --> 00:03:51,700 a lot of points of the left and then the they were on the green clusters 109 00:03:52,050 --> 00:03:54,810 each is captioned on the relatively small number of points. 110 00:03:55,020 --> 00:03:56,480 And so, this corresponds to 111 00:03:56,640 --> 00:03:58,470 a bad local optima because it 112 00:03:58,530 --> 00:04:00,060 has basically taken these two 113 00:04:00,470 --> 00:04:01,560 clusters and used them into 114 00:04:01,780 --> 00:04:03,440 1 and furthermore, has 115 00:04:04,150 --> 00:04:06,070 split the second cluster into 116 00:04:06,580 --> 00:04:09,170 two separate sub-clusters like 117 00:04:09,380 --> 00:04:10,270 so, and it has also 118 00:04:10,720 --> 00:04:12,280 taken the second cluster and 119 00:04:12,540 --> 00:04:14,220 split it into two 120 00:04:14,460 --> 00:04:16,630 separate sub-clusters like so, and 121 00:04:16,760 --> 00:04:17,880 so, both of these 122 00:04:18,000 --> 00:04:18,970 examples on the lower 123 00:04:19,220 --> 00:04:20,890 right correspond to different local 124 00:04:21,250 --> 00:04:22,440 optima of K-means and in fact, 125 00:04:22,890 --> 00:04:24,440 in this example here, 126 00:04:25,070 --> 00:04:26,150 the cluster, the red cluster 127 00:04:26,550 --> 00:04:27,870 has captured only a single optima example. 128 00:04:28,380 --> 00:04:29,810 And the term local 129 00:04:30,200 --> 00:04:31,000 optima, by the way, refers 130 00:04:31,490 --> 00:04:32,930 to local optima of this 131 00:04:33,190 --> 00:04:35,940 distortion function J, and 132 00:04:36,320 --> 00:04:38,380 what these solutions on the 133 00:04:38,590 --> 00:04:39,830 lower left, what these local 134 00:04:40,120 --> 00:04:41,420 optima correspond to is 135 00:04:41,530 --> 00:04:42,880 really solutions where K-means 136 00:04:43,330 --> 00:04:44,050 has gotten stuck to the local 137 00:04:44,600 --> 00:04:45,940 optima and it's not doing 138 00:04:46,170 --> 00:04:47,940 a very good job minimizing this 139 00:04:48,110 --> 00:04:50,030 distortion function J. So, 140 00:04:50,540 --> 00:04:52,250 if you're worried about K-means getting 141 00:04:52,540 --> 00:04:53,810 stuck in local optima, if 142 00:04:53,970 --> 00:04:55,110 you want to increase the odds 143 00:04:55,330 --> 00:04:56,950 of K-means finding the best 144 00:04:57,230 --> 00:04:58,480 possible clustering, like that shown 145 00:04:58,730 --> 00:05:00,290 on top here, what we 146 00:05:00,350 --> 00:05:02,820 can do, is try multiple, random initializations. 147 00:05:03,580 --> 00:05:04,820 So, instead of just initializing K-means 148 00:05:05,430 --> 00:05:06,460 once and hopping that that 149 00:05:06,670 --> 00:05:07,680 works, what we can do 150 00:05:08,040 --> 00:05:10,020 is, initialize K-means lots of 151 00:05:10,130 --> 00:05:10,990 times and run K-means lots of 152 00:05:11,890 --> 00:05:12,870 times, and use that to 153 00:05:12,950 --> 00:05:13,840 try to make sure we get 154 00:05:14,110 --> 00:05:15,640 as good a solution, as 155 00:05:15,800 --> 00:05:18,380 good a local or global optima as possible. 156 00:05:19,480 --> 00:05:22,460 Concretely, here's how you could go about doing that. 157 00:05:22,720 --> 00:05:23,500 Let's say, I decide to run 158 00:05:23,700 --> 00:05:24,800 K-meanss a hundred times 159 00:05:25,160 --> 00:05:26,790 so I'll execute this loop 160 00:05:27,060 --> 00:05:28,900 a hundred times and it's 161 00:05:29,330 --> 00:05:30,830 fairly typical a number of 162 00:05:30,920 --> 00:05:31,910 times when came to will be 163 00:05:32,160 --> 00:05:33,670 something from 50 up to may be 1000. 164 00:05:35,090 --> 00:05:36,730 So, let's say you decide to say K-means one hundred times. 165 00:05:38,220 --> 00:05:39,100 So what that means is that 166 00:05:39,170 --> 00:05:41,490 we would randomnly initialize K-means. 167 00:05:42,350 --> 00:05:43,250 And for each of 168 00:05:43,340 --> 00:05:44,710 these one hundred random intializations 169 00:05:45,370 --> 00:05:47,040 we would run K-means and 170 00:05:47,220 --> 00:05:48,200 that would give us a set 171 00:05:48,430 --> 00:05:50,270 of clusteringings, and a set of cluster 172 00:05:50,590 --> 00:05:51,940 centroids, and then we 173 00:05:52,040 --> 00:05:53,760 would then compute the distortion J, 174 00:05:54,500 --> 00:05:55,600 that is compute this cause function on 175 00:05:56,910 --> 00:05:58,260 the set of cluster assignments 176 00:05:58,720 --> 00:05:59,910 and cluster centroids that we got. 177 00:06:01,000 --> 00:06:03,470 Finally, having done this whole procedure a hundred times. 178 00:06:04,450 --> 00:06:06,330 You will have a hundred different ways 179 00:06:06,710 --> 00:06:08,990 of clustering the data and then 180 00:06:09,240 --> 00:06:10,310 finally what you do 181 00:06:10,590 --> 00:06:11,510 is all of these hundred 182 00:06:11,820 --> 00:06:13,210 ways you have found of clustering the data, 183 00:06:13,800 --> 00:06:16,050 just pick one, that gives us the lowest cost. 184 00:06:16,400 --> 00:06:18,480 That gives us the lowest distortion. 185 00:06:18,960 --> 00:06:20,610 And it turns out that 186 00:06:21,170 --> 00:06:22,490 if you are running K-means with 187 00:06:22,670 --> 00:06:24,520 a fairly small number of 188 00:06:24,630 --> 00:06:25,260 clusters , so you know if the number 189 00:06:25,520 --> 00:06:26,700 of clusters is anywhere from 190 00:06:26,760 --> 00:06:28,180 two up to maybe 10 - 191 00:06:28,980 --> 00:06:30,650 then doing multiple random initializations 192 00:06:31,460 --> 00:06:32,880 can often, can sometimes make 193 00:06:32,990 --> 00:06:34,430 sure that you find a better local optima. 194 00:06:34,690 --> 00:06:37,680 Make sure you find the better clustering data. 195 00:06:37,870 --> 00:06:38,930 But if K is very large, so, if 196 00:06:39,080 --> 00:06:40,000 K is much greater than 10, 197 00:06:40,160 --> 00:06:41,010 certainly if K were, you 198 00:06:41,080 --> 00:06:42,340 know, if you were trying to 199 00:06:42,400 --> 00:06:44,050 find hundreds of clusters, then, 200 00:06:45,840 --> 00:06:47,310 having multiple random initializations is 201 00:06:47,940 --> 00:06:49,220 less likely to make a huge difference 202 00:06:49,360 --> 00:06:50,400 and there is a much 203 00:06:50,590 --> 00:06:51,910 higher chance that your first 204 00:06:52,320 --> 00:06:53,610 random initialization will give 205 00:06:53,730 --> 00:06:55,380 you a pretty decent solution already 206 00:06:56,590 --> 00:06:58,070 and doing, doing multiple random 207 00:06:58,680 --> 00:07:00,060 initializations will probably give 208 00:07:00,260 --> 00:07:02,500 you a slightly better solution but, but maybe not that much. 209 00:07:02,780 --> 00:07:04,230 But it's really in the regime of where 210 00:07:04,540 --> 00:07:05,810 you have a relatively small number 211 00:07:06,090 --> 00:07:07,740 of clusters, especially if you 212 00:07:08,040 --> 00:07:09,080 have, maybe 2 or 3 213 00:07:09,150 --> 00:07:10,550 or 4 clusters that random 214 00:07:11,140 --> 00:07:13,790 initialization could make a huge difference in terms of 215 00:07:14,190 --> 00:07:15,090 making sure you do a good 216 00:07:15,170 --> 00:07:16,920 job minimizing the distortion 217 00:07:17,560 --> 00:07:18,730 function and giving you a good clustering. 218 00:07:21,390 --> 00:07:22,560 So, that's K-means 219 00:07:22,640 --> 00:07:23,300 with random initialization. 220 00:07:24,350 --> 00:07:25,570 If you're trying to learn a 221 00:07:25,710 --> 00:07:26,950 clustering with a relatively small 222 00:07:27,310 --> 00:07:28,250 number of clusters, 2, 3, 223 00:07:28,400 --> 00:07:30,540 4, 5, maybe, 6, 7, using 224 00:07:31,660 --> 00:07:34,040 multiple random initializations can 225 00:07:34,380 --> 00:07:36,830 sometimes, help you find much better clustering of the data. 226 00:07:37,680 --> 00:07:39,650 But, even if you are learning a large number of clusters, the initialization, the random 227 00:07:40,350 --> 00:07:43,280 initialization method that I describe here. 228 00:07:43,520 --> 00:07:45,110 That should give K-means a 229 00:07:45,370 --> 00:07:46,680 reasonable starting point to start 230 00:07:47,030 --> 00:07:48,580 from for finding a good set of clusters.