1 00:00:00,300 --> 00:00:02,220 In the clustering problem we are 2 00:00:02,360 --> 00:00:03,630 given an unlabeled data 3 00:00:03,950 --> 00:00:05,040 set and we would like 4 00:00:05,200 --> 00:00:06,480 to have an algorithm automatically 5 00:00:07,320 --> 00:00:08,700 group the data into coherent 6 00:00:09,340 --> 00:00:11,000 subsets or into coherent clusters for us. 7 00:00:12,380 --> 00:00:14,160 The K Means algorithm is by 8 00:00:14,310 --> 00:00:15,860 far the most popular, by 9 00:00:16,090 --> 00:00:17,410 far the most widely used clustering 10 00:00:17,780 --> 00:00:19,380 algorithm, and in this 11 00:00:19,550 --> 00:00:20,320 video I would like to tell 12 00:00:20,570 --> 00:00:23,400 you what the K Means Algorithm is and how it works. 13 00:00:27,000 --> 00:00:29,310 The K means clustering algorithm is best illustrated in pictures. 14 00:00:29,960 --> 00:00:30,770 Let's say I want to take 15 00:00:31,080 --> 00:00:32,330 an unlabeled data set like 16 00:00:32,490 --> 00:00:34,040 the one shown here, and I 17 00:00:34,100 --> 00:00:36,450 want to group the data into two clusters. 18 00:00:37,710 --> 00:00:38,740 If I run the K Means clustering 19 00:00:39,080 --> 00:00:41,560 algorithm, here is what 20 00:00:41,910 --> 00:00:44,190 I'm going to do. The first step is to randomly initialize two 21 00:00:44,410 --> 00:00:45,920 points, called the cluster centroids. 22 00:00:46,700 --> 00:00:48,170 So, these two crosses here, 23 00:00:49,010 --> 00:00:51,730 these are called the Cluster Centroids 24 00:00:53,270 --> 00:00:54,320 and I have two of them 25 00:00:55,100 --> 00:00:57,840 because I want to group my data into two clusters. 26 00:00:59,130 --> 00:01:02,400 K Means is an iterative algorithm and it does two things. 27 00:01:03,480 --> 00:01:04,790 First is a cluster assignment 28 00:01:05,330 --> 00:01:07,800 step, and second is a move centroid step. 29 00:01:08,360 --> 00:01:09,630 So, let me tell you what those things mean. 30 00:01:11,170 --> 00:01:12,520 The first of the two steps in the 31 00:01:12,700 --> 00:01:14,930 loop of K means, is this cluster assignment step. 32 00:01:15,840 --> 00:01:17,070 What that means is that, it's 33 00:01:17,220 --> 00:01:18,360 going through each of the 34 00:01:18,700 --> 00:01:19,880 examples, each of these green 35 00:01:20,170 --> 00:01:22,120 dots shown here and depending 36 00:01:22,580 --> 00:01:24,140 on whether it's closer to the 37 00:01:24,350 --> 00:01:25,530 red cluster centroid or the 38 00:01:25,620 --> 00:01:27,390 blue cluster centroid, it is going 39 00:01:27,560 --> 00:01:28,570 to assign each of the 40 00:01:28,670 --> 00:01:30,670 data points to one of the two cluster centroids. 41 00:01:32,040 --> 00:01:33,350 Specifically, what I mean 42 00:01:33,460 --> 00:01:34,610 by that, is to go through your 43 00:01:34,730 --> 00:01:36,930 data set and color each 44 00:01:37,130 --> 00:01:38,510 of the points either red or 45 00:01:38,810 --> 00:01:39,890 blue, depending on whether 46 00:01:40,160 --> 00:01:41,060 it is closer to the red 47 00:01:41,170 --> 00:01:42,150 cluster centroid or the blue 48 00:01:42,470 --> 00:01:45,210 cluster centroid, and I've done that in this diagram here. 49 00:01:46,930 --> 00:01:48,700 So, that was the cluster assignment step. 50 00:01:49,780 --> 00:01:52,270 The other part of K means, in the 51 00:01:52,410 --> 00:01:53,390 loop of K means, is the move 52 00:01:53,590 --> 00:01:54,860 centroid step, and what 53 00:01:55,020 --> 00:01:55,730 we are going to do is, we 54 00:01:55,800 --> 00:01:56,890 are going to take the two cluster centroids, 55 00:01:57,390 --> 00:01:58,550 that is, the red cross and 56 00:01:58,830 --> 00:02:00,270 the blue cross, and we are 57 00:02:00,420 --> 00:02:01,420 going to move them to the average 58 00:02:02,070 --> 00:02:03,900 of the points colored the same colour. 59 00:02:04,880 --> 00:02:05,700 So what we are going 60 00:02:05,730 --> 00:02:06,510 to do is look at all the 61 00:02:06,630 --> 00:02:07,810 red points and compute the 62 00:02:08,240 --> 00:02:09,520 average, really the mean 63 00:02:10,080 --> 00:02:11,500 of the location of all the red points, 64 00:02:11,650 --> 00:02:13,690 and we are going to move the red cluster centroid there. 65 00:02:14,190 --> 00:02:15,260 And the same things for the 66 00:02:15,460 --> 00:02:16,370 blue cluster centroid, look at all 67 00:02:16,560 --> 00:02:17,720 the blue dots and compute their 68 00:02:17,840 --> 00:02:19,710 mean, and then move the blue cluster centroid there. 69 00:02:20,320 --> 00:02:20,880 So, let me do that now. 70 00:02:21,170 --> 00:02:22,990 We're going to move the cluster centroids as follows 71 00:02:24,590 --> 00:02:27,350 and I've now moved them to their new means. 72 00:02:28,300 --> 00:02:29,760 The red one moved like that 73 00:02:29,820 --> 00:02:31,350 and the blue one moved 74 00:02:31,510 --> 00:02:34,460 like that and the red one moved like that. 75 00:02:34,620 --> 00:02:35,460 And then we go back to another cluster 76 00:02:35,910 --> 00:02:36,920 assignment step, so we're again 77 00:02:37,190 --> 00:02:38,090 going to look at all of 78 00:02:38,160 --> 00:02:39,670 my unlabeled examples and depending 79 00:02:40,090 --> 00:02:42,840 on whether it's closer the red or the blue cluster centroid, 80 00:02:43,340 --> 00:02:45,150 I'm going to color them either red or blue. 81 00:02:45,640 --> 00:02:47,160 I'm going to assign each point 82 00:02:47,530 --> 00:02:48,550 to one of the two cluster centroids, so let me do that now. 83 00:02:51,450 --> 00:02:52,260 And so the colors of some of the points just changed. 84 00:02:53,400 --> 00:02:55,690 And then I'm going to do another move centroid step. 85 00:02:56,040 --> 00:02:56,810 So I'm going to compute the 86 00:02:57,070 --> 00:02:57,880 average of all the blue points, 87 00:02:58,110 --> 00:02:59,000 compute the average of all 88 00:02:59,040 --> 00:03:00,360 the red points and move my 89 00:03:00,480 --> 00:03:03,770 cluster centroids like this, and 90 00:03:03,930 --> 00:03:05,650 so, let's do that again. 91 00:03:06,160 --> 00:03:07,810 Let me do one more cluster assignment step. 92 00:03:08,320 --> 00:03:09,450 So colour each point red 93 00:03:09,620 --> 00:03:10,840 or blue, based on what it's 94 00:03:11,170 --> 00:03:13,070 closer to and then 95 00:03:13,310 --> 00:03:20,000 do another move centroid step and we're done. 96 00:03:20,350 --> 00:03:21,230 And in fact if you 97 00:03:21,290 --> 00:03:23,250 keep running additional iterations of 98 00:03:23,500 --> 00:03:26,020 K means from here the 99 00:03:26,160 --> 00:03:27,240 cluster centroids will not change 100 00:03:27,540 --> 00:03:28,770 any further and the colours of 101 00:03:28,830 --> 00:03:29,760 the points will not change any 102 00:03:29,940 --> 00:03:31,520 further. And so, this is 103 00:03:31,810 --> 00:03:33,520 the, at this point, 104 00:03:33,770 --> 00:03:35,290 K means has converged and it's 105 00:03:35,400 --> 00:03:36,430 done a pretty good job finding 106 00:03:37,470 --> 00:03:38,750 the two clusters in this data. 107 00:03:39,360 --> 00:03:40,310 Let's write out the K means algorithm more formally. 108 00:03:42,150 --> 00:03:43,930 The K means algorithm takes two inputs. 109 00:03:44,570 --> 00:03:46,200 One is a parameter K, 110 00:03:46,450 --> 00:03:47,260 which is the number of clusters 111 00:03:47,830 --> 00:03:48,900 you want to find in the data. 112 00:03:49,640 --> 00:03:50,820 I'll later say how we might 113 00:03:51,170 --> 00:03:53,290 go about trying to choose k, but 114 00:03:53,470 --> 00:03:54,600 for now let's just say that 115 00:03:55,110 --> 00:03:56,210 we've decided we want a 116 00:03:56,360 --> 00:03:57,600 certain number of clusters and we're 117 00:03:57,690 --> 00:03:58,810 going to tell the algorithm how many 118 00:03:59,040 --> 00:04:00,730 clusters we think there are in the data set. 119 00:04:01,170 --> 00:04:02,120 And then K means also 120 00:04:02,490 --> 00:04:03,430 takes as input this sort 121 00:04:03,880 --> 00:04:05,060 of unlabeled training set of 122 00:04:05,250 --> 00:04:06,530 just the Xs and 123 00:04:06,710 --> 00:04:08,430 because this is unsupervised learning, we 124 00:04:08,520 --> 00:04:10,690 don't have the labels Y anymore. 125 00:04:10,980 --> 00:04:12,470 And for unsupervised learning of 126 00:04:12,740 --> 00:04:14,020 the K means I'm going to 127 00:04:14,550 --> 00:04:16,160 use the convention that XI 128 00:04:16,420 --> 00:04:17,750 is an RN dimensional vector. 129 00:04:18,280 --> 00:04:19,190 And that's why my training examples 130 00:04:19,750 --> 00:04:22,460 are now N dimensional rather N plus one dimensional vectors. 131 00:04:24,340 --> 00:04:25,430 This is what the K means algorithm does. 132 00:04:27,180 --> 00:04:28,630 The first step is that it 133 00:04:28,790 --> 00:04:31,170 randomly initializes k cluster 134 00:04:31,570 --> 00:04:33,550 centroids which we will 135 00:04:33,820 --> 00:04:34,610 call mu 1, mu 2, up 136 00:04:34,840 --> 00:04:36,250 to mu k. And so 137 00:04:36,650 --> 00:04:38,450 in the earlier diagram, the 138 00:04:38,550 --> 00:04:40,770 cluster centroids corresponded to the 139 00:04:41,060 --> 00:04:42,240 location of the red cross 140 00:04:42,660 --> 00:04:44,020 and the location of the blue cross. 141 00:04:44,410 --> 00:04:45,640 So there we had two cluster 142 00:04:45,960 --> 00:04:47,000 centroids, so maybe the red 143 00:04:47,170 --> 00:04:48,470 cross was mu 1 144 00:04:48,650 --> 00:04:49,940 and the blue cross was mu 145 00:04:50,300 --> 00:04:51,360 2, and more generally we would have 146 00:04:51,820 --> 00:04:53,830 k cluster centroids rather than just 2. 147 00:04:54,520 --> 00:04:56,240 Then the inner loop 148 00:04:56,520 --> 00:04:57,360 of k means does the following, 149 00:04:57,830 --> 00:04:59,020 we're going to repeatedly do the following. 150 00:05:00,070 --> 00:05:01,950 First for each of 151 00:05:02,160 --> 00:05:03,920 my training examples, I'm going 152 00:05:04,110 --> 00:05:05,950 to set this variable CI 153 00:05:06,790 --> 00:05:07,960 to be the index 1 through 154 00:05:08,170 --> 00:05:10,520 K of the cluster centroid closest to XI. 155 00:05:11,170 --> 00:05:13,810 So this was my cluster assignment 156 00:05:14,330 --> 00:05:16,870 step, where we 157 00:05:17,000 --> 00:05:18,680 took each of my examples and 158 00:05:18,980 --> 00:05:20,740 coloured it either red 159 00:05:21,050 --> 00:05:22,050 or blue, depending on which 160 00:05:22,380 --> 00:05:23,940 cluster centroid it was closest to. 161 00:05:24,140 --> 00:05:25,090 So CI is going to be 162 00:05:25,280 --> 00:05:26,280 a number from 1 to 163 00:05:26,380 --> 00:05:27,680 K that tells us, you 164 00:05:27,780 --> 00:05:28,760 know, is it closer to the 165 00:05:28,920 --> 00:05:29,820 red cross or is it 166 00:05:29,900 --> 00:05:31,170 closer to the blue cross, 167 00:05:32,200 --> 00:05:33,210 and another way of writing this 168 00:05:33,580 --> 00:05:35,350 is I'm going to, 169 00:05:35,620 --> 00:05:37,820 to compute Ci, I'm 170 00:05:37,890 --> 00:05:39,120 going to take my Ith 171 00:05:39,380 --> 00:05:41,170 example Xi and and 172 00:05:41,360 --> 00:05:42,670 I'm going to measure it's distance 173 00:05:43,900 --> 00:05:44,860 to each of my cluster centroids, 174 00:05:45,410 --> 00:05:46,690 this is mu and then 175 00:05:47,060 --> 00:05:48,640 lower-case k, right, so 176 00:05:48,890 --> 00:05:50,630 capital K is the total 177 00:05:50,910 --> 00:05:51,900 number centroids and I'm going 178 00:05:52,100 --> 00:05:53,160 to use lower case k here 179 00:05:53,770 --> 00:05:55,140 to index into the different centroids. 180 00:05:56,240 --> 00:05:58,470 But so, Ci is going to, I'm going 181 00:05:58,550 --> 00:06:00,110 to minimize over my values 182 00:06:00,550 --> 00:06:01,930 of k and find the 183 00:06:02,120 --> 00:06:03,650 value of K that minimizes this 184 00:06:03,900 --> 00:06:04,750 distance between Xi and the 185 00:06:04,800 --> 00:06:06,130 cluster centroid, and then, 186 00:06:06,340 --> 00:06:08,990 you know, the 187 00:06:09,070 --> 00:06:10,350 value of k that minimizes 188 00:06:10,940 --> 00:06:12,160 this, that's what gets set in 189 00:06:12,300 --> 00:06:14,100 Ci. So, here's 190 00:06:14,360 --> 00:06:16,470 another way of writing out what Ci is. 191 00:06:18,050 --> 00:06:19,150 If I write the norm between 192 00:06:19,270 --> 00:06:21,500 Xi minus Mu-k, 193 00:06:23,000 --> 00:06:24,120 then this is the distance between 194 00:06:24,630 --> 00:06:26,040 my ith training example 195 00:06:26,180 --> 00:06:27,350 Xi and the cluster centroid 196 00:06:28,140 --> 00:06:30,280 Mu subscript K, this is--this 197 00:06:31,150 --> 00:06:32,830 here, that's a lowercase K. So uppercase 198 00:06:33,320 --> 00:06:34,710 K is going to be 199 00:06:34,980 --> 00:06:36,210 used to denote the total 200 00:06:36,450 --> 00:06:38,020 number of cluster centroids, 201 00:06:38,770 --> 00:06:40,430 and this lowercase K's 202 00:06:40,790 --> 00:06:41,840 a number between one and 203 00:06:41,960 --> 00:06:42,940 capital K. I'm just using 204 00:06:43,210 --> 00:06:44,450 lower case K to index 205 00:06:44,930 --> 00:06:45,990 into my different cluster centroids. 206 00:06:47,130 --> 00:06:49,020 Next is lower case k. So 207 00:06:50,050 --> 00:06:51,330 that's the distance between the example and the cluster centroid 208 00:06:51,490 --> 00:06:52,810 and so what I'm going to 209 00:06:53,050 --> 00:06:54,330 do is find the value 210 00:06:55,250 --> 00:06:56,390 of K, of lower case 211 00:06:56,710 --> 00:06:58,900 k that minimizes this, and 212 00:06:59,080 --> 00:07:00,320 so the value of 213 00:07:00,480 --> 00:07:02,100 k that minimizes you know, 214 00:07:02,280 --> 00:07:03,610 that's what I'm going to 215 00:07:04,000 --> 00:07:06,560 set as Ci, and 216 00:07:06,760 --> 00:07:07,850 by convention here I've written 217 00:07:08,190 --> 00:07:09,430 the distance between Xi and 218 00:07:09,480 --> 00:07:11,310 the cluster centroid, by convention 219 00:07:11,820 --> 00:07:13,330 people actually tend to write this as the squared distance. 220 00:07:13,780 --> 00:07:15,370 So we think of Ci as picking 221 00:07:15,660 --> 00:07:16,860 the cluster centroid with the smallest 222 00:07:17,450 --> 00:07:20,110 squared distance to my training example Xi. 223 00:07:20,750 --> 00:07:22,080 But of course minimizing squared distance, 224 00:07:22,500 --> 00:07:23,700 and minimizing distance that should 225 00:07:23,880 --> 00:07:25,670 give you the same value of Ci, 226 00:07:25,830 --> 00:07:26,670 but we usually put in the 227 00:07:26,750 --> 00:07:28,120 square there, just as the 228 00:07:28,430 --> 00:07:31,020 convention that people use for K means. 229 00:07:31,170 --> 00:07:32,320 So that was the cluster assignment step. 230 00:07:33,480 --> 00:07:34,750 The other in the loop 231 00:07:34,980 --> 00:07:37,740 of K means does the move centroid step. 232 00:07:40,540 --> 00:07:41,750 And what that does is for 233 00:07:42,160 --> 00:07:43,460 each of my cluster centroids, so 234 00:07:43,550 --> 00:07:44,740 for lower case k equals 1 through 235 00:07:44,870 --> 00:07:46,190 K, it sets Mu-k equals 236 00:07:46,710 --> 00:07:48,460 to the average of the points assigned to cluster. 237 00:07:49,270 --> 00:07:50,720 So as a concrete example, let's say 238 00:07:50,910 --> 00:07:52,100 that one of my cluster 239 00:07:52,340 --> 00:07:53,420 centroids, let's say cluster centroid 240 00:07:53,750 --> 00:07:55,030 two, has training examples, 241 00:07:55,820 --> 00:08:02,390 you know, 1, 5, 6, and 10 assigned to it. 242 00:08:03,220 --> 00:08:04,270 And what this means is, 243 00:08:04,470 --> 00:08:05,510 really this means that C1 equals 244 00:08:06,560 --> 00:08:09,180 to C5 equals to 245 00:08:10,690 --> 00:08:12,180 C6 equals to and 246 00:08:12,300 --> 00:08:13,730 similarly well c10 equals, too, right? 247 00:08:14,970 --> 00:08:17,070 If we got that 248 00:08:17,160 --> 00:08:18,940 from the cluster assignment step, then 249 00:08:19,190 --> 00:08:21,250 that means examples 1,5,6 and 250 00:08:21,450 --> 00:08:22,960 10 were assigned to the cluster centroid two. 251 00:08:24,020 --> 00:08:25,210 Then in this move centroid step, 252 00:08:25,540 --> 00:08:26,580 what I'm going to do is just 253 00:08:27,180 --> 00:08:29,290 compute the average of these four things. 254 00:08:31,340 --> 00:08:33,950 So X1 plus X5 plus X6 255 00:08:34,270 --> 00:08:35,620 plus X10. 256 00:08:35,890 --> 00:08:37,190 And now I'm going 257 00:08:37,380 --> 00:08:38,630 to average them so here I 258 00:08:38,920 --> 00:08:40,020 have four points assigned to 259 00:08:40,100 --> 00:08:41,700 this cluster centroid, just take 260 00:08:42,280 --> 00:08:43,240 one quarter of that. 261 00:08:43,980 --> 00:08:45,890 And now Mu2 is going to 262 00:08:46,100 --> 00:08:47,910 be an n-dimensional vector. 263 00:08:48,420 --> 00:08:49,480 Because each of these 264 00:08:49,700 --> 00:08:51,050 example x1, x5, x6, x10 265 00:08:52,160 --> 00:08:53,170 each of them were an n-dimensional 266 00:08:53,700 --> 00:08:55,150 vector, and I'm going to 267 00:08:55,240 --> 00:08:56,270 add up these things and, you 268 00:08:56,550 --> 00:08:57,870 know, divide by four because I 269 00:08:57,940 --> 00:08:59,320 have four points assigned to this 270 00:08:59,490 --> 00:09:00,730 cluster centroid, I end up 271 00:09:01,280 --> 00:09:02,770 with my move centroid step, 272 00:09:03,870 --> 00:09:05,260 for my cluster centroid mu-2. 273 00:09:05,870 --> 00:09:06,850 This has the effect of moving 274 00:09:07,210 --> 00:09:08,950 mu-2 to the average of 275 00:09:09,130 --> 00:09:10,620 the four points listed here. 276 00:09:12,430 --> 00:09:13,850 One thing that I've asked is, well here we said, let's 277 00:09:14,080 --> 00:09:16,600 let mu-k be the average of the points assigned to the cluster. 278 00:09:17,500 --> 00:09:18,900 But what if there is 279 00:09:18,960 --> 00:09:21,310 a cluster centroid no points 280 00:09:21,690 --> 00:09:23,000 with zero points assigned to it. 281 00:09:23,280 --> 00:09:24,300 In that case the more common 282 00:09:24,650 --> 00:09:25,720 thing to do is to just 283 00:09:26,140 --> 00:09:27,220 eliminate that cluster centroid. 284 00:09:27,830 --> 00:09:28,630 And if you do that, you end 285 00:09:28,840 --> 00:09:30,260 up with K minus one clusters 286 00:09:31,350 --> 00:09:33,840 instead of k clusters. 287 00:09:34,400 --> 00:09:35,620 Sometimes if you really need 288 00:09:35,830 --> 00:09:37,380 k clusters, then the other 289 00:09:37,490 --> 00:09:38,220 thing you can do if you 290 00:09:38,290 --> 00:09:39,530 have a cluster centroid with no 291 00:09:39,740 --> 00:09:41,170 points assigned to it is you can 292 00:09:41,260 --> 00:09:42,590 just randomly reinitialize that cluster 293 00:09:43,450 --> 00:09:44,870 centroid, but it's more 294 00:09:45,170 --> 00:09:46,590 common to just eliminate a 295 00:09:46,670 --> 00:09:48,210 cluster if somewhere during 296 00:09:48,410 --> 00:09:49,690 K means it with no points 297 00:09:50,290 --> 00:09:52,020 assigned to that cluster centroid, and 298 00:09:52,140 --> 00:09:53,340 that can happen, altthough in practice 299 00:09:53,820 --> 00:09:55,590 it happens not that often. 300 00:09:55,810 --> 00:09:57,280 So that's the K means Algorithm. 301 00:09:59,330 --> 00:10:00,220 Before wrapping up this video 302 00:10:00,620 --> 00:10:01,290 I just want to tell you 303 00:10:01,350 --> 00:10:02,710 about one other common application 304 00:10:03,350 --> 00:10:04,680 of K Means and that's 305 00:10:04,920 --> 00:10:06,840 to the problems with non well separated clusters. 306 00:10:08,160 --> 00:10:08,640 Here's what I mean. 307 00:10:09,280 --> 00:10:10,320 So far we've been picturing K Means 308 00:10:10,950 --> 00:10:12,090 and applying it to data 309 00:10:12,330 --> 00:10:13,520 sets like that shown here where 310 00:10:14,150 --> 00:10:15,590 we have three pretty 311 00:10:15,900 --> 00:10:17,380 well separated clusters, and we'd 312 00:10:17,670 --> 00:10:19,930 like an algorithm to find maybe the 3 clusters for us. 313 00:10:20,750 --> 00:10:21,840 But it turns out that 314 00:10:21,980 --> 00:10:23,180 very often K Means is also 315 00:10:23,600 --> 00:10:24,860 applied to data sets that 316 00:10:25,170 --> 00:10:26,240 look like this where there may 317 00:10:26,330 --> 00:10:28,300 not be several very 318 00:10:28,550 --> 00:10:30,250 well separated clusters. 319 00:10:30,830 --> 00:10:32,960 Here is an example application, to t-shirt sizing. 320 00:10:34,070 --> 00:10:34,650 Let's say you are a t-shirt 321 00:10:35,270 --> 00:10:37,360 manufacturer you've done is you've gone 322 00:10:38,030 --> 00:10:39,310 to the population that you 323 00:10:39,380 --> 00:10:40,520 want to sell t-shirts to, and 324 00:10:40,800 --> 00:10:42,190 you've collected a number of 325 00:10:42,580 --> 00:10:43,770 examples of the height and weight 326 00:10:44,270 --> 00:10:45,350 of these people in your 327 00:10:45,680 --> 00:10:46,740 population and so, well I 328 00:10:47,220 --> 00:10:48,280 guess height and weight tend to 329 00:10:48,370 --> 00:10:50,310 be positively highlighted so maybe 330 00:10:50,540 --> 00:10:51,160 you end up with a data 331 00:10:51,430 --> 00:10:52,590 set like this, you know, with 332 00:10:52,830 --> 00:10:53,910 a sample or set of 333 00:10:53,980 --> 00:10:56,000 examples of different peoples heights and weight. 334 00:10:56,530 --> 00:10:57,880 Let's say you want to size your t shirts. 335 00:10:58,570 --> 00:10:59,810 Let's say I want to design 336 00:11:00,330 --> 00:11:01,480 and sell t shirts of three 337 00:11:01,660 --> 00:11:03,970 sizes, small, medium and large. 338 00:11:04,660 --> 00:11:06,420 So how big should I make my small one? 339 00:11:06,550 --> 00:11:07,320 How big should I my medium? 340 00:11:07,690 --> 00:11:09,300 And how big should I make my large t-shirts. 341 00:11:10,370 --> 00:11:11,290 One way to do that would 342 00:11:11,410 --> 00:11:12,050 to be to run my k means 343 00:11:12,270 --> 00:11:13,570 clustering logarithm on this data 344 00:11:13,830 --> 00:11:14,640 set that I have shown on 345 00:11:14,820 --> 00:11:16,570 the right and maybe what 346 00:11:16,770 --> 00:11:18,270 K Means will do is group 347 00:11:18,600 --> 00:11:20,460 all of these points into one 348 00:11:20,660 --> 00:11:22,120 cluster and group all 349 00:11:22,340 --> 00:11:24,150 of these points into a 350 00:11:24,190 --> 00:11:25,530 second cluster and group 351 00:11:25,740 --> 00:11:28,080 all of those points into a third cluster. 352 00:11:28,520 --> 00:11:29,870 So, even though the data, you 353 00:11:30,020 --> 00:11:30,960 know, before hand it didn't 354 00:11:31,060 --> 00:11:31,990 seem like we had 3 355 00:11:32,050 --> 00:11:33,920 well separated clusters, K Means will 356 00:11:34,050 --> 00:11:36,870 kind of separate out the data 357 00:11:37,320 --> 00:11:38,560 into multiple pluses for you. 358 00:11:39,270 --> 00:11:40,220 And what you can do is then 359 00:11:40,420 --> 00:11:42,010 look at this first population of 360 00:11:42,130 --> 00:11:44,340 people and look at 361 00:11:44,500 --> 00:11:45,590 them and, you know, look 362 00:11:45,780 --> 00:11:46,810 at the height and weight, and 363 00:11:46,900 --> 00:11:47,900 try to design a small 364 00:11:48,350 --> 00:11:49,540 t-shirt so that it kind 365 00:11:49,710 --> 00:11:51,160 of fits this first population of 366 00:11:51,310 --> 00:11:52,830 people well and then design 367 00:11:53,150 --> 00:11:55,800 a medium t-shirt and design a large t-shirt. 368 00:11:56,510 --> 00:11:57,860 And this is in fact kind of 369 00:11:57,990 --> 00:11:59,740 an example of market segmentation 370 00:12:01,140 --> 00:12:02,800 where you're using K Means to separate your 371 00:12:02,940 --> 00:12:04,320 market into 3 different segments. 372 00:12:05,220 --> 00:12:06,500 So you can design a product 373 00:12:07,000 --> 00:12:08,260 separately that is a small, medium, and large t-shirts, 374 00:12:09,880 --> 00:12:11,230 that tries to suit 375 00:12:11,650 --> 00:12:12,770 the needs of each of your 376 00:12:12,920 --> 00:12:15,310 3 separate sub-populations well. 377 00:12:16,220 --> 00:12:17,570 So that's the K Means algorithm. 378 00:12:18,240 --> 00:12:19,080 And by now you should 379 00:12:19,300 --> 00:12:20,500 know how to implement the K 380 00:12:20,630 --> 00:12:22,510 Means Algorithm and kind of get it to work for some problems. 381 00:12:23,420 --> 00:12:24,380 But in the next few videos 382 00:12:24,860 --> 00:12:26,430 what I want to do is 383 00:12:26,580 --> 00:12:27,690 really get more deeply into the nuts 384 00:12:28,010 --> 00:12:29,210 and bolts of K means and 385 00:12:29,510 --> 00:12:32,080 to talk a bit about how to actually get this to work really well.