1 00:00:00,200 --> 00:00:01,390 In this video I'd like to 2 00:00:01,570 --> 00:00:02,780 talk about one last detail 3 00:00:03,350 --> 00:00:04,950 of K-means clustering which is 4 00:00:05,450 --> 00:00:06,680 how to choose the number of 5 00:00:06,770 --> 00:00:07,890 clusters, or how to choose 6 00:00:08,290 --> 00:00:09,160 the value of the parameter 7 00:00:10,230 --> 00:00:12,310 capsule K. To be 8 00:00:12,390 --> 00:00:13,690 honest, there actually isn't a 9 00:00:13,760 --> 00:00:15,420 great way of answering this 10 00:00:15,680 --> 00:00:17,150 or doing this automatically and 11 00:00:17,820 --> 00:00:18,930 by far the most common way 12 00:00:19,110 --> 00:00:20,380 of choosing the number of clusters, 13 00:00:20,520 --> 00:00:22,040 is still choosing it manually 14 00:00:22,710 --> 00:00:24,380 by looking at visualizations or by 15 00:00:24,450 --> 00:00:26,070 looking at the output of the clustering algorithm or something else. 16 00:00:27,340 --> 00:00:28,270 But I do get asked 17 00:00:28,600 --> 00:00:29,460 this question quite a lot of 18 00:00:29,650 --> 00:00:30,510 how do you choose the number of 19 00:00:30,810 --> 00:00:31,930 clusters, and so I just want 20 00:00:32,240 --> 00:00:33,650 to tell you know what 21 00:00:33,850 --> 00:00:35,020 are peoples' current thinking on 22 00:00:35,230 --> 00:00:36,480 it although, the most 23 00:00:36,740 --> 00:00:38,060 common thing is actually to 24 00:00:38,180 --> 00:00:40,130 choose the number of clusters by hand. 25 00:00:42,230 --> 00:00:43,680 A large part of 26 00:00:43,800 --> 00:00:45,020 why it might not always 27 00:00:45,390 --> 00:00:46,530 be easy to choose the 28 00:00:46,640 --> 00:00:47,940 number of clusters is that 29 00:00:48,190 --> 00:00:51,920 it is often generally ambiguous how many clusters there are in the data. 30 00:00:52,940 --> 00:00:53,890 Looking at this data set 31 00:00:54,080 --> 00:00:55,110 some of you may see 32 00:00:55,380 --> 00:00:56,830 four clusters and that 33 00:00:57,020 --> 00:00:59,440 would suggest using K equals 4. 34 00:00:59,620 --> 00:01:00,650 Or some of you may 35 00:01:00,870 --> 00:01:02,620 see two clusters and 36 00:01:02,730 --> 00:01:04,460 that will suggest K equals 37 00:01:04,870 --> 00:01:06,630 2 and now this may see three clusters. 38 00:01:08,070 --> 00:01:09,710 And so, looking at the 39 00:01:09,820 --> 00:01:10,750 data set like this, the 40 00:01:10,920 --> 00:01:12,390 true number of clusters, it actually 41 00:01:12,810 --> 00:01:14,560 seems genuinely ambiguous to me, 42 00:01:14,690 --> 00:01:17,160 and I don't think there is one right answer. 43 00:01:18,100 --> 00:01:19,500 And this is part of our supervised learning. 44 00:01:20,250 --> 00:01:21,450 We are aren't given labels, and 45 00:01:21,550 --> 00:01:23,950 so there isn't always a clear cut answer. 46 00:01:24,830 --> 00:01:25,730 And this is one of the 47 00:01:25,850 --> 00:01:26,710 things that makes it more difficult 48 00:01:27,340 --> 00:01:28,530 to say, have an automatic 49 00:01:29,160 --> 00:01:30,860 algorithm for choosing how many clusters to have. 50 00:01:32,100 --> 00:01:33,250 When people talk about ways of 51 00:01:33,320 --> 00:01:34,270 choosing the number of clusters, 52 00:01:34,840 --> 00:01:36,050 one method that people sometimes 53 00:01:36,440 --> 00:01:39,150 talk about is something called the Elbow Method. 54 00:01:39,630 --> 00:01:40,490 Let me just tell you a little bit about that, 55 00:01:40,800 --> 00:01:43,760 and then mention some of its advantages but also shortcomings. 56 00:01:44,690 --> 00:01:45,980 So the Elbow Method, 57 00:01:46,420 --> 00:01:47,570 what we're going to do is vary 58 00:01:48,340 --> 00:01:49,860 K, which is the total number of clusters. 59 00:01:50,250 --> 00:01:51,570 So, we're going to run K-means 60 00:01:52,050 --> 00:01:53,340 with one cluster, that means really, 61 00:01:53,630 --> 00:01:54,840 everything gets grouped into a 62 00:01:54,980 --> 00:01:56,530 single cluster and compute the 63 00:01:56,660 --> 00:01:57,850 cost function or compute the distortion 64 00:01:58,460 --> 00:01:59,490 J and plot that here. 65 00:02:00,410 --> 00:02:01,090 And then we're going to run K 66 00:02:01,310 --> 00:02:03,270 means with two clusters, maybe 67 00:02:03,610 --> 00:02:05,430 with multiple random initial agents, maybe not. 68 00:02:06,140 --> 00:02:07,150 But then, you know, 69 00:02:07,280 --> 00:02:08,280 with two clusters we should 70 00:02:08,500 --> 00:02:09,510 get, hopefully, a smaller distortion, 71 00:02:10,710 --> 00:02:11,820 and so plot that there. 72 00:02:11,950 --> 00:02:13,100 And then run K-means with three 73 00:02:13,310 --> 00:02:14,590 clusters, hopefully, you get even 74 00:02:14,760 --> 00:02:16,680 smaller distortion and plot that there. 75 00:02:16,990 --> 00:02:19,710 I'm gonna run K-means with four, five and so on. 76 00:02:19,780 --> 00:02:20,790 And so we end up with 77 00:02:20,970 --> 00:02:22,840 a curve showing how the 78 00:02:23,240 --> 00:02:24,560 distortion, you know, goes 79 00:02:24,800 --> 00:02:27,170 down as we increase the number of clusters. 80 00:02:27,440 --> 00:02:29,870 And so we get a curve that maybe looks like this. 81 00:02:31,390 --> 00:02:32,210 And if you look at this 82 00:02:32,300 --> 00:02:33,400 curve, what the Elbow Method does 83 00:02:33,720 --> 00:02:35,770 it says "Well, let's look at this plot. 84 00:02:36,450 --> 00:02:39,340 Looks like there's a clear elbow there". 85 00:02:40,230 --> 00:02:41,620 Right, this is, would be by 86 00:02:41,830 --> 00:02:43,210 analogy to the human arm where, 87 00:02:43,550 --> 00:02:44,620 you know, if you imagine that 88 00:02:45,370 --> 00:02:46,460 you reach out your arm, 89 00:02:47,240 --> 00:02:48,940 then, this is your 90 00:02:49,160 --> 00:02:50,340 shoulder joint, this is your 91 00:02:50,550 --> 00:02:52,960 elbow joint and I guess, your hand is at the end over here. 92 00:02:53,260 --> 00:02:54,170 And so this is the Elbow Method. 93 00:02:54,490 --> 00:02:55,930 Then you find this sort of pattern 94 00:02:56,250 --> 00:02:57,630 where the distortion goes down rapidly 95 00:02:58,550 --> 00:02:59,120 from 1 to 2, and 2 to 96 00:02:59,280 --> 00:03:01,330 3, and then you reach an 97 00:03:01,520 --> 00:03:03,160 elbow at 3, and then 98 00:03:03,330 --> 00:03:05,260 the distortion goes down very slowly after that. 99 00:03:05,430 --> 00:03:06,520 And then it looks like, you 100 00:03:06,580 --> 00:03:08,700 know what, maybe using three 101 00:03:08,960 --> 00:03:09,920 clusters is the right 102 00:03:10,040 --> 00:03:11,340 number of clusters, because that's 103 00:03:12,020 --> 00:03:14,430 the elbow of this curve, right? 104 00:03:14,700 --> 00:03:16,040 That it goes down, distortion goes 105 00:03:16,250 --> 00:03:17,290 down rapidly until K equals 106 00:03:17,610 --> 00:03:19,700 3, really goes down very slowly after that. 107 00:03:19,820 --> 00:03:20,850 So let's pick K equals 3. 108 00:03:23,460 --> 00:03:24,570 If you apply the Elbow Method, 109 00:03:25,110 --> 00:03:26,240 and if you get a plot 110 00:03:26,540 --> 00:03:27,450 that actually looks like this, 111 00:03:27,890 --> 00:03:29,120 then, that's pretty good, and 112 00:03:29,240 --> 00:03:30,160 this would be a reasonable way 113 00:03:30,700 --> 00:03:32,590 of choosing the number of clusters. 114 00:03:33,620 --> 00:03:34,600 It turns out the Elbow Method 115 00:03:35,040 --> 00:03:37,170 isn't used that often, and one 116 00:03:37,340 --> 00:03:38,270 reason is that, if you 117 00:03:38,350 --> 00:03:39,470 actually use this on 118 00:03:39,720 --> 00:03:41,060 a clustering problem, it turns out that 119 00:03:41,210 --> 00:03:42,640 fairly often, you know, 120 00:03:42,740 --> 00:03:43,610 you end up with a curve 121 00:03:43,910 --> 00:03:46,940 that looks much more ambiguous, maybe something like this. 122 00:03:47,700 --> 00:03:48,370 And if you look at this, 123 00:03:48,920 --> 00:03:50,220 I don't know, maybe there's 124 00:03:50,390 --> 00:03:51,580 no clear elbow, but it 125 00:03:51,720 --> 00:03:53,090 looks like distortion continuously goes down, 126 00:03:53,440 --> 00:03:54,570 maybe 3 is a 127 00:03:54,620 --> 00:03:55,680 good number, maybe 4 is 128 00:03:55,750 --> 00:03:58,180 a good number, maybe 5 is also not bad. 129 00:03:58,390 --> 00:03:59,190 And so, if you actually 130 00:03:59,600 --> 00:04:00,710 do this in a practice, you know, 131 00:04:00,820 --> 00:04:02,690 if your plot looks like the one on the left and that's great. 132 00:04:03,400 --> 00:04:04,990 It gives you a clear answer, but 133 00:04:05,490 --> 00:04:06,550 just as often, you end 134 00:04:06,740 --> 00:04:07,580 up with a plot that looks 135 00:04:07,750 --> 00:04:09,020 like the one on the right and 136 00:04:09,110 --> 00:04:11,000 is not clear where the 137 00:04:11,790 --> 00:04:13,230 ready location of the elbow 138 00:04:13,490 --> 00:04:14,440 is. It makes it harder to 139 00:04:14,640 --> 00:04:16,700 choose a number of clusters using this method. 140 00:04:17,370 --> 00:04:18,220 So maybe the quick summary 141 00:04:18,700 --> 00:04:20,500 of the Elbow Method is that is worth the shot 142 00:04:21,010 --> 00:04:22,350 but I wouldn't necessarily, 143 00:04:23,610 --> 00:04:24,700 you know, have a very high 144 00:04:24,870 --> 00:04:27,360 expectation of it working for any particular problem. 145 00:04:29,880 --> 00:04:31,030 Finally, here's one other way 146 00:04:31,300 --> 00:04:32,850 of how, thinking about how 147 00:04:32,990 --> 00:04:33,980 you choose the value of K, 148 00:04:34,930 --> 00:04:36,030 very often people are running 149 00:04:36,310 --> 00:04:37,380 K-means in order you 150 00:04:37,530 --> 00:04:38,770 get clusters for some later 151 00:04:39,240 --> 00:04:40,880 purpose, or for some sort of downstream purpose. 152 00:04:41,460 --> 00:04:42,900 Maybe you want to use K-means 153 00:04:43,380 --> 00:04:44,460 in order to do market segmentation, 154 00:04:45,310 --> 00:04:47,600 like in the T-shirt sizing example that we talked about. 155 00:04:48,140 --> 00:04:50,570 Maybe you want K-means to organize 156 00:04:51,130 --> 00:04:52,300 a computer cluster better, or 157 00:04:52,480 --> 00:04:53,430 maybe a learning cluster for some 158 00:04:53,630 --> 00:04:55,070 different purpose, and so, 159 00:04:55,450 --> 00:04:57,020 if that later, downstream purpose, 160 00:04:57,510 --> 00:04:59,050 such as market segmentation. If 161 00:04:59,180 --> 00:05:00,420 that gives you an evaluation metric, 162 00:05:01,310 --> 00:05:02,670 then often, a better 163 00:05:02,800 --> 00:05:03,890 way to determine the number of 164 00:05:03,960 --> 00:05:05,680 clusters, is to see 165 00:05:06,010 --> 00:05:07,740 how well different numbers of 166 00:05:07,930 --> 00:05:10,140 clusters serve that later downstream purpose. 167 00:05:11,230 --> 00:05:13,050 Let me step through a specific example. 168 00:05:14,190 --> 00:05:15,080 Let me go through the T-shirt 169 00:05:15,440 --> 00:05:17,420 size example again, and I'm 170 00:05:17,570 --> 00:05:19,700 trying to decide, do I want three T-shirt sizes? 171 00:05:20,330 --> 00:05:22,320 So, I choose K equals 3, then 172 00:05:22,560 --> 00:05:25,360 I might have small, medium and large T-shirts. 173 00:05:26,320 --> 00:05:27,250 Or maybe, I want to choose 174 00:05:27,470 --> 00:05:28,240 K equals 5, and then I 175 00:05:29,030 --> 00:05:30,140 might have, you know, extra 176 00:05:30,390 --> 00:05:33,130 small, small, medium, large 177 00:05:33,620 --> 00:05:35,070 and extra large T-shirt sizes. 178 00:05:35,860 --> 00:05:38,580 So, you can have like 3 T-shirt sizes or four or five T-shirt sizes. 179 00:05:39,270 --> 00:05:40,100 We could also have four T-shirt 180 00:05:40,430 --> 00:05:41,740 sizes, but I'm just 181 00:05:41,930 --> 00:05:43,240 showing three and five here, 182 00:05:43,490 --> 00:05:45,670 just to simplify this slide for now. 183 00:05:46,930 --> 00:05:49,020 So, if I run K-means with 184 00:05:49,130 --> 00:05:50,290 K equals 3, maybe I end 185 00:05:50,670 --> 00:05:51,860 up with, that's my small 186 00:05:53,100 --> 00:05:55,020 and that's my 187 00:05:55,140 --> 00:05:56,720 medium and that's my large. 188 00:05:58,580 --> 00:06:00,370 Whereas, if I run K-means with 189 00:06:00,650 --> 00:06:03,540 5 clusters, maybe I 190 00:06:03,700 --> 00:06:05,170 end up with, those are 191 00:06:05,330 --> 00:06:07,400 my extra small T-shirts, these 192 00:06:07,740 --> 00:06:10,920 are my small, these are 193 00:06:11,050 --> 00:06:13,740 my medium, these are my 194 00:06:13,990 --> 00:06:17,110 large and these are my extra large. 195 00:06:19,320 --> 00:06:20,150 And the nice thing about this 196 00:06:20,320 --> 00:06:21,510 example is that, this then 197 00:06:21,810 --> 00:06:22,940 maybe gives us another way 198 00:06:23,550 --> 00:06:24,730 to choose whether we want 199 00:06:24,970 --> 00:06:26,070 3 or 4 or 5 clusters, 200 00:06:28,570 --> 00:06:29,630 and in particular, what you can 201 00:06:29,730 --> 00:06:30,630 do is, you know, think 202 00:06:30,810 --> 00:06:31,770 about this from the perspective 203 00:06:32,380 --> 00:06:33,810 of the T-shirt business and 204 00:06:34,320 --> 00:06:35,150 ask: "Well if I have 205 00:06:35,620 --> 00:06:37,190 five segments, then how well 206 00:06:38,060 --> 00:06:39,610 will my T-shirts fit my 207 00:06:39,780 --> 00:06:42,100 customers and so, how many T-shirts can I sell? 208 00:06:42,420 --> 00:06:44,390 How happy will my customers be?" 209 00:06:44,550 --> 00:06:45,920 What really makes sense, from the 210 00:06:46,080 --> 00:06:47,530 perspective of the T-shirt business, 211 00:06:47,590 --> 00:06:49,390 in terms of whether, I 212 00:06:49,520 --> 00:06:51,480 want to have Goer T-shirt sizes 213 00:06:51,990 --> 00:06:54,040 so that my T-shirts fit my customers better. 214 00:06:54,970 --> 00:06:56,130 Or do I want to have fewer 215 00:06:56,360 --> 00:06:57,570 T-shirt sizes so that 216 00:06:58,410 --> 00:07:00,220 I make fewer sizes of T-shirts. 217 00:07:00,610 --> 00:07:02,290 And I can sell them to the customers more cheaply. 218 00:07:02,840 --> 00:07:04,700 And so, the t-shirt selling 219 00:07:05,040 --> 00:07:06,150 business, that might give you 220 00:07:06,660 --> 00:07:09,260 a way to decide, between three clusters versus five clusters. 221 00:07:10,780 --> 00:07:12,000 So, that gives you an 222 00:07:12,480 --> 00:07:13,880 example of how a 223 00:07:14,130 --> 00:07:15,810 later downstream purpose like 224 00:07:16,010 --> 00:07:17,260 the problem of deciding what 225 00:07:17,390 --> 00:07:19,230 T-shirts to manufacture, how that 226 00:07:19,380 --> 00:07:21,980 can give you an evaluation metric for choosing the number of clusters. 227 00:07:22,900 --> 00:07:23,800 For those of you that are 228 00:07:23,880 --> 00:07:25,490 doing the program exercises, if 229 00:07:25,670 --> 00:07:27,070 you look at this week's 230 00:07:27,290 --> 00:07:29,540 program exercise associative K-means, that's 231 00:07:29,790 --> 00:07:32,000 an example there of using K-means for image compression. 232 00:07:32,910 --> 00:07:33,960 And so if you were trying to 233 00:07:34,070 --> 00:07:35,170 choose how many clusters 234 00:07:35,410 --> 00:07:36,950 to use for that problem, you could 235 00:07:37,260 --> 00:07:38,550 also, again use the 236 00:07:39,030 --> 00:07:40,330 evaluation metric of image compression 237 00:07:40,890 --> 00:07:42,470 to choose the number of clusters, K? 238 00:07:43,130 --> 00:07:43,870 So, how good do you want the 239 00:07:44,000 --> 00:07:45,430 image to look versus, how much 240 00:07:45,680 --> 00:07:46,680 do you want to compress the file 241 00:07:46,970 --> 00:07:48,390 size of the image, and, 242 00:07:48,610 --> 00:07:49,830 you know, if you do the 243 00:07:50,050 --> 00:07:50,980 programming exercise, what I've just 244 00:07:51,160 --> 00:07:52,480 said will make more sense at that time. 245 00:07:53,760 --> 00:07:56,500 So, just summarize, for the 246 00:07:56,590 --> 00:07:57,800 most part, the number of 247 00:07:58,030 --> 00:07:59,560 customers K is still chosen 248 00:08:00,150 --> 00:08:01,900 by hand by human input or human insight. 249 00:08:02,800 --> 00:08:03,810 One way to try to 250 00:08:03,950 --> 00:08:05,010 do so is to use 251 00:08:05,170 --> 00:08:06,360 the Elbow Method, but I 252 00:08:06,520 --> 00:08:07,660 wouldn't always expect that to 253 00:08:07,760 --> 00:08:08,620 work well, but I think 254 00:08:08,820 --> 00:08:09,730 the better way to think about 255 00:08:09,970 --> 00:08:10,800 how to choose the number of 256 00:08:10,920 --> 00:08:12,310 clusters is to ask, for 257 00:08:12,520 --> 00:08:13,890 what purpose are you running K-means? 258 00:08:15,490 --> 00:08:16,610 And then to think, what is 259 00:08:16,830 --> 00:08:18,210 the number of clusters K that 260 00:08:18,350 --> 00:08:19,490 serves that, you know, whatever 261 00:08:19,670 --> 00:08:21,710 later purpose that you actually run the K-means for.