1 00:00:00,000 --> 00:00:04,683 [MUSIC] 2 00:00:04,683 --> 00:00:07,880 To start with, let's discuss divisive clustering. 3 00:00:07,880 --> 00:00:11,310 And in pictures, what we can think of is that at some granularity, 4 00:00:11,310 --> 00:00:15,750 our data can be pretty well described as a set of clusters. 5 00:00:15,750 --> 00:00:20,240 So maybe it could even be described as a set of elliptical clusters using 6 00:00:20,240 --> 00:00:24,700 the type of mixture models, in particular mixture of Gaussians we described before. 7 00:00:24,700 --> 00:00:29,120 But then when we dig down to a finer granularity, what we see is that 8 00:00:29,120 --> 00:00:34,400 each individual cluster is actually better described as a set of clusters itself. 9 00:00:34,400 --> 00:00:38,150 And we can keep recursing this idea where we take each cluster at a given 10 00:00:38,150 --> 00:00:43,600 granularity, dig down and specify it as a set of clusters itself. 11 00:00:43,600 --> 00:00:46,570 And this is what divisive clustering does. 12 00:00:46,570 --> 00:00:49,170 So as an example, one 13 00:00:49,170 --> 00:00:53,870 very straightforward approach is to just recursively apply R k-means algorithm. 14 00:00:53,870 --> 00:00:57,410 So one application that you're going to look at in your assignment 15 00:00:57,410 --> 00:01:01,190 is clustering Wikipedia articles, which we've looked at in past assignments. 16 00:01:01,190 --> 00:01:04,720 But here we're going to take the set of all these Wikipedia articles and 17 00:01:04,720 --> 00:01:08,370 then we're going to apply k-means, which is k equals 2, as an example. 18 00:01:08,370 --> 00:01:13,185 So just split into two clusters, and maybe at the first split, what we see is that 19 00:01:13,185 --> 00:01:17,290 k-means separates all these articles into articles about athletes and 20 00:01:17,290 --> 00:01:19,130 articles about non-athletes. 21 00:01:19,130 --> 00:01:22,270 Of course these are learn clusters, we don't have these labels, these are things 22 00:01:22,270 --> 00:01:28,320 that we infer digging down and looking into the contents of these clusters. 23 00:01:28,320 --> 00:01:31,164 Okay, well then we recurse this process where for 24 00:01:31,164 --> 00:01:34,900 each one of these clusters we run k-means with k equals 2 again. 25 00:01:34,900 --> 00:01:37,070 And so maybe at the next level, 26 00:01:37,070 --> 00:01:39,770 we have the following set of clusters where there is a cluster for 27 00:01:40,820 --> 00:01:45,630 specific athletes for baseball or athletes for soccer and ice hockey. 28 00:01:45,630 --> 00:01:51,030 And then when we look at the non-athlete cluster, that splits into a cluster for 29 00:01:51,030 --> 00:01:53,080 musicians, artists and actors. 30 00:01:53,080 --> 00:01:57,260 And another one's for scholars, politicians and government officials. 31 00:01:57,260 --> 00:01:59,660 And then we can think about recursing this process. 32 00:01:59,660 --> 00:02:03,710 So, for example, maybe we want to further split the cluster that's about 33 00:02:03,710 --> 00:02:08,210 soccer players and ice hockey players to get at a granularity that better matches 34 00:02:08,210 --> 00:02:11,330 what we'd like to come out of our clustering procedure. 35 00:02:11,330 --> 00:02:13,460 Okay, well, when we're doing divisive clustering, 36 00:02:13,460 --> 00:02:15,770 there are a number of choices that we have to make. 37 00:02:15,770 --> 00:02:19,180 One is we have to think about what algorithm are we going to apply 38 00:02:19,180 --> 00:02:21,130 at every step of this recursion. 39 00:02:21,130 --> 00:02:26,130 So the example we just walked through was applying k-means at every step. 40 00:02:26,130 --> 00:02:28,990 So doing a clustering at one level. 41 00:02:28,990 --> 00:02:31,100 Digging down into each one of those clusters. 42 00:02:31,100 --> 00:02:34,760 Applying k-means again and again and again and again. 43 00:02:34,760 --> 00:02:39,440 Well another question is how many splits to be made at each level. 44 00:02:39,440 --> 00:02:42,810 So that's equivalent to saying what is the value of k we should use in 45 00:02:42,810 --> 00:02:45,380 our k-means recursion. 46 00:02:45,380 --> 00:02:49,380 In the example that we just looked at, we were using k equals 2 at every level, but 47 00:02:49,380 --> 00:02:50,880 that's a choice that we have to specify. 48 00:02:51,960 --> 00:02:55,380 And finally, we have to decide when to continue splitting or 49 00:02:55,380 --> 00:02:57,260 stopping this process. 50 00:02:57,260 --> 00:03:02,500 And there are a number of heuristics that people look at for the stopping criterion. 51 00:03:02,500 --> 00:03:07,030 One is to think about whether a cluster falls below a certain size. 52 00:03:07,030 --> 00:03:12,570 So there are fewer members in that cluster than its pre-specified threshold, and 53 00:03:12,570 --> 00:03:17,100 then if that condition is satisfied, you would just stop splitting that cluster. 54 00:03:17,100 --> 00:03:19,610 Another is to look at the spread of the cluster. 55 00:03:19,610 --> 00:03:24,080 So if the distance from the cluster center 56 00:03:24,080 --> 00:03:28,870 to the furthest point in that cluster falls below some threshold, 57 00:03:28,870 --> 00:03:31,650 then you say okay, well it's a compact enough cluster for 58 00:03:31,650 --> 00:03:35,860 what I'm hoping to get out, and then you stop splitting that cluster. 59 00:03:35,860 --> 00:03:39,640 Or another heuristic is to just continue splitting 60 00:03:39,640 --> 00:03:43,980 until the number of clusters that you've pre-specified is reached. 61 00:03:43,980 --> 00:03:46,860 But of course in this case you get back to having to pre-specify 62 00:03:46,860 --> 00:03:48,250 how many clusters you want. 63 00:03:48,250 --> 00:03:51,630 Which might have been the thing you were trying to avoid in the first place. 64 00:03:51,630 --> 00:03:53,819 But all of these are just heuristics and 65 00:03:53,819 --> 00:03:57,865 often a lot of application-specific intuition has to come into play and 66 00:03:57,865 --> 00:04:02,308 sometimes some manual intervention to really get at the type of granularity you 67 00:04:02,308 --> 00:04:05,117 want to come out of the hierarchical clustering. 68 00:04:05,117 --> 00:04:09,169 [MUSIC]