Finally, let's wrap up by providing a few more details on agglomerative clustering.

One thing worth discussing is the set of choices we have to make in agglomerative clustering. One is the distance metric we're going to use to compute the pairwise distances between points. Another is the linkage function we're going to use to compute the resulting distances between pairs of clusters. For single linkage, that linkage function is just the minimum distance between any pair of points appearing in the two clusters. And finally, another important question is where we're going to cut the dendrogram. One version of that question is simply where to place a horizontal cut, but we don't have to make a horizontal cut at all; we can make some more irregular cut through the tree. This is really where the art of the method comes into play, and often you have to build in a lot of application-specific information to decide how to cut the dendrogram.

If you're using hierarchical clustering for a task like visualizing the data, then it's often preferable to produce a small number of clusters, and that would define how you cut the dendrogram. In contrast, if you're trying to do something like outlier detection, where you want to detect whether some data point is very different from the other data points, then you might want to think of other ways of defining the number of clusters, or equivalently, how to cut the dendrogram. In this case, one thing you can do is just use a distance threshold, which is the same method we described in the previous section. Or you can do fancier things, like using the inconsistency coefficient, where you compare the height of any given merge to the average height of the merges below that point. If that height is much larger than the average height of the merges below, it indicates that this particular merge is joining clusters that are very far apart relative to the merges that came before it. So that might be a reasonable point at which to cut the dendrogram and keep those clusters as separate clusters.
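As a rough illustration of these two ways of cutting a dendrogram, here is a minimal sketch using SciPy's hierarchical clustering utilities. The toy data and the threshold values (1.0 and 1.5) are arbitrary choices for illustration, not recommendations for real data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, inconsistent

# Toy data (made up for illustration): two tight groups plus a far-away stray point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [9.0, 0.0]])

# Single linkage: cluster-to-cluster distance = minimum pairwise point distance.
Z = linkage(X, method='single', metric='euclidean')

# Cut 1: a fixed distance threshold, as in the previous section.
labels_threshold = fcluster(Z, t=1.0, criterion='distance')

# Cut 2: the inconsistency coefficient, which compares each merge height to the
# average height of merges up to `depth` levels below it.
print(inconsistent(Z, d=2))  # last column is the inconsistency coefficient per merge
labels_inconsistent = fcluster(Z, t=1.5, criterion='inconsistent', depth=2)

print(labels_threshold)
print(labels_inconsistent)
```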
As a little visualization of that, suppose we have some data points that are fairly far apart, along with one very tightly clustered set of data points here and another very tightly clustered set of points over here. If we simply cut the dendrogram at a fixed threshold, we might end up merging these two tight groups and treating them as a single cluster. But based on the structure of the data, what we might really want is to keep them as separate clusters, because within each group the distances between points are really, really small, whereas the distance of the merge between the two groups is much larger than the distances of the merges within each group. A cut of the dendrogram based on something like the inconsistency coefficient could allow you to do exactly that.

To be clear, though, no cutting method is the correct cutting method. Instead, as we said, a lot of application-specific intuition or information comes into play, and a lot of heuristics are often used to determine how to cut the dendrogram. The results produced by the hierarchical clustering method, whatever is output as the set of clusters, are of course very sensitive to how you cut that dendrogram; you can get very different solutions, so you have to think carefully about that choice. You can often think of hierarchical clustering as a way to produce many different possible clusterings as you vary a threshold, or more generally as you vary the way in which you cut the dendrogram.

There are also some important computational considerations when we're thinking about applying hierarchical clustering methods. For agglomerative clustering, we have to compute distances between every pair of points in order to compute the distances between pairs of clusters, and this is really expensive, just as we explored in our nearest neighbor methods for retrieval. You can show that a brute-force algorithm for agglomerative clustering has complexity on the order of N-squared log N, where N is the number of data points. But just as we talked about KD-trees and efficient ways of pruning the search space for nearest neighbors, here we can also think about using the triangle inequality to rule out certain potential merge moves. And if you do some really fancy things, the best-known algorithm for hierarchical clustering has complexity on the order of N-squared instead of N-squared log N, but that is still quite a lot if N is very large.
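One practical consequence of this view is that once you have paid the cost of building the dendrogram, exploring many different cuts is comparatively cheap, since each cut only reuses the stored merge heights. Here is a small sketch, again with made-up data, that sweeps a range of distance thresholds and reports how many clusters each cut produces.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Made-up data: three blobs of 20 points each.
X = np.vstack([rng.normal(center, 0.3, size=(20, 2))
               for center in ([0.0, 0.0], [4.0, 4.0], [9.0, 0.0])])

# Build the dendrogram once (this is the expensive step).
Z = linkage(X, method='single', metric='euclidean')

# Each entry of Z[:, 2] is a merge height; sweeping a threshold across this
# range walks through the nested family of clusterings the dendrogram encodes.
for t in np.linspace(Z[:, 2].min(), Z[:, 2].max(), num=6):
    labels = fcluster(Z, t=t, criterion='distance')
    print(f"cut at height {t:.2f} -> {labels.max()} clusters")
```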
There are other considerations based on the structure of your data, too. For example, one thing that can happen with single linkage is something called chaining, where you have a sequence of points in which each individual distance between neighboring points is really small, so the points end up getting merged, but if you look at the resulting cluster, it contains members that are very, very far apart. In a lot of applications, having this type of cluster might not be desirable.

One approach to fixing this chaining issue is to consider other linkage functions. One option is complete linkage, where instead of computing the minimum distance between any pair of points appearing in the two clusters, you look at the maximum distance. In this case, if we had looked at the maximum distance along that really long chain, we wouldn't have ended up merging those clusters, because the maximum distance would have been very large. As an alternative, you can look at the Ward criterion, which considers the variance of the points within a cluster. But both of these alternative linkage functions end up implicitly imposing, in essence, shape restrictions on the clusters. They do help with the chaining issue, but you get back to clusters that look closer to the compact groups you'd expect from something like a mixture-of-Gaussians model: you're assuming a specific form that your clusters are going to take, and in essence trying to rule out other shapes that might actually be what defines the cluster. So again, there are a number of choices you have to make, and there are implications to those choices, so think carefully when you're specifying exactly how you're performing the hierarchical clustering. A comparison of these linkage functions on a chaining-prone example is sketched below.

But in summary, hierarchical clustering can be really powerful. It performs clustering at multiple granularities, and you can get quite descriptive clustering results out of this procedure.
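Here is the sketch referenced above: it compares single, complete, and Ward linkage on a made-up chain-shaped dataset. Single linkage tends to keep the whole chain together despite its large end-to-end extent, while complete and Ward linkage tend to break it into more compact pieces; the data and the choice of two flat clusters are purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up data: a long chain of points with tiny gaps between neighbors but a
# large end-to-end distance, plus a small compact blob nearby.
chain = np.column_stack([np.linspace(0.0, 10.0, 30), np.zeros(30)])
blob = np.array([[12.0, 5.0], [12.1, 5.1], [11.9, 5.0], [12.0, 4.9]])
X = np.vstack([chain, blob])

for method in ['single', 'complete', 'ward']:
    Z = linkage(X, method=method)  # 'ward' assumes Euclidean distances
    labels = fcluster(Z, t=2, criterion='maxclust')  # request two flat clusters
    sizes = np.bincount(labels)[1:]
    print(f'{method:>8}: cluster sizes {sizes.tolist()}')
```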