So one way to compactly represent the results of hierarchical clustering is through something called a dendrogram. And we're going to explain the dendrogram in the context of agglomerative clustering, even though this type of representation can be used for other hierarchical clustering approaches as well.

So the dendrogram representation is as follows. Along the x-axis, we place each one of our data points, carefully ordered so that we get a very nice visual out of this. And along the y-axis, we indicate the distance between different clusters. More specifically, when we're thinking about single linkage, where we look at the minimum distance between the sets of points in any pair of clusters, the height of a specific merge point represents the distance between the two clusters being merged.

So let's describe this a little bit more. If we look at a given merge point, below it we have two separate trees, and each of those trees represents a different cluster. So in this example, we have a blue cluster: all the points colored in blue are data points assigned to that blue cluster at some iteration of the algorithm. And likewise, we have a green cluster. The height of the merge between the blue branch and the green branch indicates the minimum distance between any point in the blue cluster and any point in the green cluster. That minimum distance is where we place the merge of the blue branch with the green branch, and that specifies the height along the y-axis.

So throughout this tree, we see the different clusters that are present and the distances between those clusters as the different merges were made throughout the algorithm, going from every data point being in its own cluster all the way to all the data points being in one cluster. And likewise, if we look at any path down the dendrogram, that path indicates the cluster membership of a given data point throughout all the different merge steps: what cluster that data point belonged to, and the sequence of merges made for those clusters.

So in summary, we see that the dendrogram captures all the critical elements of the hierarchical clustering result.
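As a concrete illustration (this sketch is not from the lecture; the toy data, random seed, and plot labels are all assumptions for demonstration), here is one way to build and plot a single-linkage dendrogram with SciPy, where the height of each link is exactly the merge distance just described:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical toy data: two loose groups of 2D points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(10, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.3, size=(10, 2)),
])

# Single-linkage agglomerative clustering. Each row of Z records one merge:
# (cluster i, cluster j, distance between them, size of the merged cluster).
# The third column is the merge height plotted on the dendrogram's y-axis.
Z = linkage(X, method="single")

# Leaves (data indices) are laid out along the x-axis, reordered so that
# branches don't cross; each U-shaped link sits at its merge distance.
dendrogram(Z)
plt.ylabel("single-linkage distance at merge")
plt.show()
```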
What are those critical elements? The different cluster memberships, which cluster is merged with which, and the distances between those clusters at the point at which they merged.

Well, one important thing we can do with the dendrogram is extract a partition. Because remember, if we run agglomerative clustering from beginning all the way to the end, we start with all data points in separate, individual clusters and end with all data points in one cluster. And what we really want is to produce some clustering of our data points that's somewhere in between: not at the granularity of every point being its own cluster, and not at the very coarse granularity of everything being one cluster.

So this leads to the question of how we extract a partition. How do we define a set of clusters to produce from this hierarchical clustering procedure? One really simple approach is to perform a cut along the y-axis of the dendrogram. Then every branch that crosses the line we chose defines a separate cluster. So in this example, we get fuchsia, blue, green, orange, and gray clusters.

But remember, in this visualization each of these data points just represents a different data index in our dataset. What we can do is go back to our original feature space and visualize what the resulting clusters look like. So if our data were in 2D, like all the examples we've been giving because of our nice 2D slide representation, this might be the clustering associated with the cut shown on the previous slide. And as we vary where this cut point is, we're looking at different possible clusterings of our dataset at different granularities, going from really, really fine granularity all the way up to very coarse granularities.

So let's spend a little time thinking about what it means to cut the dendrogram at some level D*. Well, what we're saying is that, for the resulting clusters produced by this cut, there is no pair of clusters at a distance less than D* that has not already been merged. So that means D* is the minimum distance between our clusters at this level of the clustering.
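Continuing the hedged sketch from above (again, the data, the cut height d_star, and the plot are illustrative assumptions, not values from the lecture), SciPy's fcluster performs exactly this kind of cut: with criterion="distance", every merge below the threshold is applied, so no pair of the returned clusters sits at a distance less than d_star without having been merged:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster

# Same hypothetical toy data as in the earlier sketch.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(10, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.3, size=(10, 2)),
])
Z = linkage(X, method="single")

# Cut the dendrogram at height D* (the value here is illustrative). With
# criterion="distance", all merges at heights up to d_star are applied, so
# the resulting flat clusters are pairwise at least d_star apart.
d_star = 1.0
labels = fcluster(Z, t=d_star, criterion="distance")

# labels[i] is the flat cluster id of data point i. Coloring the points by
# label visualizes the partition back in the original 2D feature space.
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title(f"clusters from cutting the dendrogram at D* = {d_star}")
plt.show()
```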