Finally, let's wrap up by providing a few more details on agglomerative clustering.

One thing worth discussing is the set of choices we have to make in agglomerative clustering. One is the distance metric we're going to use to compute the pairwise distances between points. Another is the linkage function we're going to use to compute the resulting distances between pairs of clusters. For single linkage, that linkage function is just the minimum distance between any pair of points appearing in the two clusters. And finally, another important question is where we're going to cut the dendrogram. One version of that question is simply where to place a horizontal cut, but we don't have to make a horizontal cut at all; we can make some more irregular cut through the tree. This is really where the art of the method comes into play, and often you have to build in a lot of application-specific information to decide how to cut the dendrogram.

If you're using hierarchical clustering for a task like visualizing the data, then it's often preferable to produce a small number of clusters, and that would define how you cut the dendrogram. In contrast, if you're trying to do something like outlier detection, where you want to detect whether some data point is very different from the other data points, then you might want to think of other ways of defining the number of clusters, or equivalently, how to cut the dendrogram. In this case, one thing you can do is just use a distance threshold, which is the same method we described in the previous section. Or you can do fancier things, like using the inconsistency coefficient, where you compare the height of any given merge to the average height of the merges below that point. If that height is much larger than the average height of the merges below, it indicates that this particular merge is joining clusters that are very far apart relative to the merges that came before it. So that might be a reasonable point at which to cut the dendrogram and keep those clusters as separate clusters.
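As a rough illustration of these two ways of cutting a dendrogram, here is a minimal sketch using SciPy's hierarchical clustering utilities. The toy data and the threshold values (1.0 and 1.5) are arbitrary choices for illustration, not recommendations for real data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, inconsistent

# Toy data (made up for illustration): two tight groups plus a far-away stray point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [9.0, 0.0]])

# Single linkage: cluster-to-cluster distance = minimum pairwise point distance.
Z = linkage(X, method='single', metric='euclidean')

# Cut 1: a fixed distance threshold, as in the previous section.
labels_threshold = fcluster(Z, t=1.0, criterion='distance')

# Cut 2: the inconsistency coefficient, which compares each merge height to the
# average height of merges up to `depth` levels below it.
print(inconsistent(Z, d=2))  # last column is the inconsistency coefficient per merge
labels_inconsistent = fcluster(Z, t=1.5, criterion='inconsistent', depth=2)

print(labels_threshold)
print(labels_inconsistent)
```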
As a little visualization of that, suppose we have some data points that are fairly far apart, along with one very tightly clustered set of data points here and another very tightly clustered set of points over here. If we simply cut the dendrogram at a fixed threshold, we might end up merging these two tight groups and treating them as a single cluster. But based on the structure of the data, what we might really want is to keep them as separate clusters, because within each group the distances between points are really, really small, whereas the distance of the merge between the two groups is much larger than the distances of the merges within each group. A cut of the dendrogram based on something like the inconsistency coefficient could allow you to do exactly that.

To be clear, though, no cutting method is the correct cutting method. Instead, as we said, a lot of application-specific intuition or information comes into play, and a lot of heuristics are often used to determine how to cut the dendrogram. The results produced by the hierarchical clustering method, whatever is output as the set of clusters, are of course very sensitive to how you cut that dendrogram; you can get very different solutions, so you have to think carefully about that choice. You can often think of hierarchical clustering as a way to produce many different possible clusterings as you vary a threshold, or more generally as you vary the way in which you cut the dendrogram.

There are also some important computational considerations when we're thinking about applying hierarchical clustering methods. For agglomerative clustering, we have to compute distances between every pair of points in order to compute the distances between pairs of clusters, and this is really expensive, just as we explored in our nearest neighbor methods for retrieval. You can show that a brute-force algorithm for agglomerative clustering has complexity on the order of N-squared log N, where N is the number of data points. But just as we talked about KD-trees and efficient ways of pruning the search space for nearest neighbors, here we can also think about using the triangle inequality to rule out certain potential merge moves. And if you do some really fancy things, the best-known algorithm for hierarchical clustering has complexity on the order of N-squared instead of N-squared log N, but that is still quite a lot if N is very large.
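One practical consequence of this view is that once you have paid the cost of building the dendrogram, exploring many different cuts is comparatively cheap, since each cut only reuses the stored merge heights. Here is a small sketch, again with made-up data, that sweeps a range of distance thresholds and reports how many clusters each cut produces.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Made-up data: three blobs of 20 points each.
X = np.vstack([rng.normal(center, 0.3, size=(20, 2))
               for center in ([0.0, 0.0], [4.0, 4.0], [9.0, 0.0])])

# Build the dendrogram once (this is the expensive step).
Z = linkage(X, method='single', metric='euclidean')

# Each entry of Z[:, 2] is a merge height; sweeping a threshold across this
# range walks through the nested family of clusterings the dendrogram encodes.
for t in np.linspace(Z[:, 2].min(), Z[:, 2].max(), num=6):
    labels = fcluster(Z, t=t, criterion='distance')
    print(f"cut at height {t:.2f} -> {labels.max()} clusters")
```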
There are other considerations based on the structure of your data, too. For example, one thing that can happen with single linkage is something called chaining, where you have a sequence of points in which each individual distance between neighboring points is really small, so the points end up getting merged, but if you look at the resulting cluster, it contains members that are very, very far apart. In a lot of applications, having this type of cluster might not be desirable.

One approach to fixing this chaining issue is to consider other linkage functions. One option is complete linkage, where instead of computing the minimum distance between any pair of points appearing in the two clusters, you look at the maximum distance. In this case, if we had looked at the maximum distance along that really long chain, we wouldn't have ended up merging those clusters, because the maximum distance would have been very large. As an alternative, you can look at the Ward criterion, which considers the variance of the points within a cluster. But both of these alternative linkage functions end up implicitly imposing, in essence, shape restrictions on the clusters. They do help with the chaining issue, but you get back to clusters that look closer to the compact groups you'd expect from something like a mixture-of-Gaussians model: you're assuming a specific form that your clusters are going to take, and in essence trying to rule out other shapes that might actually be what defines the cluster. So again, there are a number of choices you have to make, and there are implications to those choices, so think carefully when you're specifying exactly how you're performing the hierarchical clustering. A comparison of these linkage functions on a chaining-prone example is sketched below.

But in summary, hierarchical clustering can be really powerful. It performs clustering at multiple granularities, and you can get quite descriptive clustering results out of this procedure.
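Here is the sketch referenced above: it compares single, complete, and Ward linkage on a made-up chain-shaped dataset. Single linkage tends to keep the whole chain together despite its large end-to-end extent, while complete and Ward linkage tend to break it into more compact pieces; the data and the choice of two flat clusters are purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up data: a long chain of points with tiny gaps between neighbors but a
# large end-to-end distance, plus a small compact blob nearby.
chain = np.column_stack([np.linspace(0.0, 10.0, 30), np.zeros(30)])
blob = np.array([[12.0, 5.0], [12.1, 5.1], [11.9, 5.0], [12.0, 4.9]])
X = np.vstack([chain, blob])

for method in ['single', 'complete', 'ward']:
    Z = linkage(X, method=method)  # 'ward' assumes Euclidean distances
    labels = fcluster(Z, t=2, criterion='maxclust')  # request two flat clusters
    sizes = np.bincount(labels)[1:]
    print(f'{method:>8}: cluster sizes {sizes.tolist()}')
```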