1
00:00:00,000 --> 00:00:04,683
[MUSIC]

2
00:00:04,683 --> 00:00:07,880
To start with,
let's discuss divisive clustering.

3
00:00:07,880 --> 00:00:11,310
And in pictures, what we can think
of is that at some granularity,

4
00:00:11,310 --> 00:00:15,750
our data can be pretty well
described as a set of clusters.

5
00:00:15,750 --> 00:00:20,240
So maybe it could even be described
as a set of elliptical clusters using

6
00:00:20,240 --> 00:00:24,700
the type of mixture models, in particular
mixture of Gaussians we described before.

7
00:00:24,700 --> 00:00:29,120
But then when we dig down to a finer
granularity, what we see is that

8
00:00:29,120 --> 00:00:34,400
each individual cluster is actually better
described as a set of clusters itself.

9
00:00:34,400 --> 00:00:38,150
And we can keep recursing this idea
where we take each cluster at a given

10
00:00:38,150 --> 00:00:43,600
granularity, dig down and
specify it as a set of clusters itself.

11
00:00:43,600 --> 00:00:46,570
And this is what divisive clustering does.

12
00:00:46,570 --> 00:00:49,170
So as an example, one

13
00:00:49,170 --> 00:00:53,870
very straightforward approach is to just
recursively apply R k-means algorithm.

14
00:00:53,870 --> 00:00:57,410
So one application that you're
going to look at in your assignment

15
00:00:57,410 --> 00:01:01,190
is clustering Wikipedia articles,
which we've looked at in past assignments.

16
00:01:01,190 --> 00:01:04,720
But here we're going to take the set
of all these Wikipedia articles and

17
00:01:04,720 --> 00:01:08,370
then we're going to apply k-means,
which is k equals 2, as an example.

18
00:01:08,370 --> 00:01:13,185
So just split into two clusters, and maybe
at the first split, what we see is that

19
00:01:13,185 --> 00:01:17,290
k-means separates all these articles
into articles about athletes and

20
00:01:17,290 --> 00:01:19,130
articles about non-athletes.

21
00:01:19,130 --> 00:01:22,270
Of course these are learn clusters, we
don't have these labels, these are things

22
00:01:22,270 --> 00:01:28,320
that we infer digging down and looking
into the contents of these clusters.

23
00:01:28,320 --> 00:01:31,164
Okay, well then we recurse
this process where for

24
00:01:31,164 --> 00:01:34,900
each one of these clusters we run
k-means with k equals 2 again.

25
00:01:34,900 --> 00:01:37,070
And so maybe at the next level,

26
00:01:37,070 --> 00:01:39,770
we have the following set of clusters
where there is a cluster for

27
00:01:40,820 --> 00:01:45,630
specific athletes for baseball or
athletes for soccer and ice hockey.

28
00:01:45,630 --> 00:01:51,030
And then when we look at the non-athlete
cluster, that splits into a cluster for

29
00:01:51,030 --> 00:01:53,080
musicians, artists and actors.

30
00:01:53,080 --> 00:01:57,260
And another one's for scholars,
politicians and government officials.

31
00:01:57,260 --> 00:01:59,660
And then we can think about
recursing this process.

32
00:01:59,660 --> 00:02:03,710
So, for example, maybe we want to
further split the cluster that's about

33
00:02:03,710 --> 00:02:08,210
soccer players and ice hockey players to
get at a granularity that better matches

34
00:02:08,210 --> 00:02:11,330
what we'd like to come out
of our clustering procedure.

35
00:02:11,330 --> 00:02:13,460
Okay, well,
when we're doing divisive clustering,

36
00:02:13,460 --> 00:02:15,770
there are a number of
choices that we have to make.

37
00:02:15,770 --> 00:02:19,180
One is we have to think about what
algorithm are we going to apply

38
00:02:19,180 --> 00:02:21,130
at every step of this recursion.

39
00:02:21,130 --> 00:02:26,130
So the example we just walked through
was applying k-means at every step.

40
00:02:26,130 --> 00:02:28,990
So doing a clustering at one level.

41
00:02:28,990 --> 00:02:31,100
Digging down into each
one of those clusters.

42
00:02:31,100 --> 00:02:34,760
Applying k-means again and
again and again and again.

43
00:02:34,760 --> 00:02:39,440
Well another question is how many
splits to be made at each level.

44
00:02:39,440 --> 00:02:42,810
So that's equivalent to saying what
is the value of k we should use in

45
00:02:42,810 --> 00:02:45,380
our k-means recursion.

46
00:02:45,380 --> 00:02:49,380
In the example that we just looked at, we
were using k equals 2 at every level, but

47
00:02:49,380 --> 00:02:50,880
that's a choice that we have to specify.

48
00:02:51,960 --> 00:02:55,380
And finally, we have to decide
when to continue splitting or

49
00:02:55,380 --> 00:02:57,260
stopping this process.

50
00:02:57,260 --> 00:03:02,500
And there are a number of heuristics that
people look at for the stopping criterion.

51
00:03:02,500 --> 00:03:07,030
One is to think about whether
a cluster falls below a certain size.

52
00:03:07,030 --> 00:03:12,570
So there are fewer members in that cluster
than its pre-specified threshold, and

53
00:03:12,570 --> 00:03:17,100
then if that condition is satisfied, you
would just stop splitting that cluster.

54
00:03:17,100 --> 00:03:19,610
Another is to look at
the spread of the cluster.

55
00:03:19,610 --> 00:03:24,080
So if the distance from the cluster center

56
00:03:24,080 --> 00:03:28,870
to the furthest point in that
cluster falls below some threshold,

57
00:03:28,870 --> 00:03:31,650
then you say okay,
well it's a compact enough cluster for

58
00:03:31,650 --> 00:03:35,860
what I'm hoping to get out, and
then you stop splitting that cluster.

59
00:03:35,860 --> 00:03:39,640
Or another heuristic is to
just continue splitting

60
00:03:39,640 --> 00:03:43,980
until the number of clusters that
you've pre-specified is reached.

61
00:03:43,980 --> 00:03:46,860
But of course in this case you
get back to having to pre-specify

62
00:03:46,860 --> 00:03:48,250
how many clusters you want.

63
00:03:48,250 --> 00:03:51,630
Which might have been the thing you
were trying to avoid in the first place.

64
00:03:51,630 --> 00:03:53,819
But all of these are just heuristics and

65
00:03:53,819 --> 00:03:57,865
often a lot of application-specific
intuition has to come into play and

66
00:03:57,865 --> 00:04:02,308
sometimes some manual intervention to
really get at the type of granularity you

67
00:04:02,308 --> 00:04:05,117
want to come out of
the hierarchical clustering.

68
00:04:05,117 --> 00:04:09,169
[MUSIC]