[MUSIC] So we've talked about the k-means clustering algorithm, and then a smart initialization for the algorithm. But now let's turn to assessment: how do we measure the quality of a clustering, and what's a method for choosing the number of clusters?

As an example, here are two of the clusterings that we found based on two different initializations of our k-means algorithm. And the question is, which one do I prefer? Which one is better?

Well, to assess this, we can return to the k-means objective: what are we trying to minimize here? Well, k-means is trying to minimize the sum of the squared distances between our observations and the cluster centers. So we're minimizing over the set of all cluster indicators for our observations x_i, and the set of all cluster centers mu_j, that is, over all possible assignments of observations to clusters and all possible placements of those cluster centers, the sum of these squared distances. Here, the inner sum is the sum of squared distances within cluster j, and the outer sum is over all clusters.

So we can think about using this as a measure of quality for a given clustering. In particular, if the sum of the squared distances is small, that's going to represent a better clustering according to the k-means objective than one where it's large. To see this, here we have all these small distances in the pink, or fuchsia, cluster, and all these little distances in the green cluster. But when we look at the blue cluster, we have some small distances, but a bunch of really big distances. So when we sum the squares of all these distances, we get a very large number. Whereas here, in each cluster, we're summing over smaller numbers on average.

So we're going to say that this clustering here is more heterogeneous, meaning that there are more dissimilar objects within a given cluster. Whereas here, we end up with tighter clusters, where things are more homogeneous. So we're going to refer to this metric as cluster heterogeneity.
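To make the objective concrete: for each cluster j with center mu_j, we sum the squared distances ||x_i - mu_j||^2 over the observations x_i assigned to it, then sum across clusters. Here is a minimal sketch of that computation in Python with NumPy; the function and variable names are illustrative, not from the lecture:

```python
import numpy as np

def heterogeneity(data, centers, assignments):
    """K-means objective: sum over clusters of the squared distances
    from each observation to its assigned cluster center.

    data:        (n, d) array of observations x_i
    centers:     (k, d) array of cluster centers mu_j
    assignments: length-n array of cluster indices in {0, ..., k-1}
    """
    total = 0.0
    for j in range(len(centers)):
        members = data[assignments == j]   # observations assigned to cluster j
        if len(members) > 0:
            total += np.sum((members - centers[j]) ** 2)
    return total
```

A clustering with a smaller value of this sum is the better one according to the k-means objective.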
So, what's the sum of the heterogeneities over all of our different clusters? We want this sum to be small, as in examples like this one where we have these tight, nice clusters. And just to be clear, even though we're calling this a measure of quality, lower is better. We want smaller sums of squared distances.

Okay, so let's just spend some time thinking about what happens to this heterogeneity metric as k, the number of clusters, increases. Well, when there are more clusters available to the algorithm, we can refine the clusters more and more to our observed data. And so this can lead to our classic notion of overfitting, though not quite classic, because now we're in an unsupervised setting. But it's a similar kind of idea, where the model is very, very specific to the observed data in a way that might not be descriptive in general.

So as an extreme case of this, imagine that we set the number of clusters equal to the number of observed data points. Then what we can do is, for each one of our observations, define that observation to be a cluster center. So what's the heterogeneity of that clustering? It's 0. And why is it 0? Because we're looking at the sum, over all of our clusters, of the squared distance from each observation to its cluster center, and that distance is 0: each cluster center is the observation itself. A quick check of this extreme case appears in the snippet below.

So, in general, the lowest possible heterogeneity that we can achieve decreases as we increase k. But I want to emphasize that for any given run of our k-means algorithm, let's say we run it with k + 1 clusters, the heterogeneity of that converged solution could actually be larger than the heterogeneity of a converged solution where we ran the algorithm with just k clusters. But when we're thinking over the set of possible heterogeneities, the trend is that it decreases with k.

So a question is, how do we think about choosing the number of clusters? Well, we said that the lowest possible heterogeneity we can achieve decreases as we increase the number of clusters.
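In this hypothetical snippet (the data is made up purely for illustration), every observation is its own cluster center, so every point-to-center distance, and hence the whole objective, is exactly 0:

```python
import numpy as np

data = np.random.randn(10, 2)       # 10 toy observations in 2 dimensions
centers = data.copy()               # k = n: each observation is its own center
assignments = np.arange(len(data))  # observation i is assigned to cluster i

# Squared distance from each point to its assigned center: all zeros.
sq_dists = np.sum((data - centers[assignments]) ** 2, axis=1)
print(sq_dists.sum())               # 0.0
```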
And we said lower heterogeneity is better, but then we also said, well, we don't want too many clusters, because that won't really describe groupings that are very natural. We'd just over-partition the data into these little groups and get very low heterogeneity, but not a good description, or not a good structure extracted from the data. So somehow we need to trade off between these two things.

And so what we can do is think about a heuristic, where we're just going to look at something called the elbow of the curve. It's a little bit unclear whether the elbow is here or here, so I'll just say, choose one of these values of k. Maybe I would prefer this value of k, because the difference in heterogeneity relative to values much, much larger is not very big, but it should provide a more parsimonious description of my data than using that much, much larger k. But this is just a heuristic.

So, in general, there's really no right answer for what's the right value of k, and you're going to explore this in depth in the assignment associated with this module.

There's one last thing I wanted to mention here, which is that so far this plot is a plot of the lowest possible heterogeneity. Of course, we're not going to compute all possible clusterings for every value of k to define this curve. So in practice, what people do is multiple random restarts of the algorithm, whether we're using k-means++ or random initializations, or a mix of those. Do a bunch of different restarts for every value of k, look at the clustering that provides the lowest measure of heterogeneity, plot that as a function of k, and then choose k based on that.

But again, that itself is very computationally intensive, so in practice people do a wide range of things. But it's worth mentioning that how you choose k really, really matters for what the resulting clusters represent. If you choose k = 2, you're just dividing your data into two groups. If there really are inherently, very clearly separated clusters beyond that, you're not going to capture them.
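As a sketch of the restarts-plus-elbow procedure just described, assuming scikit-learn and matplotlib are available (the transcript doesn't name any library), one could run several k-means++ restarts per value of k and plot the best heterogeneity found; scikit-learn's inertia_ attribute is exactly this within-cluster sum of squared distances:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = np.random.randn(200, 2)  # stand-in for your dataset

ks = range(1, 11)
best_heterogeneity = []
for k in ks:
    # n_init=10 restarts with k-means++ initialization; .inertia_ holds
    # the heterogeneity of the best converged solution among the restarts.
    km = KMeans(n_clusters=k, n_init=10).fit(data)
    best_heterogeneity.append(km.inertia_)

plt.plot(list(ks), best_heterogeneity, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('lowest heterogeneity found')
plt.show()  # choose k near the elbow, where the curve starts to flatten
```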
But if you choose a million clusters, you're going to get very, very finely tuned clusters (obviously depending on the size of your data; I'm trying to give an example that's very large, so bear with me). But the point is that you really need to look at the data and think about what these things mean in your application. And when it's hard to look at the data, you can start doing things like the heuristic we're talking about here, where you select the elbow of this curve. But again, you'll get a lot of hands-on practice with this in the assignment. [MUSIC]