So, we've discussed sensitivity to initialization. Now let's turn to a method that provides a smarter and more reliable way of initializing our algorithm, and this is referred to as k-means++.

How we initialize our k-means algorithm is really, really critical to the quality of the solution found. One way to think about a smart initialization is as follows. The first step is to select the first cluster center just randomly amongst all our data points, defining it to be one of those data points chosen at random. The second step is, for each one of our observations, to compute the distance to the nearest cluster center. On the first pass there's only one cluster center, so it's just the distance to that cluster center. But if there are multiple cluster centers as we go through this initialization scheme, then we maintain the distance to the closest cluster center. Then, when we go to sample our next cluster center, we choose it with probability proportional to those distances squared. So we bias toward selecting an observation that's far from the existing cluster centers. And then we repeat these steps until we've sampled k different cluster centers.

So let's visualize what this means. Again, here are the data points that we wish to cluster, the same example that we've been using. The first step of our k-means++ algorithm is just to select one of these data points at random and call it the first cluster center. The next step of the algorithm is to compute distances to this cluster center. Then, based on the square of these distances, we sample a new cluster center. So remember that we're going to be more likely to select a data point as a cluster center if that data point is far away, and the fact that the algorithm looks at the squared distances makes that even more exaggerated. So we're very likely to sample one of the data points that's far from this cluster center as the next cluster center. And so maybe we select this observation as the next cluster center; that's going to be our green cluster.
Now we repeat this process, where we compute the distance to the nearest cluster center. So maybe these are the distances that we compute. It gets a little bit hard to tell, I think, whether this is a smaller distance than that one, perhaps not, but you get the point. Then, based on this, we again compute the square of each of these distances and sample a point. And this one here is really far away from its nearest cluster center, so it's very likely to get selected, and indeed, in this example, we select it. So the initialization of the three cluster centers for our k-means algorithm is going to be at these locations.

Running k-means++ to initialize our k-means algorithm is definitely more computationally costly than just randomly selecting a set of cluster centers. But the subsequent k-means algorithm that we run often converges more rapidly. And so k-means++ tends to improve the quality of the local optimum found and, in addition, to reduce the runtime, because it converges more rapidly to that high-quality solution. So in practice, running k-means++ tends to be a really, really helpful thing to do.
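To make the sampling step concrete, here is a minimal sketch of the distance-squared sampling initialization described above. It is only an illustration under simple assumptions (NumPy arrays and Euclidean distance), not the course's reference implementation; the function name kmeans_plus_plus_init and its arguments are hypothetical.

    import numpy as np

    def kmeans_plus_plus_init(X, k, seed=None):
        # Illustrative sketch of k-means++ initialization (D^2 sampling).
        rng = np.random.default_rng(seed)
        n = X.shape[0]

        # Step 1: pick the first center uniformly at random among the data points.
        centers = [X[rng.integers(n)]]

        while len(centers) < k:
            # Step 2: for each observation, compute the distance to its
            # nearest existing cluster center.
            diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
            dists = np.min(np.linalg.norm(diffs, axis=2), axis=1)

            # Step 3: sample the next center with probability proportional to
            # the squared distance, biasing toward points far from all centers.
            probs = dists**2 / np.sum(dists**2)
            centers.append(X[rng.choice(n, p=probs)])

        # Step 4: the loop repeats until k centers have been sampled.
        return np.asarray(centers)

The returned centers would then seed the usual k-means iterations. In scikit-learn, for example, KMeans(n_clusters=k, init="k-means++") applies this initialization scheme, and it is the default setting.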