Let's now turn to applying MapReduce to k-means. First, let's recall the two steps of k-means. There's one step, which we'll now refer to as the classify step, where we assign each data point to a cluster. Then we have what we'll now call the recenter phase, where we take all the data points that were assigned to a given cluster and use them to update our cluster centers.

Well, let's think about these steps within the context of MapReduce. In particular, what we see is that the first step is a data-parallel operation. That is, for every data point, once you give me the cluster centers, I can independently perform the assignment of that data point to a cluster center, and it does not depend at all on the other data points. So in particular, the mapper is going to emit (cluster label, data point) pairs, where the cluster label is going to serve as the key and the data point is going to serve as the value in this key-value pair.

Then when we turn to the recenter step, we see that this is a step where we're performing an aggregation. And this aggregation is independent across different cluster labels, so across different keys. In particular, for every cluster, I simply look at the data assigned to that cluster, sum all the values, and normalize. And so that can be implemented using a reduce step.

So let's dig into each one of these steps in a little bit more detail. Let's start with the classify step and look at the mapper for this step. What the mapper takes in are the set of cluster centers and a single data point. The mapper then computes the distance between that data point and each one of the cluster centers, and returns the index of the cluster center that's closest to that data point. So in particular, it emits a (cluster label, data point) pair, where again the cluster label serves as the key and the data point serves as the value. For example, maybe we would emit (2, [17, 0, 1, 7, 0, 0, 5]) if this data point, which might have been a word count vector with counts 17, 0, 1, 7, 0, 0, 5 on some set of words in a vocabulary, is assigned to cluster 2. So this is if z_i = 2.

Okay, so hopefully from this it's easy to see how we can do this classify step using a map phase.
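To make this concrete, here is a minimal sketch of the classify-step mapper in Python, assuming the data points and cluster centers are NumPy arrays. The names kmeans_map, centers, and x are illustrative, not from any particular MapReduce library:

    import numpy as np

    def kmeans_map(centers, x):
        # Classify step: compute the distance from data point x to each
        # cluster center, then emit the pair (closest label, data point),
        # where the label is the key and the point is the value.
        distances = [np.linalg.norm(x - mu) for mu in centers]
        label = int(np.argmin(distances))
        return (label, x)

For the word count example above, kmeans_map(centers, np.array([17, 0, 1, 7, 0, 0, 5])) would return the pair (2, array([17, 0, 1, 7, 0, 0, 5])) whenever center 2 happens to be the closest one.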
Then for the recenter step, we see that we're just aggregating all the data points within any given cluster, dividing by the total number of data points, and computing this as the new mean for that cluster. So our reducer is going to take in a given cluster label, which remember is our key, and a list of values associated with that key. Here, these are all the data points assigned to cluster j, so they have key j. What this reducer does is it starts with the total mass in the cluster being 0. We also keep the total number of observations in the cluster, and we initialize that to be 0 as well. Then for every value x in this list, we're simply going to increase the total mass by an amount x, and then increase the total count of observations in the cluster by 1. And then we're going to return a key-value pair where the key is the cluster label and the value is the new mean, which is the total mass divided by the total number of observations in the cluster.
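Following the same conventions as the mapper sketch above, a minimal version of this reducer might look like the following (again, the names kmeans_reduce, points, total_mass, and count are illustrative):

    def kmeans_reduce(cluster_label, points):
        # Recenter step: average all the data points assigned to one cluster.
        # points is the list of values that were emitted with this key.
        total_mass = np.zeros_like(points[0], dtype=float)  # running sum of points
        count = 0                                           # observations in cluster
        for x in points:
            total_mass += x
            count += 1
        # Emit (cluster label, new mean) as the output key-value pair.
        return (cluster_label, total_mass / count)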
So again, we see how our recenter step fits very nicely within the reduce phase.

So that's MapReduce for k-means. But there are a couple of things worth highlighting. One is the fact that k-means is an iterative algorithm: you run classify, recenter, classify, recenter, again and again until convergence. So this actually requires an iterative version of MapReduce, which is a non-standard formulation, but it's possible. The other thing to remember is that the mapper gets the entire set of cluster means, and there's lots of data to plow through. In the very general formulation, or the naive implementation, every mapper call gets just a single data point, and there are lots and lots of data points, so this is a lot of data to be passing around. A more efficient implementation is, per mapper call, to give a whole set of data points to classify, a whole set of points to plow through. That would be a much more efficient implementation of MapReduce for k-means.

But to sum up, the steps involved in k-means fit really nicely within this MapReduce framework. The map step corresponds to our classification step, which is data parallel over our data points, just assigning data points to cluster labels. And the reduce step corresponds to the recenter step in k-means, which is data parallel over cluster labels, which represent the keys in this example: for every cluster, we aggregate all the data within that cluster and divide by the total number of observations, which is equivalent to computing the mean.
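As a final illustration, here is a rough sketch of how these pieces might be tied together, with the iteration handled by an outer driver loop and the shuffle phase simulated by grouping pairs by key. This is only a serial simulation under the same assumptions as the sketches above; in a real cluster each mapper and reducer call would run on a separate worker, and a real driver would check for convergence rather than run a fixed number of iterations. The batched mapper reflects the efficiency point above:

    def kmeans_map_batch(centers, batch):
        # More efficient mapper: classify a whole batch of points per
        # mapper call, instead of a single point per call.
        return [kmeans_map(centers, x) for x in batch]

    def kmeans_driver(batches, centers, num_iters=10):
        for _ in range(num_iters):
            # Map phase: classify every point (data parallel over batches).
            pairs = [pair for batch in batches
                     for pair in kmeans_map_batch(centers, batch)]
            # Shuffle phase: group data points by cluster label (the key).
            groups = {}
            for label, x in pairs:
                groups.setdefault(label, []).append(x)
            # Reduce phase: recenter each cluster (data parallel over keys).
            for label, points in groups.items():
                centers[label] = kmeans_reduce(label, points)[1]
        return centers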