Let's now turn to applying MapReduce to k-means. First, let's recall the two steps of k-means. There's one step, which we'll now refer to as the classify step, where we assign each data point to a cluster. Then we have what we'll now call the recenter phase, where we take all the data points that were assigned to a given cluster and use them to update our cluster centers.

Well, let's think about these steps within the context of MapReduce. In particular, what we see is that the first step is a data-parallel operation. That is, for every data point, once you give me the cluster centers, I can independently perform the assignment of that data point to a cluster center, and it does not depend at all on the other data points. So in particular, the mapper is going to emit (cluster label, data point) pairs, where the cluster label is going to serve as the key and the data point is going to serve as the value in this key-value pair.

Then when we turn to the recenter step, we see that this is a step where we're performing an aggregation. And this aggregation is independent across different cluster labels, so across different keys. In particular, for every cluster, I simply look at the data assigned to that cluster, sum all the values, and normalize. And so that can be implemented using a reduce step.

So let's dig into each one of these steps in a little bit more detail. Let's start with the classify step and look at the mapper for this step. What the mapper takes in are the set of cluster centers and a single data point. The mapper then computes the distance between that data point and each one of the cluster centers, and returns the index of the cluster center that's closest to that data point. So in particular, it emits a (cluster label, data point) pair, where again the cluster label serves as the key and the data point serves as the value. For example, maybe we would emit (2, [17, 0, 1, 7, 0, 0, 5]) if this data point, which might have been a word count vector with counts 17, 0, 1, 7, 0, 0, 5 on some set of words in a vocabulary, is assigned to cluster 2. So this is if z_i = 2.

Okay, so hopefully from this it's easy to see how we can do this classify step using a map phase.
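To make this concrete, here is a minimal sketch of the classify-step mapper in Python, assuming the data points and cluster centers are NumPy arrays. The names kmeans_map, centers, and x are illustrative, not from any particular MapReduce library:

    import numpy as np

    def kmeans_map(centers, x):
        # Classify step: compute the distance from data point x to each
        # cluster center, then emit the pair (closest label, data point),
        # where the label is the key and the point is the value.
        distances = [np.linalg.norm(x - mu) for mu in centers]
        label = int(np.argmin(distances))
        return (label, x)

For the word count example above, kmeans_map(centers, np.array([17, 0, 1, 7, 0, 0, 5])) would return the pair (2, array([17, 0, 1, 7, 0, 0, 5])) whenever center 2 happens to be the closest one.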
Then for the recenter step, we see that we're just aggregating all the data points within any given cluster, dividing by the total number of data points, and computing this as the new mean for that cluster. So our reducer is going to take in a given cluster label, which remember is our key, and a list of values associated with that key. Here, these are all the data points assigned to cluster j, so they have key j. What this reducer does is it starts with the total mass in the cluster being 0. We also keep the total number of observations in the cluster, and we initialize that to be 0 as well. Then for every value x in this list, we're simply going to increase the total mass by an amount x, and then increase the total count of observations in the cluster by 1. And then we're going to return a key-value pair where the key is the cluster label and the value is the new mean, which is the total mass divided by the total number of observations in the cluster.
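Following the same conventions as the mapper sketch above, a minimal version of this reducer might look like the following (again, the names kmeans_reduce, points, total_mass, and count are illustrative):

    def kmeans_reduce(cluster_label, points):
        # Recenter step: average all the data points assigned to one cluster.
        # points is the list of values that were emitted with this key.
        total_mass = np.zeros_like(points[0], dtype=float)  # running sum of points
        count = 0                                           # observations in cluster
        for x in points:
            total_mass += x
            count += 1
        # Emit (cluster label, new mean) as the output key-value pair.
        return (cluster_label, total_mass / count)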
So again, we see how our recenter step fits very nicely within the reduce phase.

So that's MapReduce for k-means. But there are a couple of things worth highlighting. One is the fact that k-means is an iterative algorithm: you run classify, recenter, classify, recenter, again and again until convergence. So this actually requires an iterative version of MapReduce, which is a non-standard formulation, but it's possible. The other thing to remember is that the mapper gets the entire set of cluster means, and there's lots of data to plow through. In the very general formulation, or the naive implementation, every mapper call gets just a single data point, and there are lots and lots of data points, so this is a lot of data to be passing around. A more efficient implementation is, per mapper call, to give a whole set of data points to classify, a whole set of points to plow through. That would be a much more efficient implementation of MapReduce for k-means.

But to sum up, the steps involved in k-means fit really nicely within this MapReduce framework. The map step corresponds to our classification step, which is data parallel over our data points, just assigning data points to cluster labels. And the reduce step corresponds to the recenter step in k-means, which is data parallel over cluster labels, which represent the keys in this example: for every cluster, we aggregate all the data within that cluster and divide by the total number of observations, which is equivalent to computing the mean.
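As a final illustration, here is a rough sketch of how these pieces might be tied together, with the iteration handled by an outer driver loop and the shuffle phase simulated by grouping pairs by key. This is only a serial simulation under the same assumptions as the sketches above; in a real cluster each mapper and reducer call would run on a separate worker, and a real driver would check for convergence rather than run a fixed number of iterations. The batched mapper reflects the efficiency point above:

    def kmeans_map_batch(centers, batch):
        # More efficient mapper: classify a whole batch of points per
        # mapper call, instead of a single point per call.
        return [kmeans_map(centers, x) for x in batch]

    def kmeans_driver(batches, centers, num_iters=10):
        for _ in range(num_iters):
            # Map phase: classify every point (data parallel over batches).
            pairs = [pair for batch in batches
                     for pair in kmeans_map_batch(centers, batch)]
            # Shuffle phase: group data points by cluster label (the key).
            groups = {}
            for label, x in pairs:
                groups.setdefault(label, []).append(x)
            # Reduce phase: recenter each cluster (data parallel over keys).
            for label, points in groups.items():
                centers[label] = kmeans_reduce(label, points)[1]
        return centers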