In the last few videos, we talked about stochastic gradient descent and other variations of the stochastic gradient descent algorithm, including its adaptation to online learning. But all of those algorithms can be run on one machine, on one computer. Some machine learning problems are just too big to run on one machine: sometimes you have so much data that you just don't ever want to run all of it through a single computer, no matter what algorithm you would use on that computer.

So in this video, I'd like to talk about a different approach to large-scale machine learning, called the MapReduce approach. Even though we have quite a few videos on stochastic gradient descent and will spend relatively less time on MapReduce, don't judge the relative importance of MapReduce versus gradient descent based on the amount of time I spend on each idea. Many people will say that MapReduce is at least an equally important idea compared to gradient descent, and some would say an even more important one; it is just relatively simpler to explain, which is why I'm going to spend less time on it. Using these ideas, you might be able to scale learning algorithms to even far larger problems than is possible with stochastic gradient descent.

Here's the idea. Let's say we want to fit a linear regression model or a logistic regression model or some such, and let's start again with batch gradient descent, so that's our batch gradient descent learning rule. To keep the writing on this slide tractable, I'm going to assume throughout that we have m equals 400 examples.
Of course, by our standards of large-scale machine learning, m equals 400 is pretty small, and this technique is more commonly applied to problems where you have maybe closer to 400 million examples; but just to make the writing on the slide simpler, I'm going to pretend we have 400 examples. In that case, the batch gradient descent learning rule has this sum from i equals 1 through 400 over my 400 examples, and if m is large, then this is a computationally expensive step.

So here is what the MapReduce idea does. I should say that the MapReduce idea is due to two researchers, Jeff Dean and Sanjay Ghemawat. Jeff Dean, by the way, is one of the most legendary engineers in all of Silicon Valley, and he built a large fraction of the architectural infrastructure that all of Google runs on today.

Here's the MapReduce idea. Let's say I have some training set, denoted by this box here of (x, y) pairs, running from $(x^{(1)}, y^{(1)})$ down to my 400th example $(x^{(m)}, y^{(m)})$. So that's my training set with 400 training examples. In the MapReduce idea, one way to proceed is to split this training set into different subsets. I'm going to assume for this example that I have 4 computers, or 4 machines, to run in parallel on my training set, which is why I'm splitting the training set into 4 pieces. If you have 10 machines or 100 machines, you would split your training set into 10 pieces or 100 pieces or what have you. What the first of my 4 machines does, say, is use just the first quarter of my training set, so just the first 100 training examples.
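Written out, the batch gradient descent learning rule being referred to here is, for each parameter $\theta_j$ (with $h_\theta$ the hypothesis, for example $h_\theta(x) = \theta^T x$ for linear regression):

$$\theta_j := \theta_j - \alpha \, \frac{1}{400} \sum_{i=1}^{400} \big( h_\theta(x^{(i)}) - y^{(i)} \big)\, x_j^{(i)}$$

With 400 examples that sum is cheap, but with 400 million examples this single summation is what makes every gradient step expensive.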
In particular, what it's going to do is look at this summation and compute it for just the first 100 training examples. So let me write that up: the first machine computes a variable

$$\text{temp}^{(1)}_j = \sum_{i=1}^{100} \big( h_\theta(x^{(i)}) - y^{(i)} \big)\, x_j^{(i)},$$

where I've plugged in exactly that term from the gradient descent rule up there. Similarly, I'm going to take the second quarter of my data and send it to my second machine; my second machine will use training examples 101 through 200 and compute a similar variable $\text{temp}^{(2)}_j$, which is the same sum taken over examples 101 through 200. And similarly, machines 3 and 4 will use the third quarter and the fourth quarter of my training set. So now each machine has to sum over 100 examples instead of 400, has to do only a quarter of the work, and thus can presumably do it about four times as fast.

Finally, after all these machines have done this work, I'm going to take these temp variables and put them back together. I send them all to a centralized master server, and what the master does is combine these results. In particular, it updates my parameters according to

$$\theta_j := \theta_j - \alpha \, \frac{1}{400} \big( \text{temp}^{(1)}_j + \text{temp}^{(2)}_j + \text{temp}^{(3)}_j + \text{temp}^{(4)}_j \big),$$

and of course we have to do this separately for each j, from j equals 0 up to n, the number of features.
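To make that concrete, here is a minimal single-process Python sketch of this split-and-combine gradient step for linear regression, with the four machines simulated by slicing the training set into quarters. The names partial_gradient, mapreduce_gradient_step, and temps are purely illustrative and not from any particular MapReduce library:

    import numpy as np

    def partial_gradient(theta, X_chunk, y_chunk):
        # Map step: one "machine" computes its share of the gradient sum,
        # i.e. the sum over its examples of (h_theta(x) - y) * x.
        errors = X_chunk @ theta - y_chunk     # h_theta(x) - y for linear regression
        return X_chunk.T @ errors              # vector of temp_j values for this chunk

    def mapreduce_gradient_step(theta, X, y, alpha, num_machines=4):
        # One batch gradient descent step, with the summation divided across
        # num_machines equal chunks and recombined by a "master".
        m = len(y)
        X_chunks = np.array_split(X, num_machines)
        y_chunks = np.array_split(y, num_machines)

        # Map: compute temp^(1), ..., temp^(4), one per chunk
        temps = [partial_gradient(theta, Xc, yc) for Xc, yc in zip(X_chunks, y_chunks)]

        # Reduce: the master adds the partial sums and applies the update
        return theta - alpha * (1.0 / m) * np.sum(temps, axis=0)

    # Tiny usage example with m = 400 synthetic examples
    rng = np.random.default_rng(0)
    X = np.c_[np.ones(400), rng.normal(size=(400, 2))]   # intercept column plus 2 features
    y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=400)
    theta = np.zeros(3)
    for _ in range(500):
        theta = mapreduce_gradient_step(theta, X, y, alpha=0.1)
    print(theta)   # should come out close to [1, 2, -3]

In a real cluster, each call to partial_gradient would run on a separate machine on its own quarter of the data, and only the small temp vectors would travel over the network to the master.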
Looking at this equation, I hope it's clear what it is doing: it is exactly the same as having a centralized master server take the results, $\text{temp}^{(1)}_j$, $\text{temp}^{(2)}_j$, $\text{temp}^{(3)}_j$, and $\text{temp}^{(4)}_j$, and add them up. The sum of those four things is just the sum over the first quarter, plus the sum over the second quarter, plus the third, plus the fourth, and those four pieces add up to exactly the sum we were originally computing in batch gradient descent. Then we have the alpha times 1 over 400 out front, and so this is exactly equivalent to the batch gradient descent algorithm; only, instead of needing to sum over all four hundred training examples on just one machine, we can divide up the workload onto four machines.

So here's what the general picture of the MapReduce technique looks like. We have some training set, and if we want to parallelize across four machines, we take the training set and split it as evenly as we can into four subsets. Then we take the four subsets of the training data and send them to four different computers. Each of the four computers computes a summation over just one quarter of the training set, and finally each computer sends its result to a centralized server, which combines the results together. So on the previous slide, in that example, the bulk of the work in gradient descent was computing the sum from i equals 1 to 400 of something; more generally, the sum from i equals 1 to m of that formula for gradient descent.
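In equation form, the equivalence is simply

$$\sum_{i=1}^{400} (\cdots) = \sum_{i=1}^{100} (\cdots) + \sum_{i=101}^{200} (\cdots) + \sum_{i=201}^{300} (\cdots) + \sum_{i=301}^{400} (\cdots) = \text{temp}^{(1)}_j + \text{temp}^{(2)}_j + \text{temp}^{(3)}_j + \text{temp}^{(4)}_j,$$

where $(\cdots)$ stands for the same summand $\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x_j^{(i)}$ in each sum; multiplying by $\alpha/400$ and subtracting from $\theta_j$ gives exactly the batch gradient descent update, just computed in four pieces.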
Now, because each of the four computers can do just a quarter of the work, you can potentially get up to a 4x speedup. In particular, if there were no network latencies and no cost to the network communication of sending the data back and forth, you could potentially get up to a 4x speedup. Of course, in practice, because of network latencies, the overhead of combining the results afterwards, and other factors, you get slightly less than a 4x speedup. But nonetheless, this sort of MapReduce approach does offer us a way to process much larger data sets than is possible using a single computer.

If you are thinking of applying MapReduce to some learning algorithm in order to speed it up by parallelizing the computation over different computers, the key question to ask yourself is: can your learning algorithm be expressed as a summation over the training set? It turns out that many learning algorithms can be expressed as computing sums of functions over the training set, and the computational expense of running them on large data sets comes from needing to sum over a very large training set. So whenever your learning algorithm can be expressed as a sum over the training set, and whenever the bulk of the work of the learning algorithm is in that sum, MapReduce might be a good candidate for scaling your learning algorithm to very, very large data sets.

Let's look at one more example. Let's say that we want to use one of the advanced optimization algorithms, things like L-BFGS, conjugate gradient, and so on, and let's say we want to train logistic regression with it. For that, we need to compute two main quantities.
One is that, for the advanced optimization algorithms like L-BFGS and conjugate gradient, we need to provide a routine to compute the cost function, the optimization objective. For logistic regression, you remember that the cost function has this sort of sum over the training set, and so if you're parallelizing over ten machines, you would split the training set onto the ten machines and have each of them compute this sum over just one tenth of the training data. The other thing the advanced optimization algorithms need is a routine to compute the partial derivative terms. Once again, these derivative terms for logistic regression can be expressed as a sum over the training set, so, similar to our earlier example, you would have each machine compute that summation over just some small fraction of your training data.

Finally, having computed all of these things, the machines send their results to a centralized server, which adds up the partial sums. This corresponds to adding up those $\text{temp}^{(i)}$ or $\text{temp}^{(i)}_j$ variables, which were computed locally on machine number i, so that the centralized server gets the overall cost function and the overall partial derivatives, which you can then pass to the advanced optimization algorithm.

So, more broadly, by taking other learning algorithms and expressing them in this summation form, in terms of computing sums of functions over the training set, you can use the MapReduce technique to parallelize other learning algorithms as well, and scale them to very large training sets.
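Concretely, the two quantities being described for (unregularized) logistic regression are the familiar sums over the training set:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big]$$

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)\, x_j^{(i)}$$

With ten machines, machine k computes each of these sums over only its tenth of the examples, and the centralized server adds the ten partial cost sums and the ten partial derivative sums before handing $J(\theta)$ and its gradient to L-BFGS or conjugate gradient.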
Finally, as one last comment, so far we have been discussing MapReduce as a way to parallelize over multiple computers, maybe multiple computers in a computer cluster or in a data center. It turns out that sometimes, even if you have just a single computer, MapReduce can also be applicable. In particular, many computers now have multiple processing cores: you can have multiple CPUs, and within each CPU, multiple processor cores. If you have a large training set and, say, a computer with 4 computing cores, then even on a single computer you can split the training set into pieces, send the pieces to different cores within a single box, like a single desktop computer or a single server, and use MapReduce this way to divvy up the workload. Each of the cores can then carry out the sum over, say, one quarter of your training set, and the partial sums can then be combined to get the summation over the entire training set.

The advantage of thinking about MapReduce this way, as parallelizing over cores within a single machine rather than over multiple machines, is that you don't have to worry about network latency, because all the communication, all the sending of data back and forth, happens within a single machine. So network latency becomes much less of an issue than if you were doing this over different computers within a data center.
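As one possible illustration of the multi-core version, here is a sketch using Python's standard multiprocessing module to spread the partial sums over four worker processes on a single box; the pool of workers stands in for the four cores, and the function and variable names are again just illustrative:

    import numpy as np
    from multiprocessing import Pool

    def partial_gradient(args):
        # Map step, run in one worker process: this core's share of the
        # sum of (h_theta(x) - y) * x over its slice of the data.
        theta, X_chunk, y_chunk = args
        return X_chunk.T @ (X_chunk @ theta - y_chunk)

    def multicore_gradient_step(pool, theta, X, y, alpha, num_cores=4):
        # One batch gradient descent step with the summation split across cores.
        m = len(y)
        chunks = [(theta, Xc, yc)
                  for Xc, yc in zip(np.array_split(X, num_cores),
                                    np.array_split(y, num_cores))]
        temps = pool.map(partial_gradient, chunks)    # partial sums, one per core
        return theta - alpha * (1.0 / m) * np.sum(temps, axis=0)

    if __name__ == "__main__":   # guard required by multiprocessing on some platforms
        rng = np.random.default_rng(0)
        X = np.c_[np.ones(400), rng.normal(size=(400, 2))]
        y = X @ np.array([1.0, 2.0, -3.0])
        theta = np.zeros(3)
        with Pool(processes=4) as pool:
            for _ in range(200):
                theta = multicore_gradient_step(pool, theta, X, y, alpha=0.1)
        print(theta)   # should come out close to [1, 2, -3]

Here all the data and partial results move between processes on the same machine, so, as noted above, there is no network latency involved; a real implementation would also want to avoid re-sending the data chunks to the workers on every step.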
Finally, one last caveat on parallelizing within a multi-core machine. Depending on the details of your implementation, if you have a multi-core machine and certain numerical linear algebra libraries, it turns out that some numerical linear algebra libraries can automatically parallelize their linear algebra operations across multiple cores within the machine. If you're fortunate enough to be using one of those libraries, and this certainly does not apply to every library, and if you have a good vectorized implementation of the learning algorithm, then sometimes you can just implement your standard learning algorithm in a vectorized fashion, not worry about parallelization, and let the numerical linear algebra library take care of some of it for you, so you don't need to implement the parallelization yourself. But for other problems, taking advantage of this kind of MapReduce implementation, finding the MapReduce formulation and parallelizing across cores yourself, might be a good idea as well and could let you speed up your learning algorithm.

In this video, we talked about the MapReduce approach to parallelizing machine learning by taking the data and spreading it across many computers in a data center, although these ideas also apply to parallelizing across multiple cores within a single computer. Today there are some good open-source implementations of MapReduce; many people use an open-source system called Hadoop. Using either your own implementation or someone else's open-source implementation, you can use these ideas to parallelize learning algorithms and get them to run on much larger data sets than is possible using just a single machine.
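Coming back to the vectorization caveat above, here is what that looks like in a minimal sketch: the whole gradient sum becomes a couple of matrix operations, and whether those operations are spread across your cores depends on the numerical linear algebra library your installation is linked against (an assumption about your setup, not a guarantee):

    import numpy as np

    def gradient_step_vectorized(theta, X, y, alpha):
        # Entire batch gradient as one vectorized expression; the matrix
        # products may be parallelized across cores by the underlying
        # linear algebra library, depending on how it was built.
        m = len(y)
        return theta - (alpha / m) * (X.T @ (X @ theta - y))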