In the last few videos, we talked about stochastic gradient descent and other variations of the stochastic gradient descent algorithm, including its adaptation to online learning. But all of those algorithms can be run on one machine, on one computer. Some machine learning problems are just too big to run on one machine: sometimes you have so much data that you just don't ever want to run all of it through a single computer, no matter what algorithm you would use on that computer.

So in this video, I'd like to talk about a different approach to large-scale machine learning, called the MapReduce approach. Even though we have quite a few videos on stochastic gradient descent and will spend relatively less time on MapReduce, don't judge the relative importance of MapReduce versus gradient descent based on the amount of time I spend on each idea. Many people will say that MapReduce is at least an equally important idea compared to gradient descent, and some would say an even more important one; it is just relatively simpler to explain, which is why I'm going to spend less time on it. Using these ideas, you might be able to scale learning algorithms to even far larger problems than is possible with stochastic gradient descent.

Here's the idea. Let's say we want to fit a linear regression model or a logistic regression model or some such, and let's start again with batch gradient descent, so that's our batch gradient descent learning rule. To keep the writing on this slide tractable, I'm going to assume throughout that we have m equals 400 examples.
Of course, by our standards of large-scale machine learning, m equals 400 is pretty small, and this technique is more commonly applied to problems where you have maybe closer to 400 million examples; but just to make the writing on the slide simpler, I'm going to pretend we have 400 examples. In that case, the batch gradient descent learning rule has this sum from i equals 1 through 400 over my 400 examples, and if m is large, then this is a computationally expensive step.

So here is what the MapReduce idea does. I should say that the MapReduce idea is due to two researchers, Jeff Dean and Sanjay Ghemawat. Jeff Dean, by the way, is one of the most legendary engineers in all of Silicon Valley, and he built a large fraction of the architectural infrastructure that all of Google runs on today.

Here's the MapReduce idea. Let's say I have some training set, denoted by this box here of (x, y) pairs, running from $(x^{(1)}, y^{(1)})$ down to my 400th example $(x^{(m)}, y^{(m)})$. So that's my training set with 400 training examples. In the MapReduce idea, one way to proceed is to split this training set into different subsets. I'm going to assume for this example that I have 4 computers, or 4 machines, to run in parallel on my training set, which is why I'm splitting the training set into 4 pieces. If you have 10 machines or 100 machines, you would split your training set into 10 pieces or 100 pieces or what have you. What the first of my 4 machines does, say, is use just the first quarter of my training set, so just the first 100 training examples.
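Written out, the batch gradient descent learning rule being referred to here is, for each parameter $\theta_j$ (with $h_\theta$ the hypothesis, for example $h_\theta(x) = \theta^T x$ for linear regression):

$$\theta_j := \theta_j - \alpha \, \frac{1}{400} \sum_{i=1}^{400} \big( h_\theta(x^{(i)}) - y^{(i)} \big)\, x_j^{(i)}$$

With 400 examples that sum is cheap, but with 400 million examples this single summation is what makes every gradient step expensive.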
In particular, what it's going to do is look at this summation and compute it for just the first 100 training examples. So let me write that up: the first machine computes a variable

$$\text{temp}^{(1)}_j = \sum_{i=1}^{100} \big( h_\theta(x^{(i)}) - y^{(i)} \big)\, x_j^{(i)},$$

where I've plugged in exactly that term from the gradient descent rule up there. Similarly, I'm going to take the second quarter of my data and send it to my second machine; my second machine will use training examples 101 through 200 and compute a similar variable $\text{temp}^{(2)}_j$, which is the same sum taken over examples 101 through 200. And similarly, machines 3 and 4 will use the third quarter and the fourth quarter of my training set. So now each machine has to sum over 100 examples instead of 400, has to do only a quarter of the work, and thus can presumably do it about four times as fast.

Finally, after all these machines have done this work, I'm going to take these temp variables and put them back together. I send them all to a centralized master server, and what the master does is combine these results. In particular, it updates my parameters according to

$$\theta_j := \theta_j - \alpha \, \frac{1}{400} \big( \text{temp}^{(1)}_j + \text{temp}^{(2)}_j + \text{temp}^{(3)}_j + \text{temp}^{(4)}_j \big),$$

and of course we have to do this separately for each j, from j equals 0 up to n, the number of features.
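To make that concrete, here is a minimal single-process Python sketch of this split-and-combine gradient step for linear regression, with the four machines simulated by slicing the training set into quarters. The names partial_gradient, mapreduce_gradient_step, and temps are purely illustrative and not from any particular MapReduce library:

    import numpy as np

    def partial_gradient(theta, X_chunk, y_chunk):
        # Map step: one "machine" computes its share of the gradient sum,
        # i.e. the sum over its examples of (h_theta(x) - y) * x.
        errors = X_chunk @ theta - y_chunk     # h_theta(x) - y for linear regression
        return X_chunk.T @ errors              # vector of temp_j values for this chunk

    def mapreduce_gradient_step(theta, X, y, alpha, num_machines=4):
        # One batch gradient descent step, with the summation divided across
        # num_machines equal chunks and recombined by a "master".
        m = len(y)
        X_chunks = np.array_split(X, num_machines)
        y_chunks = np.array_split(y, num_machines)

        # Map: compute temp^(1), ..., temp^(4), one per chunk
        temps = [partial_gradient(theta, Xc, yc) for Xc, yc in zip(X_chunks, y_chunks)]

        # Reduce: the master adds the partial sums and applies the update
        return theta - alpha * (1.0 / m) * np.sum(temps, axis=0)

    # Tiny usage example with m = 400 synthetic examples
    rng = np.random.default_rng(0)
    X = np.c_[np.ones(400), rng.normal(size=(400, 2))]   # intercept column plus 2 features
    y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=400)
    theta = np.zeros(3)
    for _ in range(500):
        theta = mapreduce_gradient_step(theta, X, y, alpha=0.1)
    print(theta)   # should come out close to [1, 2, -3]

In a real cluster, each call to partial_gradient would run on a separate machine on its own quarter of the data, and only the small temp vectors would travel over the network to the master.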
Looking at this equation, I hope it's clear what it is doing: it is exactly the same as having a centralized master server take the results, $\text{temp}^{(1)}_j$, $\text{temp}^{(2)}_j$, $\text{temp}^{(3)}_j$, and $\text{temp}^{(4)}_j$, and add them up. The sum of those four things is just the sum over the first quarter, plus the sum over the second quarter, plus the third, plus the fourth, and those four pieces add up to exactly the sum we were originally computing in batch gradient descent. Then we have the alpha times 1 over 400 out front, and so this is exactly equivalent to the batch gradient descent algorithm; only, instead of needing to sum over all four hundred training examples on just one machine, we can divide up the workload onto four machines.

So here's what the general picture of the MapReduce technique looks like. We have some training set, and if we want to parallelize across four machines, we take the training set and split it as evenly as we can into four subsets. Then we take the four subsets of the training data and send them to four different computers. Each of the four computers computes a summation over just one quarter of the training set, and finally each computer sends its result to a centralized server, which combines the results together. So on the previous slide, in that example, the bulk of the work in gradient descent was computing the sum from i equals 1 to 400 of something; more generally, the sum from i equals 1 to m of that formula for gradient descent.
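In equation form, the equivalence is simply

$$\sum_{i=1}^{400} (\cdots) = \sum_{i=1}^{100} (\cdots) + \sum_{i=101}^{200} (\cdots) + \sum_{i=201}^{300} (\cdots) + \sum_{i=301}^{400} (\cdots) = \text{temp}^{(1)}_j + \text{temp}^{(2)}_j + \text{temp}^{(3)}_j + \text{temp}^{(4)}_j,$$

where $(\cdots)$ stands for the same summand $\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x_j^{(i)}$ in each sum; multiplying by $\alpha/400$ and subtracting from $\theta_j$ gives exactly the batch gradient descent update, just computed in four pieces.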
Now, because each of the four computers can do just a quarter of the work, you can potentially get up to a 4x speedup. In particular, if there were no network latencies and no cost to the network communication of sending the data back and forth, you could potentially get up to a 4x speedup. Of course, in practice, because of network latencies, the overhead of combining the results afterwards, and other factors, you get slightly less than a 4x speedup. But nonetheless, this sort of MapReduce approach does offer us a way to process much larger data sets than is possible using a single computer.

If you are thinking of applying MapReduce to some learning algorithm in order to speed it up by parallelizing the computation over different computers, the key question to ask yourself is: can your learning algorithm be expressed as a summation over the training set? It turns out that many learning algorithms can be expressed as computing sums of functions over the training set, and the computational expense of running them on large data sets comes from needing to sum over a very large training set. So whenever your learning algorithm can be expressed as a sum over the training set, and whenever the bulk of the work of the learning algorithm is in that sum, MapReduce might be a good candidate for scaling your learning algorithm to very, very large data sets.

Let's look at one more example. Let's say that we want to use one of the advanced optimization algorithms, things like L-BFGS, conjugate gradient, and so on, and let's say we want to train logistic regression with it. For that, we need to compute two main quantities.
One is that, for the advanced optimization algorithms like L-BFGS and conjugate gradient, we need to provide a routine to compute the cost function, the optimization objective. For logistic regression, you remember that the cost function has this sort of sum over the training set, and so if you're parallelizing over ten machines, you would split the training set onto the ten machines and have each of them compute this sum over just one tenth of the training data. The other thing the advanced optimization algorithms need is a routine to compute the partial derivative terms. Once again, these derivative terms for logistic regression can be expressed as a sum over the training set, so, similar to our earlier example, you would have each machine compute that summation over just some small fraction of your training data.

Finally, having computed all of these things, the machines send their results to a centralized server, which adds up the partial sums. This corresponds to adding up those $\text{temp}^{(i)}$ or $\text{temp}^{(i)}_j$ variables, which were computed locally on machine number i, so that the centralized server gets the overall cost function and the overall partial derivatives, which you can then pass to the advanced optimization algorithm.

So, more broadly, by taking other learning algorithms and expressing them in this summation form, in terms of computing sums of functions over the training set, you can use the MapReduce technique to parallelize other learning algorithms as well, and scale them to very large training sets.
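Concretely, the two quantities being described for (unregularized) logistic regression are the familiar sums over the training set:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big]$$

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \big( h_\theta(x^{(i)}) - y^{(i)} \big)\, x_j^{(i)}$$

With ten machines, machine k computes each of these sums over only its tenth of the examples, and the centralized server adds the ten partial cost sums and the ten partial derivative sums before handing $J(\theta)$ and its gradient to L-BFGS or conjugate gradient.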
Finally, as one last comment, so far we have been discussing MapReduce as a way to parallelize over multiple computers, maybe multiple computers in a computer cluster or in a data center. It turns out that sometimes, even if you have just a single computer, MapReduce can also be applicable. In particular, many computers now have multiple processing cores: you can have multiple CPUs, and within each CPU, multiple processor cores. If you have a large training set and, say, a computer with 4 computing cores, then even on a single computer you can split the training set into pieces, send the pieces to different cores within a single box, like a single desktop computer or a single server, and use MapReduce this way to divvy up the workload. Each of the cores can then carry out the sum over, say, one quarter of your training set, and the partial sums can then be combined to get the summation over the entire training set.

The advantage of thinking about MapReduce this way, as parallelizing over cores within a single machine rather than over multiple machines, is that you don't have to worry about network latency, because all the communication, all the sending of data back and forth, happens within a single machine. So network latency becomes much less of an issue than if you were doing this over different computers within a data center.
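As one possible illustration of the multi-core version, here is a sketch using Python's standard multiprocessing module to spread the partial sums over four worker processes on a single box; the pool of workers stands in for the four cores, and the function and variable names are again just illustrative:

    import numpy as np
    from multiprocessing import Pool

    def partial_gradient(args):
        # Map step, run in one worker process: this core's share of the
        # sum of (h_theta(x) - y) * x over its slice of the data.
        theta, X_chunk, y_chunk = args
        return X_chunk.T @ (X_chunk @ theta - y_chunk)

    def multicore_gradient_step(pool, theta, X, y, alpha, num_cores=4):
        # One batch gradient descent step with the summation split across cores.
        m = len(y)
        chunks = [(theta, Xc, yc)
                  for Xc, yc in zip(np.array_split(X, num_cores),
                                    np.array_split(y, num_cores))]
        temps = pool.map(partial_gradient, chunks)    # partial sums, one per core
        return theta - alpha * (1.0 / m) * np.sum(temps, axis=0)

    if __name__ == "__main__":   # guard required by multiprocessing on some platforms
        rng = np.random.default_rng(0)
        X = np.c_[np.ones(400), rng.normal(size=(400, 2))]
        y = X @ np.array([1.0, 2.0, -3.0])
        theta = np.zeros(3)
        with Pool(processes=4) as pool:
            for _ in range(200):
                theta = multicore_gradient_step(pool, theta, X, y, alpha=0.1)
        print(theta)   # should come out close to [1, 2, -3]

Here all the data and partial results move between processes on the same machine, so, as noted above, there is no network latency involved; a real implementation would also want to avoid re-sending the data chunks to the workers on every step.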
Finally, one last caveat on parallelizing within a multi-core machine. Depending on the details of your implementation, if you have a multi-core machine and certain numerical linear algebra libraries, it turns out that some numerical linear algebra libraries can automatically parallelize their linear algebra operations across multiple cores within the machine. If you're fortunate enough to be using one of those libraries, and this certainly does not apply to every library, and if you have a good vectorized implementation of the learning algorithm, then sometimes you can just implement your standard learning algorithm in a vectorized fashion, not worry about parallelization, and let the numerical linear algebra library take care of some of it for you, so you don't need to implement the parallelization yourself. But for other problems, taking advantage of this kind of MapReduce implementation, finding the MapReduce formulation and parallelizing across cores yourself, might be a good idea as well and could let you speed up your learning algorithm.

In this video, we talked about the MapReduce approach to parallelizing machine learning by taking the data and spreading it across many computers in a data center, although these ideas also apply to parallelizing across multiple cores within a single computer. Today there are some good open-source implementations of MapReduce; many people use an open-source system called Hadoop. Using either your own implementation or someone else's open-source implementation, you can use these ideas to parallelize learning algorithms and get them to run on much larger data sets than is possible using just a single machine.
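Coming back to the vectorization caveat above, here is what that looks like in a minimal sketch: the whole gradient sum becomes a couple of matrix operations, and whether those operations are spread across your cores depends on the numerical linear algebra library your installation is linked against (an assumption about your setup, not a guarantee):

    import numpy as np

    def gradient_step_vectorized(theta, X, y, alpha):
        # Entire batch gradient as one vectorized expression; the matrix
        # products may be parallelized across cores by the underlying
        # linear algebra library, depending on how it was built.
        m = len(y)
        return theta - (alpha / m) * (X.T @ (X @ theta - y))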