So, it's taken us a lot of videos to get through the neural network learning algorithm. In this video, what I'd like to do is put all the pieces together, to give an overall summary, or a bigger-picture view, of how all the pieces fit together and of the overall process of how to implement a neural network learning algorithm.

When training a neural network, the first thing you need to do is pick some network architecture, and by architecture I just mean the connectivity pattern between the neurons. So we might choose between, say, a neural network with three input units, five hidden units, and four output units, versus one with three input units, two hidden layers of five units each, and four output units, versus one with three input units, three hidden layers of five units each, and four output units. These choices of how many hidden units in each layer, and how many hidden layers, are the architecture choices.

So, how do you make these choices? Well, first, the number of input units is pretty well defined: once you decide on the set of features x, the number of input units will just be the dimension of your features x(i). And if you are doing multiclass classification, the number of output units will be determined by the number of classes in your classification problem. Just a reminder: if you have a multiclass classification problem where y takes on, say, values between 1 and 10, so that you have ten possible classes, then remember to rewrite your output y as a vector. So instead of class one, you recode it as a vector with a one in the first position, and for the second class you recode it as a vector with a one in the second position. And if one of these examples takes on the fifth class, so y equals 5, then what you show your neural network is not actually the value y equals 5; instead, at the output layer, which would have ten output units, you feed it the vector with a one in the fifth position and a bunch of zeros everywhere else.
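Just to make that recoding concrete, here's a minimal sketch of it in Octave; the names Y and num_labels are mine, introduced only for illustration, and I'm assuming the labels y are stored as a column vector of class indices.

```octave
% One-hot recoding of the labels: each y(i) in 1..num_labels becomes a row
% vector with a 1 in position y(i) and zeros elsewhere (illustrative names).
num_labels = 10;               % ten possible classes
m = length(y);                 % number of training examples
Y = zeros(m, num_labels);
for i = 1:m
  Y(i, y(i)) = 1;              % e.g. y(i) = 5 becomes [0 0 0 0 1 0 0 0 0 0]
end
```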
So the choice of the number of input units and the number of output units is reasonably straightforward. As for the number of hidden units and the number of hidden layers, a reasonable default is to use a single hidden layer, and so this type of neural network shown on the left, with just one hidden layer, is probably the most common. Or, if you use more than one hidden layer, again a reasonable default is to have the same number of hidden units in every single layer. So here we have two hidden layers, and each of these hidden layers has the same number, five, of hidden units; and here we have three hidden layers, and each of them again has five hidden units. But, as I said, the network architecture on the left, with a single hidden layer, would be a perfectly reasonable default. As for the number of hidden units: usually, the more hidden units the better. It's just that if you have a lot of hidden units, it can become more computationally expensive; but very often, having more hidden units is a good thing. And usually the number of hidden units in each layer will be comparable to the dimension of x, comparable to the number of features: it could be anywhere from the same as the number of input features to maybe three or four times that. So having the number of hidden units comparable to, or somewhat bigger than, the number of input features is often a useful thing to do.
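As a minimal sketch of these architecture choices, in the same Octave notation, the layer sizes might be set up like this; the specific values are illustrative assumptions rather than fixed rules.

```octave
% Architecture choices for a one-hidden-layer network (illustrative values),
% assuming a design matrix X with one row per example and one column per feature.
input_layer_size  = size(X, 2);   % number of input units = number of features
hidden_layer_size = 25;           % e.g. comparable to, or a few times, the input dimension
% num_labels, the number of output units, was set above from the number of classes.
```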
So, hopefully this gives you one reasonable set of default choices for neural network architecture, and if you follow these guidelines, you will probably get something that works well. But in a later set of videos, where I will talk specifically about advice for how to apply these algorithms, I will actually say a lot more about how to choose a neural network architecture; I actually have quite a lot I want to say later about making good choices for the number of hidden units, the number of hidden layers, and so on.

Next, here's what we need to implement in order to train a neural network. There are actually six steps; I have four on this slide and two more steps on the next slide. The first step is to set up the neural network and to randomly initialize the values of the weights, and we usually initialize the weights to small values near zero. Then we implement forward propagation, so that we can input any x to the neural network and compute h(x), which is the output vector of the y values. We then also implement code to compute the cost function J(Theta).
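To illustrate those first two steps concretely, here's a minimal sketch in Octave for the one-hidden-layer network set up above; the initialization constant is an illustrative assumption.

```octave
% Step 1: randomly initialize the weights to small values near zero, so the
% hidden units don't all start out computing the same function.
epsilon_init = 0.12;                 % illustrative size of the initial weights
Theta1 = rand(hidden_layer_size, input_layer_size + 1) * 2 * epsilon_init - epsilon_init;
Theta2 = rand(num_labels, hidden_layer_size + 1) * 2 * epsilon_init - epsilon_init;

% Step 2: forward propagation, computing h(x) for every training example at once.
sigmoid = @(z) 1 ./ (1 + exp(-z));
a1 = [ones(m, 1), X];                % add the bias unit to the inputs
a2 = [ones(m, 1), sigmoid(a1 * Theta1')];
h  = sigmoid(a2 * Theta2');          % m x num_labels matrix of output vectors
```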
And next we implement back-prop, the back-propagation algorithm, to compute those partial derivative terms: the partial derivatives of J(Theta) with respect to the parameters. Concretely, to implement back-prop, usually we will do that with a for-loop over the training examples.

Some of you may have heard of advanced, and frankly very advanced, vectorization methods where you don't have a for-loop over the m training examples, but the first time you implement back-prop there should almost certainly be a for-loop in your code, where you iterate over the examples: you do forward prop and back prop on the first example (x(1), y(1)), then in the second iteration of the for-loop you do forward propagation and back propagation on the second example, and so on, until you get through the final example. So there should be a for-loop in your implementation of back-prop, at least the first time you implement it. There are, frankly, somewhat complicated ways to do this without a for-loop, but I definitely do not recommend trying that much more complicated version the first time you implement back-prop.

So concretely, we have a for-loop over the m training examples, and inside the for-loop we perform forward prop and back prop using just that one example. What that means is that we take x(i) and feed it to the input layer, perform forward prop and then back prop, and that will give us all of the activations and all of the delta terms for all of the layers and all of the units in the neural network. Then, still inside this for-loop (let me draw some curly braces just to show the scope of the for-loop; this is Octave code, of course, but really it's more like pseudo-code, and the for-loop encompasses all of this), we compute those capital-Delta accumulation terms using the formula we gave earlier: Delta(l) := Delta(l) + delta(l+1) * (a(l))^T.
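Here's a minimal sketch of that loop in Octave for the one-hidden-layer network, reusing Theta1, Theta2, sigmoid, and the one-hot matrix Y from the earlier sketches; it's meant to show the structure of the loop, not to be a definitive implementation.

```octave
% Accumulators for the partial derivative terms, one per weight matrix.
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for i = 1:m
  % Forward propagation on example i alone.
  a1 = [1, X(i, :)];                     % input activations plus bias unit
  a2 = [1, sigmoid(a1 * Theta1')];
  a3 = sigmoid(a2 * Theta2');            % the network's output h(x(i)), 1 x num_labels
  % Back propagation: output-layer error, then hidden-layer error.
  d3 = a3 - Y(i, :);
  d2 = (d3 * Theta2) .* a2 .* (1 - a2);  % using g'(z) = a .* (1 - a) for the sigmoid
  d2 = d2(2:end);                        % drop the error term for the bias unit
  % Accumulate: Delta(l) := Delta(l) + delta(l+1) * (a(l))^T.
  Delta1 = Delta1 + d2' * a1;
  Delta2 = Delta2 + d3' * a2;
end
% The partial derivative terms are then computed from Delta1 and Delta2
% outside the loop, as described next.
```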
And then finally, outside the for-loop, having computed these capital-Delta accumulation terms, we have some other code that allows us to compute the partial derivative terms. And these partial derivative terms have to take into account the regularization term lambda as well; those formulas were given in the earlier video. So, having done that, you now hopefully have code to compute these partial derivative terms.

Next is step five: what I do then is use gradient checking to compare the partial derivative terms computed using back propagation against the partial derivatives computed using numerical estimates of the derivatives. So, I do gradient checking to make sure that both of these give very similar values. Having done gradient checking reassures us that our implementation of back propagation is correct; it is then very important that we disable gradient checking, because the gradient checking code is computationally very slow.

And finally, step six: we use an optimization algorithm, such as gradient descent or one of the advanced optimization methods such as L-BFGS or conjugate gradient, as embodied in fminunc or other optimization routines. We use these together with back propagation; back propagation is the thing that computes the partial derivatives for us. And so, since we know how to compute the cost function and we know how to compute the partial derivatives using back propagation, we can use one of these optimization methods to try to minimize J(Theta) as a function of the parameters Theta.
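As a rough sketch of steps five and six, here's what the code might look like, assuming a hypothetical helper nnCostFunction that returns the cost and the back-propagation gradient for an unrolled parameter vector (that helper itself isn't shown), with lambda as the regularization parameter.

```octave
% Wrap the cost and gradient computation in a single function handle.
costFunc = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, ...
                               num_labels, X, y, lambda);
theta = [Theta1(:); Theta2(:)];        % unroll the weight matrices into one vector

% Step 5: gradient checking. Numerically estimate a few partial derivatives and
% compare them with the values produced by back propagation.
[J0, grad] = costFunc(theta);
eps_check = 1e-4;
for i = 1:10                           % spot-check a handful of entries
  perturb = zeros(size(theta));
  perturb(i) = eps_check;
  numgrad = (costFunc(theta + perturb) - costFunc(theta - perturb)) / (2 * eps_check);
  fprintf('backprop: %g   numerical: %g\n', grad(i), numgrad);
end
% Remember to disable this check before training; it is computationally very slow.

% Step 6: minimize J(Theta) with an advanced optimizer that uses the gradient.
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta_opt, J_opt] = fminunc(costFunc, theta, options);
```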
And by the way, for neural networks, this cost function J(Theta) is non-convex, so it can theoretically be susceptible to local minima, and in fact algorithms like gradient descent and the advanced optimization methods can, in theory, get stuck in local optima. But it turns out that in practice this is not usually a huge problem, and even though we can't guarantee that these algorithms will find a global optimum, usually algorithms like gradient descent will do a very good job minimizing this cost function J(Theta) and will get to a very good local minimum, even if they don't reach the global optimum.

Finally, gradient descent for a neural network might still seem a little bit magical, so let me show one more figure to try to give you some intuition about what gradient descent for a neural network is doing. This is actually similar to the figure I used earlier to explain gradient descent. So, we have some cost function, and we have a number of parameters in our neural network; here I've just written down two of the parameter values. In reality, of course, in a neural network we can have lots of parameters: Theta 1, Theta 2, and so on are all matrices, right? So we can have very high-dimensional parameters, but because of the limitations of what we can plot, I'm pretending that we have only two parameters in this neural network, although obviously we have a lot more in practice.

Now, this cost function J(Theta) measures how well the neural network fits the training data. So, if you take a point like this one, down here, that's a point where J(Theta) is pretty low, and so this corresponds to a particular setting of the parameters.
That is, there's a setting of the parameters Theta where, for most of the training examples, the output of my hypothesis is pretty close to y(i), and when that is true, the cost function is pretty low. Whereas, in contrast, if you were to take a value like that one, that point corresponds to a setting where, for many training examples, the output of the neural network is far from the actual value y(i) that was observed in the training set. So points like this correspond to where the hypothesis, where the neural network, is outputting values on the training set that are far from y(i); it's not fitting the training set well. Whereas points like this, with low values of the cost function, correspond to where J(Theta) is low, and therefore to where the neural network happens to be fitting the training set well, because that is what needs to be true in order for J(Theta) to be small.

So what gradient descent does is start from some random initial point, like that one over there, and repeatedly go downhill. What back propagation is doing is computing the direction of the gradient, and what gradient descent is doing is taking little steps downhill, until hopefully it gets to, in this case, a pretty good local optimum. So, when you implement back propagation and use gradient descent or one of the advanced optimization methods, this picture sort of explains what the algorithm is doing: it's trying to find a value of the parameters where the output values of the neural network closely match the values of the y(i)'s observed in your training set.
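If it helps to see those little steps downhill as code, here's a minimal sketch of the plain gradient descent version, using the costFunc handle from before; the learning rate and number of iterations are illustrative assumptions.

```octave
% Plain gradient descent on J(Theta): back propagation supplies the direction.
alpha = 0.1;                       % illustrative learning rate
num_iters = 1000;                  % illustrative number of steps
for iter = 1:num_iters
  [J, grad] = costFunc(theta);     % cost and gradient via back propagation
  theta = theta - alpha * grad;    % take a small step downhill
end
```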
So, hopefully this gives you a better sense of how the many different pieces of neural network learning fit together. In case, even after this video, you still feel like there are a lot of different pieces and it's not entirely clear what some of them do or how all of these pieces come together, that's actually okay. Neural network learning and back propagation is a complicated algorithm. And even though I've seen the math behind back propagation for many years, and I've used back propagation, I think very successfully, for many years, even today I still feel like I don't always have a great grasp of exactly what back propagation is doing, or of what the optimization process of minimizing J(Theta) looks like. This is a much harder algorithm to feel like I have a good handle on, compared to, say, linear regression or logistic regression, which were mathematically and conceptually much simpler and cleaner algorithms. So in case you feel the same way, that's actually perfectly okay. But if you do implement back propagation, hopefully what you'll find is that this is one of the most powerful learning algorithms; if you implement this algorithm, implementing back propagation and one of these optimization methods, you'll find that back propagation is able to fit very complex, powerful, non-linear functions to your data, and this is one of the most effective learning algorithms we have today.