In the last video, we talked about gradient descent for minimizing the cost function J of theta for logistic regression. In this video, I'd like to tell you about some advanced optimization algorithms and some advanced optimization concepts. Using some of these ideas, we'll be able to get logistic regression to run much more quickly than is possible with gradient descent, and this will also let the algorithms scale much better to very large machine learning problems, such as problems with a very large number of features.

Here's an alternative view of what gradient descent is doing. We have some cost function J, and we want to minimize it. So what we need to do is write code that takes as input the parameters theta and computes two things: J of theta and the partial derivative terms for j equals 0, 1, up to n. Given code that can do these two things, what gradient descent does is repeatedly perform the following update: given the code we wrote to compute these partial derivatives, gradient descent plugs them in here and uses them to update our parameters theta.

So another way of thinking about gradient descent is that we need to supply code to compute J of theta and these derivatives, and then these get plugged into gradient descent, which can then try to minimize the function for us. For gradient descent, I guess technically you don't actually need code to compute the cost function J of theta; you only need code to compute the derivative terms. But if you think of your code as also monitoring convergence or some such, we'll just think of ourselves as providing code to compute both the cost function and the derivative terms.
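To make this view concrete, here is a minimal sketch of gradient descent in Octave under that interface, assuming we have already written a costFunction(theta) that returns the pair [jVal, gradient]; the names gradientDescent, alpha, and numIters are placeholders for this illustration, not code from the lecture.

    function theta = gradientDescent(costFunction, theta, alpha, numIters)
      % Repeatedly apply the gradient descent update using the two pieces of
      % code we supply: costFunction returns [jVal, gradient].
      for iter = 1:numIters
        [jVal, gradient] = costFunction(theta);
        theta = theta - alpha * gradient;   % simultaneous update of every theta_j
        % jVal can be printed or stored here to monitor convergence.
      end
    end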
So, having written code to compute these two things, one algorithm we can use is gradient descent. But gradient descent isn't the only algorithm we can use. There are other algorithms, more advanced and more sophisticated ones, that, if we only provide them a way to compute these two things, give us different approaches to optimizing the cost function for us. Conjugate gradient, BFGS, and L-BFGS are examples of more sophisticated optimization algorithms that need a way to compute J of theta and a way to compute the derivatives, and can then use more sophisticated strategies than gradient descent to minimize the cost function.

The details of exactly what these three algorithms do are well beyond the scope of this course; in fact, you can easily end up spending many days, or a small number of weeks, studying these algorithms if you take a class in advanced numerical computing. But let me just tell you about some of their properties.

These three algorithms have a number of advantages. One is that, with any of these algorithms, you usually do not need to manually pick the learning rate alpha. One way to think of these algorithms is that, given a way to compute the derivatives and the cost function, they have a clever inner loop called a line search algorithm that automatically tries out different values for the learning rate alpha and automatically picks a good one, so that it can even pick a different learning rate for every iteration. And so you don't need to choose it yourself.
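To give a flavor of what such an inner loop might look like, here is a minimal sketch of one simple line search strategy, backtracking (Armijo) line search, written in Octave. The inner loops of conjugate gradient and (L-)BFGS are more sophisticated than this, and the function and variable names here are illustrative placeholders; for plain gradient descent, the search direction would just be the negative gradient.

    function alpha = backtrackingLineSearch(costFunction, theta, direction)
      % Start from a large candidate step size and shrink it until the step
      % gives a sufficient decrease in the cost (the Armijo condition).
      alpha = 1;      % initial candidate learning rate / step size
      beta  = 0.5;    % shrink factor
      c     = 1e-4;   % sufficient-decrease constant
      [j0, grad] = costFunction(theta);
      while costFunction(theta + alpha * direction) > j0 + c * alpha * (grad' * direction)
        alpha = beta * alpha;   % step was too large: shrink and try again
      end
    end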
These algorithms actually do more sophisticated things than just pick a good learning rate, and so they often end up converging much faster than gradient descent, but a detailed discussion of exactly what they do is beyond the scope of this course. In fact, I've used these algorithms for a long time, maybe over a decade, quite frequently, and it was only a few years ago that I actually figured out for myself the details of what conjugate gradient, BFGS, and L-BFGS do. So it is actually entirely possible to use these algorithms successfully and apply them to lots of different learning problems without actually understanding the inner loop of what these algorithms do.

If these algorithms have a disadvantage, I'd say that the main disadvantage is that they're quite a lot more complex than gradient descent. And in particular, you probably should not implement these algorithms - conjugate gradient, L-BFGS, BFGS - yourself unless you're an expert in numerical computing. Instead, just as I wouldn't recommend that you write your own code to compute square roots of numbers or to compute inverses of matrices, for these algorithms what I would recommend is that you just use a software library. So, to take a square root, what all of us do is use some function that someone else has written to compute the square roots of our numbers. And fortunately, Octave and the closely related language MATLAB - we'll be using Octave - have a pretty reasonable library implementing some of these advanced optimization algorithms. And so if you just use the built-in library, you get pretty good results. I should say that there is a difference between good and bad implementations of these algorithms.
And so, if you're using a different language for your machine learning application - if you're using C, C++, Java, and so on - you might want to try out a couple of different libraries to make sure that you find a good library for implementing these algorithms, because there is a difference in performance between a good implementation of conjugate gradient or L-BFGS and a less good implementation of conjugate gradient or L-BFGS.

So now let me explain how to use these algorithms; I'm going to do so with an example. Let's say that you have a problem with two parameters, theta equals theta 1 and theta 2, and let's say your cost function is J of theta equals theta 1 minus five, squared, plus theta 2 minus five, squared. So with this cost function, if you want to minimize J of theta as a function of theta, the value that minimizes it is going to be theta 1 equals 5, theta 2 equals 5. Now, again, I know some of you know more calculus than others, but the derivatives of the cost function J turn out to be these two expressions: the partial derivative with respect to theta 1 is 2 times (theta 1 minus 5), and the partial derivative with respect to theta 2 is 2 times (theta 2 minus 5). I've done the calculus.

So if you want to apply one of the advanced optimization algorithms to minimize the cost function J - say we didn't know the minimum was at 5, 5, and we wanted to find that minimum numerically, using something like gradient descent but preferably something more advanced than gradient descent - what you would do is implement an Octave function like this. We implement a cost function, costFunction(theta), and what this does is return two arguments: the first, jVal, is how we would compute the cost function J.
And so this says jVal equals theta 1 minus five, squared, plus theta 2 minus five, squared - it's just computing the cost function over here. And the second argument that this function returns is the gradient. So the gradient is going to be a two by one vector, and the two elements of the gradient vector correspond to the two partial derivative terms over here.

Having implemented this cost function, you can then call the advanced optimization function called fminunc - it stands for "function minimization unconstrained" in Octave - and the way you call this is as follows. You set a few options; this is an options data structure that stores the options you want. So 'GradObj', 'on' sets the gradient objective parameter to on; it just means you are indeed going to provide a gradient to this algorithm. I'm going to set the maximum number of iterations to, let's say, one hundred. We then give it an initial guess for theta, which is a 2 by 1 vector. And then this command calls fminunc. The at symbol here represents a pointer to the cost function that we just defined up there. And if you call this, it will use one of the more advanced optimization algorithms - you can think of it as just like gradient descent, but automatically choosing the learning rate alpha for you, so you don't have to do so yourself. It will then use one of these advanced optimization algorithms, like gradient descent on steroids, to try to find the optimal value of theta for you.

Let me actually show you what this looks like in Octave. So I've written this costFunction(theta) function exactly as we had it on the previous slide. It computes jVal, which is the cost function.
And it computes the gradient, with the two elements being the partial derivatives of the cost function with respect to the two parameters, theta 1 and theta 2.

Now let's switch to my Octave window. I'm going to type in those commands I had just now. So, options equals optimset - this is the notation for setting the options for my optimization algorithm - 'GradObj', 'on', 'MaxIter', 100, so that says 100 iterations, and I am going to provide the gradient to my algorithm. Let's say initialTheta equals zeros(2, 1), so that's my initial guess for theta. And now I can type optTheta, functionVal, exitFlag equals fminunc, with a pointer to the cost function, my initial guess, and the options, like so. And if I hit enter, this will run the optimization algorithm, and it returns pretty quickly.

This funny formatting is just because my command line had wrapped around. But what this says is that, numerically, this algorithm - think of it as gradient descent on steroids - found that the optimal value of theta is theta 1 equals 5, theta 2 equals 5, exactly as we were hoping for. The function value at the optimum is essentially 10 to the minus 30, so that's essentially zero, which is also what we were hoping for. And the exit flag is 1; this shows the convergence status of the algorithm, and if you want, you can do help fminunc to read the documentation for how to interpret the exit flag. But the exit flag lets you verify whether or not this algorithm has converged. So that's how you run these algorithms in Octave.
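Putting the pieces of this demo together, here is a sketch of the cost function file and the commands typed at the Octave prompt, with the output values as reported above; the variable names (optTheta, functionVal, exitFlag) follow the ones used in the demo.

    function [jVal, gradient] = costFunction(theta)
      % J(theta) = (theta_1 - 5)^2 + (theta_2 - 5)^2
      jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
      gradient = zeros(2, 1);
      gradient(1) = 2 * (theta(1) - 5);   % partial derivative with respect to theta_1
      gradient(2) = 2 * (theta(2) - 5);   % partial derivative with respect to theta_2
    end

Then, at the Octave prompt:

    options = optimset('GradObj', 'on', 'MaxIter', 100);   % we will supply the gradient; at most 100 iterations
    initialTheta = zeros(2, 1);                            % initial guess for theta
    [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
    % optTheta comes back as approximately [5; 5], functionVal is essentially zero
    % (about 10^-30), and exitFlag equal to 1 indicates convergence.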
I should mention, by the way, that for the Octave implementation, this value of theta, your parameter vector theta, must be in R^d for d greater than or equal to 2. So if theta is just a real number - if it is not at least a two-dimensional vector, or some higher-dimensional vector - this fminunc may not work; and in case you have a one-dimensional function that you want to optimize, you can look in the Octave documentation for fminunc for additional details.

So, that's how we optimized our trial example with this simple quadratic cost function. How do we apply this to logistic regression? In logistic regression we have a parameter vector theta, and I'm going to use a mix of Octave notation and sort of math notation, but I hope this explanation will be clear. Our parameter vector theta comprises the parameters theta 0 through theta n, and because Octave indexes vectors starting from 1 rather than from 0, theta 0 is actually written theta(1) in Octave, theta 1 is going to be written theta(2), and so on, up to theta n, which is going to be written theta(n+1).

So what we need to do then is write a cost function that captures the cost function for logistic regression. Concretely, the cost function needs to return jVal - so you need some code to compute J of theta - and we also need to give it the gradient. So gradient(1) is going to be some code to compute the partial derivative with respect to theta 0, the next entry is the partial derivative with respect to theta 1, and so on, as in the sketch below.
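As a rough illustration, a logistic regression cost function along these lines might look like the following sketch. It assumes a design matrix X with a leading column of ones and a label vector y of zeros and ones are passed in alongside theta; those extra arguments and names are assumptions for this sketch, not something the lecture spells out.

    function [jVal, gradient] = costFunction(theta, X, y)
      % X is an m-by-(n+1) matrix whose first column is all ones; y is an m-by-1 vector of 0/1 labels.
      m = length(y);
      h = 1 ./ (1 + exp(-X * theta));                                % sigmoid hypothesis h_theta(x) for each example
      jVal = (1 / m) * sum(-y .* log(h) - (1 - y) .* log(1 - h));    % logistic regression cost J(theta)
      gradient = (1 / m) * (X' * (h - y));                           % gradient(1) is the derivative w.r.t. theta_0, gradient(2) w.r.t. theta_1, ...
    end

With the extra arguments, the call would become something like fminunc(@(t)(costFunction(t, X, y)), initialTheta, options).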
Once again, the entries are gradient(1), gradient(2), and so on, rather than gradient(0), gradient(1), because Octave indexes vectors starting from one rather than from zero. But the main concept I hope you take away from this slide is that what you need to do is write a function that returns the cost function value and returns the gradient. And so, in order to apply this to logistic regression, or even to linear regression if you want to use these optimization algorithms for linear regression, what you need to do is plug in the appropriate code to compute these things over here.

So, now you know how to use these advanced optimization algorithms. Because these algorithms use a sophisticated optimization library, it makes the code just a little bit more opaque and so maybe a little bit harder to debug. But because these algorithms often run much faster than gradient descent, quite typically, whenever I have a large machine learning problem, I will use these algorithms instead of gradient descent. And with these ideas, hopefully you'll be able to get logistic regression and also linear regression to work on much larger problems.

So, that's it for advanced optimization concepts. In the next and final video on logistic regression, I want to tell you how to take the logistic regression algorithm that you already know about and make it work also on multi-class classification problems.