For logistic regression, we previously talked about two types of optimization algorithms. We talked about how to use gradient descent to optimize the cost function J of theta, and we also talked about advanced optimization methods: ones that require that you provide a way to compute your cost function J of theta and a way to compute the derivatives. In this video, we'll show how you can adapt both of those techniques, both gradient descent and the more advanced optimization techniques, in order to have them work for regularized logistic regression.

So, here's the idea. We saw earlier that logistic regression can also be prone to overfitting if you fit it with very high order polynomial features like this, where g is the sigmoid function. In particular, you can end up with a hypothesis whose decision boundary is an overly complex and extremely contorted function that really isn't such a great hypothesis for this training set. More generally, if you have logistic regression with a lot of features, not necessarily polynomial ones but just a lot of features, you can end up with overfitting.

This was our cost function for logistic regression. If we want to modify it to use regularization, all we need to do is add to it the following term: lambda over 2m times the sum from j equals 1 up to n, rather than the sum from j equals 0, of theta j squared. This has the effect of penalizing the parameters theta 1, theta 2, and so on up to theta n from being too large. And if you do this, then even though you're fitting a very high order polynomial with a lot of parameters, so long as you apply regularization and keep the parameters small, you're more likely to get a decision boundary that maybe looks more like this, one that looks more reasonable for separating the positive and the negative examples. So, when using regularization, even when you have a lot of features, the regularization can help take care of the overfitting problem.
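Just as an illustration, here is a minimal Octave sketch of this regularized cost. It assumes X is an m-by-(n+1) design matrix whose first column is all ones, y is an m-by-1 vector of 0/1 labels, theta is an (n+1)-by-1 parameter vector, and lambda is the regularization parameter; these variable names are only for illustration, not prescribed by the slides.

    sigmoid = @(z) 1 ./ (1 + exp(-z));               % g(z), the sigmoid function
    m = length(y);                                   % number of training examples
    h = sigmoid(X * theta);                          % hypothesis h_theta(x) for every example
    J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...   % original logistic regression cost
        + (lambda/(2*m)) * sum(theta(2:end).^2);              % penalty on theta_1..theta_n only, not theta_0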
How do we actually implement this? Well, for the original gradient descent algorithm, this was the update we had: we would repeatedly perform the following update to theta j. This slide looks a lot like the previous one for linear regression, but what I'm going to do is write the update for theta 0 separately. So, the first line is the update for theta 0, and the second line is now my update for theta 1 up to theta n, because I'm going to treat theta 0 separately.

In order to modify this algorithm to use the regularized cost function, all I need to do, pretty similar to what we did for linear regression, is to just modify the second update rule as follows. Once again, this cosmetically looks identical to what we had for linear regression. But of course it is not the same algorithm, because now the hypothesis is defined using the sigmoid function. So this is not the same algorithm as regularized linear regression, because the hypothesis is different, even though the update I wrote down looks cosmetically the same as what we had earlier when we were working out gradient descent for regularized linear regression.

And just to wrap up this discussion, this term here in the square brackets is, of course, the new partial derivative with respect to theta j of the new cost function J of theta, where J of theta here is the cost function we defined on the previous slide, the one that does use regularization. So, that's gradient descent for regularized logistic regression.
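Under the same assumed variable names as in the earlier sketch, plus a learning rate alpha, one iteration of this regularized gradient descent could look roughly like this in Octave (again, a sketch rather than anything prescribed by the course):

    m = length(y);                                              % number of training examples
    h = 1 ./ (1 + exp(-X * theta));                             % current predictions h_theta(x)
    grad0    = (1/m) * (X(:,1)' * (h - y));                     % theta_0 term: no regularization
    gradRest = (1/m) * (X(:,2:end)' * (h - y)) ...
               + (lambda/m) * theta(2:end);                     % theta_1..theta_n: extra (lambda/m)*theta_j
    theta = theta - alpha * [grad0; gradRest];                  % simultaneous update of all parameters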
Let's talk about how to get regularized logistic regression to work using the more advanced optimization methods. Just to remind you, for those methods what we needed to do was define a function, called costFunction, that takes as input the parameter vector theta. Once again, in the equations we've been writing here we used zero-indexed vectors, so we had theta 0 up to theta n. But because Octave indexes vectors starting from 1, theta 0 is written in Octave as theta(1), theta 1 is written in Octave as theta(2), and so on, down to theta n, which is written as theta(n+1). What we needed to do was provide this costFunction and then pass it in to what we saw earlier: we would use fminunc with @costFunction, and so on. fminunc stands for function minimization unconstrained, and it will take the cost function and minimize it for us.

The two main things that the cost function needed to return were, first, jVal. For that, we need to write code to compute the cost function J of theta. Now, when we're using regularized logistic regression, the cost function J of theta changes; in particular, the cost function now needs to include this additional regularization term at the end as well. So, when you compute J of theta, be sure to include that term at the end.
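For example, a call along these lines is one way to wire this up in Octave. Here costFunction is a hypothetical helper (sketched a little further below), the extra arguments X, y, and lambda are illustrative, n is assumed to be the number of features, and the option values are just example settings:

    initialTheta = zeros(n + 1, 1);                       % theta_0..theta_n, stored 1-indexed in Octave
    options = optimset('GradObj', 'on', 'MaxIter', 400);  % tell fminunc we supply the gradient ourselves
    [optTheta, functionVal, exitFlag] = ...
        fminunc(@(t)(costFunction(t, X, y, lambda)), initialTheta, options);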
And then the other thing that this costFunction needs to return is the gradient. So gradient(1) needs to be set to the partial derivative of J of theta with respect to theta 0, gradient(2) needs to be set to the partial derivative with respect to theta 1, and so on. Once again, the index is off by one, because of the one-based indexing that Octave uses.

Looking at these terms: this first term over here, which we actually worked out on a previous slide, is equal to this. It doesn't change, because the derivative with respect to theta 0 doesn't change compared to the version without regularization. The other terms do change. In particular, the derivative with respect to theta 1, which we also worked out on the previous slide, is equal to the original term plus lambda over m times theta 1. Just so we make sure we parse this correctly, we can add parentheses here, so that the summation doesn't extend over the last term. And similarly, this other term here looks like this, with the additional term we had on the previous slide that corresponds to the gradient from the regularization objective.

So if you implement this costFunction and pass it into fminunc, or into one of those other advanced optimization techniques, that will minimize the new regularized cost function J of theta, and the parameters you get out will be the ones that correspond to logistic regression with regularization. So, now you know how to implement regularized logistic regression.
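Putting the pieces together, a hypothetical costFunction along these lines would return both jVal and the gradient just described. This is a sketch under the same illustrative argument names as before, not the course's official implementation:

    function [jVal, gradient] = costFunction(theta, X, y, lambda)
      m = length(y);                                        % number of training examples
      h = 1 ./ (1 + exp(-X * theta));                       % sigmoid hypothesis h_theta(x)
      jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...
             + (lambda/(2*m)) * sum(theta(2:end).^2);       % regularize theta(2:end) only, not theta(1)
      gradient = (1/m) * (X' * (h - y));                    % unregularized partial derivatives
      gradient(2:end) = gradient(2:end) ...
                        + (lambda/m) * theta(2:end);        % add (lambda/m)*theta_j for j >= 1
    end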
When I walk around Silicon Valley, and I live here in Silicon Valley, there are a lot of engineers that are, frankly, making a ton of money for their companies using machine learning algorithms. And I know we've only been studying this stuff for a little while. But if you understand linear regression, logistic regression, the advanced optimization algorithms, and regularization, then by now, frankly, you probably know quite a lot more machine learning than many of the Silicon Valley engineers out there having very successful careers, making tons of money for their companies or building products using machine learning algorithms. So, congratulations. You've actually come a long way, and you actually know enough to apply this stuff and get it to work for many problems. So congratulations for that.

But of course, there's still a lot more that we want to teach you. In the next set of videos after this one, we'll start to talk about a very powerful class of non-linear classifiers. Whereas with linear regression and logistic regression you can form polynomial terms, it turns out that there are much more powerful non-linear classifiers than polynomial regression. In the next set of videos after this one, I'll start telling you about them, so that you have even more powerful learning algorithms than you have now to apply to different problems.