For linear regression, we had previously worked out two learning algorithms: one based on gradient descent and one based on the normal equation. In this video we will take those two algorithms and generalize them to the case of regularized linear regression.

Here's the optimization objective that we came up with last time for regularized linear regression. The first part is our usual objective for linear regression, and we now have this additional regularization term, where lambda is our regularization parameter. We would like to find parameters theta that minimize this cost function, this regularized cost function J of theta.

Previously, we were using gradient descent for the original cost function, without the regularization term, and we had the following algorithm for linear regression without regularization: we repeatedly update the parameters theta j as follows, for j equals 0, 1, 2, up through n. Let me take this and just write the case for theta zero separately. So I'm just going to write the update for theta zero separately from the update for the parameters theta 1, theta 2, theta 3, and so on up to theta n. I haven't changed anything yet, right? This is just writing the update for theta zero separately from the updates for theta 1 through theta n. The reason I want to do this is that you may remember that for regularized linear regression, we penalize the parameters theta 1, theta 2, and so on up to theta n, but we don't penalize theta zero. So when we modify this algorithm for regularized linear regression, we're going to end up treating theta zero slightly differently.
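For reference, the objective being described can be written out as follows, where h_theta is the hypothesis, m is the number of training examples, and n is the number of features:

$$
J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]
$$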
Concretely, if we want to take this algorithm and modify it to use the regularized objective, all we need to do is take this term at the bottom and modify it as follows: we add the term lambda over m, times theta j. If you implement this, then you have gradient descent for minimizing the regularized cost function J of theta.

Concretely, I'm not going to do the calculus to prove it, but if you look at the term written in square brackets, and you know calculus, it's possible to prove that that term is the partial derivative of J of theta with respect to theta j, using the new definition of J of theta that includes the regularization term. And similarly, the term up on top, the one I'm drawing a box around, is still the partial derivative of J of theta with respect to theta zero.

If you look at the update for theta j, it's possible to show something pretty interesting. Concretely, theta j gets updated as theta j, minus alpha times something, and then there is this other term that depends on theta j. So if you group together all the terms that depend on theta j, you can show that this update can be written equivalently as follows. All I did was take theta j — here it's theta j times 1 — and this term is lambda over m; there's also an alpha, so you end up with alpha times lambda over m multiplying theta j. And this term here, one minus alpha times lambda over m, is a pretty interesting term; it has a pretty interesting effect.
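Written out, the updates being described are the following (the theta zero update is unchanged, since theta zero is not regularized):

$$
\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}
$$

$$
\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\,\theta_j\right], \qquad j = 1, 2, \ldots, n
$$

and grouping the terms that depend on theta j gives the equivalent form

$$
\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}
$$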
Concretely, this term, one minus alpha times lambda over m, is going to be a number that's usually a little bit less than 1, because alpha times lambda over m is positive, and if your learning rate is small and m is large, it's usually pretty small. So this term is going to be a number that's usually a little bit less than one; think of it as a number like 0.99, say.

So the effect of our update to theta j is that theta j gets replaced by theta j times 0.99. Multiplying by 0.99 has the effect of shrinking theta j a little bit towards zero, so it makes theta j a bit smaller; more formally, the square of theta j gets smaller. And then the second term is actually exactly the same as the original gradient descent update that we had before we added all this regularization stuff.

So hopefully this update makes sense: when we're using regularized linear regression, what we're doing is, on every iteration, multiplying theta j by a number that is a little bit less than one, so we're shrinking the parameter a little bit, and then performing a similar update as before. Of course, that's just the intuition behind what this particular update is doing; mathematically, it is exactly gradient descent on the cost function J of theta that we defined on the previous slide, the one that includes the regularization term.

Gradient descent was just one of our two algorithms for fitting a linear regression model.
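To make this concrete, here is a minimal Octave sketch of one iteration of the regularized update, written in vectorized form. It assumes X is the m-by-(n+1) design matrix with a leading column of ones, y is the m-by-1 vector of labels, theta is (n+1)-by-1, and alpha and lambda have already been chosen; the variable names are just illustrative.

```octave
m = length(y);
h = X * theta;                          % hypothesis h_theta(x) for all m examples
grad = (1 / m) * (X' * (h - y));        % gradient of the unregularized cost
reg = (lambda / m) * theta;             % regularization contribution ...
reg(1) = 0;                             % ... except for theta_0, which is not penalized
theta = theta - alpha * (grad + reg);   % one gradient descent step
```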
The second algorithm was the one based on the normal equation, where what we did was create the design matrix X, in which each row corresponds to a separate training example, and a vector y, an m-dimensional vector containing the labels from the training set. So whereas X is an m by (n plus 1) dimensional matrix, y is an m-dimensional vector. In order to minimize the cost function J, we found that one way to do it is to set theta equal to X transpose X, inverse, times X transpose y. (I'm leaving some room here to fill in more in a moment.) This value of theta minimizes the cost function J of theta when we are not using regularization.

Now that we are using regularization, you could derive what the minimum is — and just to give you a sense of how to derive it, you take the partial derivative with respect to each parameter, set it to zero, and then do a bunch of math — and you can show that a formula like this minimizes the cost function. Concretely, if you are using regularization, then this formula changes as follows: inside the parentheses, you end up with a matrix with a zero, then one, one, one, and so on down to the bottom. So this thing over here is a matrix whose upper-leftmost entry is zero, with ones on the rest of the diagonal and zeros everywhere else (I'm drawing this a little bit sloppily). As a concrete example, if n equals 2, then this matrix is going to be a three by three matrix.
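Written out, the regularized normal equation being described is

$$
\theta = \left(X^{T}X + \lambda
\begin{bmatrix}
0 & & & \\
 & 1 & & \\
 & & \ddots & \\
 & & & 1
\end{bmatrix}\right)^{-1} X^{T}y
$$

where the matrix multiplying lambda has a zero in its upper-left corner, ones on the rest of the diagonal, and zeros everywhere else.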
More generally, this matrix is an (n plus 1) by (n plus 1) dimensional matrix. So if n equals two, the matrix becomes something that looks like this: a zero, then ones on the rest of the diagonal, and zeros everywhere else. Once again, I'm not going to go through the derivation, which is frankly somewhat long and involved, but it is possible to prove that if you are using the new definition of J of theta, with the regularized objective, then this new formula for theta is the one that gives you the global minimum of J of theta.

So finally, I want to just quickly describe the issue of non-invertibility. This is relatively advanced material, so you should consider it optional; feel free to skip it, or if you listen to it and it doesn't quite make sense, don't worry about it either. Earlier, when I talked about the normal equation method, we also had an optional video on the non-invertibility issue, so this is another optional part, sort of an add-on to that earlier optional video.

Now consider a setting where m, the number of examples, is less than or equal to n, the number of features. If you have fewer examples than features, then the matrix X transpose X will be non-invertible, or singular; another term for this is that the matrix will be degenerate. If you implement this in Octave anyway and use the pinv function to take the pseudo-inverse, it will kind of do the right thing, although it's not clear that it will give you a very good hypothesis, even though numerically the Octave pinv function will give you a result that kind of makes sense.
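To make that concrete, here is a toy Octave sketch (the sizes and random numbers are made up purely for illustration): with m equal to 3 examples and n equal to 5 features, X transpose X is 6 by 6 but has rank at most 3, so it is singular, yet pinv still returns a usable result.

```octave
X = [ones(3, 1), rand(3, 5)];   % m = 3 examples, n = 5 features, plus the intercept column
y = rand(3, 1);
rank(X' * X)                    % at most 3, so the 6-by-6 matrix X'*X is singular
theta = pinv(X' * X) * X' * y;  % the pseudo-inverse still yields a result
```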
But if you were taking just the regular inverse — which in Octave is denoted by the function inv — and trying to take the regular inverse of X transpose X, then in this setting you will find that X transpose X is singular, that is, non-invertible. And if you're doing this in a different programming language, using some linear algebra library to try to take the inverse of this matrix, it just might not work, because the matrix is non-invertible, or singular.

Fortunately, regularization also takes care of this for us. Concretely, so long as the regularization parameter lambda is strictly greater than zero, it is actually possible to prove that the matrix X transpose X plus lambda times this funny matrix here will not be singular; that is, it will be invertible. So using regularization also takes care of any non-invertibility issues of the X transpose X matrix.

So, you now know how to implement regularized linear regression. Using this, you'll be able to avoid overfitting even if you have lots of features and a relatively small training set, and this should let you get linear regression to work much better for many problems. In the next video, we'll take this regularization idea and apply it to logistic regression, so that you'll be able to get logistic regression to avoid overfitting and perform much better as well.
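As a final optional sketch, continuing the toy example above: you can check in Octave that adding the regularization term restores invertibility whenever lambda is strictly positive.

```octave
lambda = 1;                                  % any strictly positive value
L = eye(6);                                  % (n+1)-by-(n+1) identity ...
L(1, 1) = 0;                                 % ... with the upper-left entry zeroed out
rank(X' * X + lambda * L)                    % 6, so the regularized matrix is invertible
theta = (X' * X + lambda * L) \ (X' * y);    % regularized normal equation
```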