In this video, I'd like to convey to you the main intuitions behind how regularization works. And we'll also write down the cost function that we'll use when we apply regularization. With the hand-drawn examples that we have on these slides, I think I'll be able to convey part of the intuition. But an even better way to see for yourself how regularization works is if you implement it and see it work for yourself. And if you do the appropriate exercises after this, you'll get the chance to see regularization in action for yourself.

So, here is the intuition. In the previous video, we saw that if we were to fit a quadratic function to this data, it gives us a pretty good fit. Whereas if we were to fit an overly high-order polynomial, we end up with a curve that may fit the training set very well, but overfits the data and doesn't generalize well.

Consider the following: suppose we were to penalize the parameters theta 3 and theta 4 and make them really small. Here's what I mean. Here is our optimization objective, or here is our optimization problem, where we minimize our usual squared-error cost function. Let's say I take this objective and modify it by adding 1000 times theta 3 squared, plus 1000 times theta 4 squared. 1000 is just some huge number I'm writing down. Now, if we were to minimize this function, the only way to make this new cost function small is if theta 3 and theta 4 are small, right? Because otherwise, if you have a thousand times theta 3 squared, this new cost function is going to be big.
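Written out (using the usual squared-error cost for linear regression over m training examples, with hypothesis h_theta), the modified objective from the slide is:

$$\min_{\theta}\; \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 \;+\; 1000\,\theta_3^2 \;+\; 1000\,\theta_4^2$$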
So when we minimize this new function, we are going to end up with theta 3 close to 0 and theta 4 close to 0, as if we're getting rid of those two terms over there. And if we do that, if theta 3 and theta 4 are close to 0, then we are left with a quadratic function. So we end up with a fit to the data that's a quadratic function, plus maybe tiny contributions from the terms with theta 3 and theta 4, which are very close to 0. And so we end up with essentially a quadratic function, which is good, because this is a much better hypothesis.

In this particular example, we looked at the effect of penalizing two of the parameter values for being large. More generally, here is the idea behind regularization: having small values for the parameters will usually correspond to having a simpler hypothesis. For our last example, we penalized just theta 3 and theta 4, and when both of these were close to zero, we wound up with a much simpler hypothesis that was essentially a quadratic function. But more broadly, if we penalize all the parameters, we can think of that as trying to give us a simpler hypothesis as well, because parameters close to zero are what gave us the quadratic function in this example. More generally still, it is possible to show that having smaller values of the parameters usually corresponds to smoother, simpler functions, which are therefore also less prone to overfitting.
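To make the theta 3, theta 4 example concrete, here is a minimal numpy sketch (my own illustration, not from the lecture; the data and numbers are made up). It fits a degree-4 polynomial to roughly quadratic data, with a huge penalty on theta 3 and theta 4, and the penalized coefficients come out very close to zero:

```python
import numpy as np

# Roughly quadratic data (synthetic, for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.standard_normal(30)

# Design matrix with columns [1, x, x^2, x^3, x^4] for theta_0 .. theta_4.
X = np.vander(x, 5, increasing=True)

# Diagonal penalty: 1000 on theta_3 and theta_4, nothing on the rest.
P = np.diag([0.0, 0.0, 0.0, 1000.0, 1000.0])

# Closed-form minimizer of ||X @ theta - y||^2 + theta @ P @ theta.
theta = np.linalg.solve(X.T @ X + P, X.T @ y)
print(np.round(theta, 4))  # theta_3 and theta_4 end up very close to zero
```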
I realize that the reasoning for why having all the parameters be small corresponds to a simpler hypothesis may not be entirely clear to you right now. And it is kind of hard to explain unless you implement it and see it for yourself. But I hope that the example of having theta 3 and theta 4 be small, and how that gave us a simpler hypothesis, helps explain why, or at least gives some intuition as to why, this might be true.

Let's look at a specific example. For housing price prediction, we may have the hundred features that we talked about, where maybe x1 is the size, x2 is the number of bedrooms, x3 is the number of floors, and so on. And unlike the polynomial example, we don't know that theta 3 and theta 4 are the high-order polynomial terms. So if we have just a set of a hundred features, it's hard to pick in advance which ones are less likely to be relevant. We have a hundred, or a hundred and one, parameters, and we don't know which ones to pick; we don't know which parameters to try to shrink.

So, in regularization, what we're going to do is take our cost function (here's my cost function for linear regression) and modify it to shrink all of my parameters, because I don't know which one or two to try to shrink. So I am going to modify my cost function to add a term at the end, like so, with square brackets here as well. I'm adding an extra regularization term at the end to shrink every single parameter, and so this term tends to shrink all of my parameters theta 1, theta 2, theta 3, up to theta 100.
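With the regularization term added inside the square brackets, the cost function on the slide is (here n = 100; the parameter lambda multiplying the new sum is discussed next):

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 \;+\; \lambda\sum_{j=1}^{n}\theta_j^2\right]$$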
By the way, by convention the summation here starts from one, so I am not actually going to penalize theta zero for being large. That's sort of the convention: the sum runs from j equals one through n, rather than j equals zero through n. But in practice it makes very little difference; whether you include theta zero or not makes very little difference to the results. But by convention, we usually regularize only theta 1 through theta 100.

Writing down our regularized optimization objective, our regularized cost function, again: here it is. Here's J of theta, where this term on the right is the regularization term, and lambda here is called the regularization parameter. What lambda does is control a trade-off between two different goals. The first goal, captured by the first term in the objective, is that we would like to fit the training data well; we would like to fit the training set well. And the second goal is that we want to keep the parameters small, and that's captured by the second term, by the regularization term. What lambda, the regularization parameter, does is control the trade-off between these two goals: the goal of fitting the training set well, and the goal of keeping the parameters small, and therefore keeping the hypothesis relatively simple, to avoid overfitting.

For our housing price prediction example, whereas previously, if we had fit a very high-order polynomial, we may have wound up with a very wiggly or curvy function like this.
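As a minimal sketch of how this cost might be computed (my own code, with made-up names, not from the lecture), note that the penalty skips theta[0], following the convention above:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus the regularization term.

    The penalty sum starts at j = 1, so theta[0] (the intercept)
    is not regularized.
    """
    m = len(y)
    residuals = X @ theta - y                  # h_theta(x) - y for each example
    penalty = lam * np.sum(theta[1:] ** 2)     # lambda * sum_{j>=1} theta_j^2
    return (np.sum(residuals ** 2) + penalty) / (2 * m)
```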
If you still fit a high-order polynomial, with all the polynomial features in there, but you just make sure to use this sort of regularized objective, then what you can get out is in fact a curve that isn't quite a quadratic function, but is much smoother and much simpler; maybe a curve like the magenta line that gives a much better hypothesis for this data. Once again, I realize it can be a bit difficult to see why shrinking the parameters can have this effect, but if you implement regularization yourself, you will be able to see this effect firsthand.

In regularized linear regression, if the regularization parameter lambda is set to be very large, then what will happen is that we will end up penalizing the parameters theta 1, theta 2, theta 3, theta 4 very highly. That is, if our hypothesis is the one down at the bottom, and we end up penalizing theta 1, theta 2, theta 3, theta 4 very heavily, then we end up with all of these parameters close to zero, right? Theta 1 will be close to zero; theta 2 will be close to zero; theta 3 and theta 4 will end up being close to zero. And if we do that, it's as if we're getting rid of these terms in the hypothesis, so that we're just left with a hypothesis saying that housing prices are equal to theta zero. And that is akin to fitting a flat horizontal straight line to the data. And this is an example of underfitting; in particular, this hypothesis, this straight line, just fails to fit the training set well. It's just a flat straight line; it doesn't go anywhere near most of the training examples.
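Here is a small numpy illustration of this effect (again a sketch of my own with synthetic data, not from the lecture): the same degree-8 polynomial fit with three values of lambda. With a very large lambda, every theta_j with j >= 1 is driven toward zero, leaving essentially the flat line h(x) = theta 0:

```python
import numpy as np

# Synthetic, roughly quadratic data.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = 1.0 + 4.0 * x - 3.0 * x**2 + 0.2 * rng.standard_normal(20)
X = np.vander(x, 9, increasing=True)  # columns 1, x, ..., x^8

for lam in (0.001, 1.0, 1e6):
    # Closed-form minimizer of the regularized cost; the (0, 0) entry
    # of the penalty matrix is zeroed so theta_0 is not penalized.
    D = lam * np.eye(9)
    D[0, 0] = 0.0
    theta = np.linalg.solve(X.T @ X + D, X.T @ y)
    print(f"lambda={lam:g}  max |theta_j| for j >= 1: {np.abs(theta[1:]).max():.5f}")
```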
Another way of saying this is that this hypothesis has too strong a preconception, or too high a bias, that housing prices are just equal to theta zero; despite the clear data to the contrary, it chooses to fit a flat horizontal line to the data. (I didn't draw that very well; it's just a horizontal flat line.)

So for regularization to work well, some care should be taken to choose a good value for the regularization parameter lambda as well. When we talk about model selection later in this course, we'll talk about a variety of ways for automatically choosing the regularization parameter lambda.

So, that's the idea behind regularization, and the cost function we use in order to apply it. In the next two videos, let's take these ideas and apply them to linear regression and to logistic regression, so that we can get them to avoid overfitting.