By now, you've seen a couple of different learning algorithms: linear regression and logistic regression. They work well for many problems, but when you apply them to certain machine learning applications, they can run into a problem called overfitting that can cause them to perform very poorly. What I'd like to do in this video is explain to you what this overfitting problem is, and in the next few videos after this, we'll talk about a technique called regularization that will allow us to ameliorate, or reduce, this overfitting problem and get these learning algorithms to work much better.

So what is overfitting? Let's keep using our running example of predicting housing prices with linear regression, where we want to predict the price as a function of the size of the house. One thing we could do is fit a linear function to this data, and if we do that, maybe we get that sort of straight-line fit to the data. But this isn't a very good model. Looking at the data, it seems pretty clear that as the size of the house increases, the housing prices plateau, or kind of flatten out, as we move to the right, and so this algorithm does not fit the training data well. We call this problem underfitting, and another term for it is that the algorithm has high bias. Both of these roughly mean that it's just not fitting the training data very well. The term is kind of a historical or technical one, but the idea is that if we fit a straight line to the data, then it's as if the algorithm has a very strong preconception, or a very strong bias, that housing prices are going to vary linearly with their size. Despite the evidence to the contrary, that preconception, that bias, still causes it to fit a straight line, and this ends up being a poor fit to the data.
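To make that concrete, here is a minimal sketch in Python (not from the lecture) of fitting a straight line to made-up housing numbers that flatten out; every figure in it is a hypothetical illustration:

import numpy as np

# Hypothetical data: prices plateau as size grows (illustrative numbers only).
sizes = np.array([1.0, 1.5, 2.0, 2.5, 3.0])              # size of house
prices = np.array([200.0, 330.0, 380.0, 405.0, 415.0])   # price

# Straight-line hypothesis h(x) = theta0 + theta1 * x, fit by least squares.
theta1, theta0 = np.polyfit(sizes, prices, 1)            # polyfit returns [slope, intercept]
predictions = theta0 + theta1 * sizes

print(np.round(predictions))   # the line overshoots at both ends and undershoots in the middle,
print(prices)                  # because a straight line cannot bend to follow the plateau

However you pick the made-up numbers, a single straight line leaves large errors on data like this, which is the underfitting, or high-bias, behaviour described above.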
Now, in the middle, we could fit a quadratic function instead, and with this data set, if we fit the quadratic function, maybe we get that kind of curve, and that works pretty well. At the other extreme would be if we were to fit, say, a fourth-order polynomial to the data. Here we have five parameters, theta zero through theta four, and with that we can actually fit a curve that passes through all five of our training examples. You might get a curve that looks like this. On the one hand, it seems to do a very good job of fitting the training set, in that it passes through all of my data points. But this is still a very wiggly curve, right? It's going up and down all over the place, and we don't actually think that's such a good model for predicting housing prices. So this problem we call overfitting, and another term for it is that the algorithm has high variance. The term high variance is another historical or technical one, but the intuition is that if we're fitting such a high-order polynomial, then the hypothesis can fit almost any function, and this space of possible hypotheses is just too large, too variable, and we don't have enough data to constrain it to give us a good hypothesis. That's what's called overfitting. And in the middle, there isn't really a name, but I'm just going to write "just right," where a second-degree polynomial, a quadratic function, seems to be just right for fitting this data.

To recap a bit: the problem of overfitting comes when we have too many features; the learned hypothesis may then fit the training set very well.
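As a quick numeric sketch of that recap, again with made-up numbers and NumPy's own polynomial fitting rather than anything from the lecture, you can watch the training cost collapse as the degree of the polynomial grows:

import numpy as np

x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])                   # hypothetical sizes
y = np.array([200.0, 330.0, 380.0, 405.0, 415.0])         # hypothetical prices
m = len(x)

for degree in (1, 2, 4):
    theta = np.polyfit(x, y, degree)                       # least-squares fit of that degree
    J = np.sum((np.polyval(theta, x) - y) ** 2) / (2 * m)  # squared-error training cost
    print(degree, round(J, 4))

# Degree 1 leaves a large cost (underfitting), degree 2 is "just right",
# and degree 4 passes through all five points, so its training cost is essentially zero.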
So your cost function may actually be very close to zero, or maybe even exactly zero, but you may then end up with a curve like this that tries too hard to fit the training set, so that it even fails to generalize to new examples and fails to predict prices on new examples as well. Here the term "generalize" refers to how well a hypothesis applies even to new examples, that is, to data, to houses, that it has not seen in the training set.

On this slide, we looked at overfitting for the case of linear regression. A similar thing can apply to logistic regression as well. Here is a logistic regression example with two features, x1 and x2. One thing we could do is fit logistic regression with just a simple hypothesis like this, where, as usual, g is the sigmoid function. If you do that, you end up with a hypothesis that tries to use, maybe, just a straight line to separate the positive and the negative examples, and that doesn't look like a very good fit to the data. So, once again, this is an example of underfitting, or of the hypothesis having high bias. In contrast, if you were to add these quadratic terms to your features, then you could get a decision boundary that might look more like this. That's a pretty good fit to the data, probably about as good as we could get on this training set. And finally, at the other extreme, if you were to fit a very high-order polynomial, if you were to generate lots of high-order polynomial terms of the features, then logistic regression may contort itself, may try really hard to find a decision boundary that fits your training data, or go to great lengths to contort itself to fit every single training example well. And if the features x1 and x2 are for predicting, say, whether a breast tumor is malignant or benign, this really doesn't look like a very good hypothesis for making predictions. So, once again, this is an instance of overfitting, of a hypothesis having high variance, and one that is unlikely to generalize well to new examples.
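Here is a minimal sketch of those two hypothesis shapes in Python. The theta values are hand-picked, hypothetical numbers chosen only to show the shapes of the decision boundaries, not parameters learned from any data set:

import numpy as np

def g(z):
    # Sigmoid function.
    return 1.0 / (1.0 + np.exp(-z))

# Simple hypothesis: h(x) = g(theta0 + theta1*x1 + theta2*x2).
# Its decision boundary h(x) = 0.5 is a straight line.
def h_linear(x1, x2, theta=(-3.0, 1.0, 1.0)):
    return g(theta[0] + theta[1] * x1 + theta[2] * x2)

# With quadratic terms added: h(x) = g(theta0 + theta1*x1 + theta2*x2 + theta3*x1^2 + theta4*x2^2).
# With these made-up thetas the boundary is the circle x1^2 + x2^2 = 1.
def h_quadratic(x1, x2, theta=(-1.0, 0.0, 0.0, 1.0, 1.0)):
    return g(theta[0] + theta[1] * x1 + theta[2] * x2
             + theta[3] * x1 ** 2 + theta[4] * x2 ** 2)

print(h_linear(2.0, 2.0), h_quadratic(2.0, 2.0))   # both above 0.5: predicted positive

Adding many more high-order terms (x1^2 * x2, x1^3, and so on) gives the hypothesis enough freedom to contort that boundary around every single training example, which is the overfitting case just described.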
Later in this course, when we talk about debugging and diagnosing things that can go wrong with learning algorithms, we'll give you specific tools to recognize when overfitting, and also when underfitting, may be occurring. But for now, let's talk about what we can do to address overfitting if we think it is occurring. In the previous examples, we had one- or two-dimensional data, so we could just plot the hypothesis, see what was going on, and select the appropriate degree polynomial. So, earlier, for the housing prices example, we could just plot the hypothesis and maybe see that it was fitting the sort of very wiggly function that goes all over the place to predict housing prices, and we could then use figures like these to select an appropriate degree polynomial. So plotting the hypothesis could be one way to try to decide what degree polynomial to use.
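Here is a minimal sketch of that plot-and-eyeball approach, assuming matplotlib is available and using the same made-up housing numbers as before:

import numpy as np
import matplotlib.pyplot as plt

sizes = np.array([1.0, 1.5, 2.0, 2.5, 3.0])               # hypothetical training set
prices = np.array([200.0, 330.0, 380.0, 405.0, 415.0])

grid = np.linspace(0.9, 3.1, 200)
for degree in (1, 2, 4):
    coeffs = np.polyfit(sizes, prices, degree)             # fit each candidate degree
    plt.plot(grid, np.polyval(coeffs, grid), label=f"degree {degree}")

plt.scatter(sizes, prices, color="black", label="training examples")
plt.xlabel("size of house")
plt.ylabel("price")
plt.legend()
plt.show()   # eyeball which curve looks "just right"

This only works because there is a single input feature to put on the horizontal axis, which is the limitation discussed next.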
But that doesn't always work. In fact, more often we may have learning problems where we just have a lot of features, and then it's not just a matter of selecting what degree polynomial to use. And when we have so many features, it also becomes much harder to plot the data, and much harder to visualize it, to decide which features to keep and which to discard. So, concretely, if we're trying to predict housing prices, sometimes we just have a lot of different features, and all of these features seem, you know, maybe kind of useful. But if we have a lot of features and very little training data, then overfitting can become a problem.

In order to address overfitting, there are two main options for things that we can do. The first option is to try to reduce the number of features. Concretely, one thing we could do is manually look through the list of features and use that to decide which are the more important features, and therefore which features we should keep and which we should throw out. Later in this course, we'll also talk about model selection algorithms, which are algorithms for automatically deciding which features to keep and which features to throw out. This idea of reducing the number of features can work well and can reduce overfitting, and when we talk about model selection, we'll go into it in much greater depth. But the disadvantage is that, by throwing away some of the features, you are also throwing away some of the information you have about the problem. For example, maybe all of those features are actually useful for predicting the price of a house, so maybe we don't want to throw some of our information, or some of our features, away.

The second option, which we'll talk about in the next few videos, is regularization (there's a small preview sketch at the end of this section). Here, we're going to keep all the features, but we're going to reduce the magnitude, or the values, of the parameters theta j. And this method works well, we'll see, when we have a lot of features, each of which contributes a little bit to predicting the value of y.
That was the situation in the housing price prediction example, where we could have a lot of features, each of which is, you know, somewhat useful, so maybe we don't want to throw them away. So this describes the idea of regularization at a very high level. And I realize that all of these details probably don't make sense to you yet, but in the next video we'll start to formulate exactly how to apply regularization and exactly what regularization means. And then we'll start to figure out how to use this to make our learning algorithms work well and avoid overfitting.
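As a rough preview of that second option, keeping every feature but shrinking the parameters, here is a minimal sketch that adds an L2-style penalty to the degree-4 housing fit from earlier. The penalty form, the value of lambda, and the convention of leaving theta 0 unpenalized are assumptions made for this sketch; the actual formulation of regularization is developed in the next videos:

import numpy as np

sizes = np.array([1.0, 1.5, 2.0, 2.5, 3.0])               # hypothetical data again
prices = np.array([200.0, 330.0, 380.0, 405.0, 415.0])

# Degree-4 polynomial features [1, x, x^2, x^3, x^4]: every feature is kept.
X = np.vander(sizes, 5, increasing=True)

def fit(lmbda):
    penalty = lmbda * np.eye(5)
    penalty[0, 0] = 0.0                                    # assumption: theta_0 is not penalized
    # Normal-equation style solution of the penalized least-squares problem.
    return np.linalg.solve(X.T @ X + penalty, X.T @ prices)

print(np.round(fit(0.0), 1))   # lambda = 0: large, oscillating coefficients that thread every point
print(np.round(fit(1.0), 1))   # lambda = 1: same five features, but much smaller parameter values

The point of the sketch is only the comparison between the two printed parameter vectors: the regularized fit keeps all the features yet ends up with a much tamer, smoother hypothesis.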