By now, you've seen a couple of different learning algorithms: linear regression and logistic regression. They work well for many problems, but when you apply them to certain machine learning applications, they can run into a problem called overfitting that can cause them to perform very poorly. What I'd like to do in this video is explain to you what this overfitting problem is, and in the next few videos after this, we'll talk about a technique called regularization that will allow us to ameliorate, or reduce, this overfitting problem and get these learning algorithms to work much better.

So what is overfitting? Let's keep using our running example of predicting housing prices with linear regression, where we want to predict the price as a function of the size of the house. One thing we could do is fit a linear function to this data, and if we do that, maybe we get that sort of straight-line fit to the data. But this isn't a very good model. Looking at the data, it seems pretty clear that as the size of the house increases, the housing prices plateau, or kind of flatten out, as we move to the right, and so this algorithm does not fit the training data well. We call this problem underfitting, and another term for it is that the algorithm has high bias. Both of these roughly mean that it's just not fitting the training data very well. The term is kind of a historical or technical one, but the idea is that if we fit a straight line to the data, then it's as if the algorithm has a very strong preconception, or a very strong bias, that housing prices are going to vary linearly with their size. Despite the evidence to the contrary, that preconception, that bias, still causes it to fit a straight line, and this ends up being a poor fit to the data.
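To make that concrete, here is a minimal sketch in Python (not from the lecture) of fitting a straight line to made-up housing numbers that flatten out; every figure in it is a hypothetical illustration:

import numpy as np

# Hypothetical data: prices plateau as size grows (illustrative numbers only).
sizes = np.array([1.0, 1.5, 2.0, 2.5, 3.0])              # size of house
prices = np.array([200.0, 330.0, 380.0, 405.0, 415.0])   # price

# Straight-line hypothesis h(x) = theta0 + theta1 * x, fit by least squares.
theta1, theta0 = np.polyfit(sizes, prices, 1)            # polyfit returns [slope, intercept]
predictions = theta0 + theta1 * sizes

print(np.round(predictions))   # the line overshoots at both ends and undershoots in the middle,
print(prices)                  # because a straight line cannot bend to follow the plateau

However you pick the made-up numbers, a single straight line leaves large errors on data like this, which is the underfitting, or high-bias, behaviour described above.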
Now, in the middle, we could fit a quadratic function instead, and with this data set, if we fit the quadratic function, maybe we get that kind of curve, and that works pretty well. At the other extreme would be if we were to fit, say, a fourth-order polynomial to the data. Here we have five parameters, theta zero through theta four, and with that we can actually fit a curve that passes through all five of our training examples. You might get a curve that looks like this. On the one hand, it seems to do a very good job of fitting the training set, in that it passes through all of my data points. But this is still a very wiggly curve, right? It's going up and down all over the place, and we don't actually think that's such a good model for predicting housing prices. So this problem we call overfitting, and another term for it is that the algorithm has high variance. The term high variance is another historical or technical one, but the intuition is that if we're fitting such a high-order polynomial, then the hypothesis can fit almost any function, and this space of possible hypotheses is just too large, too variable, and we don't have enough data to constrain it to give us a good hypothesis. That's what's called overfitting. And in the middle, there isn't really a name, but I'm just going to write "just right," where a second-degree polynomial, a quadratic function, seems to be just right for fitting this data.

To recap a bit: the problem of overfitting comes when we have too many features; the learned hypothesis may then fit the training set very well.
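As a quick numeric sketch of that recap, again with made-up numbers and NumPy's own polynomial fitting rather than anything from the lecture, you can watch the training cost collapse as the degree of the polynomial grows:

import numpy as np

x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])                   # hypothetical sizes
y = np.array([200.0, 330.0, 380.0, 405.0, 415.0])         # hypothetical prices
m = len(x)

for degree in (1, 2, 4):
    theta = np.polyfit(x, y, degree)                       # least-squares fit of that degree
    J = np.sum((np.polyval(theta, x) - y) ** 2) / (2 * m)  # squared-error training cost
    print(degree, round(J, 4))

# Degree 1 leaves a large cost (underfitting), degree 2 is "just right",
# and degree 4 passes through all five points, so its training cost is essentially zero.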
So your cost function may actually be very close to zero, or maybe even exactly zero, but you may then end up with a curve like this that tries too hard to fit the training set, so that it even fails to generalize to new examples and fails to predict prices on new examples as well. Here the term "generalize" refers to how well a hypothesis applies even to new examples, that is, to data, to houses, that it has not seen in the training set.

On this slide, we looked at overfitting for the case of linear regression. A similar thing can apply to logistic regression as well. Here is a logistic regression example with two features, x1 and x2. One thing we could do is fit logistic regression with just a simple hypothesis like this, where, as usual, g is the sigmoid function. If you do that, you end up with a hypothesis that tries to use, maybe, just a straight line to separate the positive and the negative examples, and that doesn't look like a very good fit to the data. So, once again, this is an example of underfitting, or of the hypothesis having high bias. In contrast, if you were to add these quadratic terms to your features, then you could get a decision boundary that might look more like this. That's a pretty good fit to the data, probably about as good as we could get on this training set. And finally, at the other extreme, if you were to fit a very high-order polynomial, if you were to generate lots of high-order polynomial terms of the features, then logistic regression may contort itself, may try really hard to find a decision boundary that fits your training data, or go to great lengths to contort itself to fit every single training example well. And if the features x1 and x2 are for predicting, say, whether a breast tumor is malignant or benign, this really doesn't look like a very good hypothesis for making predictions. So, once again, this is an instance of overfitting, of a hypothesis having high variance, and one that is unlikely to generalize well to new examples.
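Here is a minimal sketch of those two hypothesis shapes in Python. The theta values are hand-picked, hypothetical numbers chosen only to show the shapes of the decision boundaries, not parameters learned from any data set:

import numpy as np

def g(z):
    # Sigmoid function.
    return 1.0 / (1.0 + np.exp(-z))

# Simple hypothesis: h(x) = g(theta0 + theta1*x1 + theta2*x2).
# Its decision boundary h(x) = 0.5 is a straight line.
def h_linear(x1, x2, theta=(-3.0, 1.0, 1.0)):
    return g(theta[0] + theta[1] * x1 + theta[2] * x2)

# With quadratic terms added: h(x) = g(theta0 + theta1*x1 + theta2*x2 + theta3*x1^2 + theta4*x2^2).
# With these made-up thetas the boundary is the circle x1^2 + x2^2 = 1.
def h_quadratic(x1, x2, theta=(-1.0, 0.0, 0.0, 1.0, 1.0)):
    return g(theta[0] + theta[1] * x1 + theta[2] * x2
             + theta[3] * x1 ** 2 + theta[4] * x2 ** 2)

print(h_linear(2.0, 2.0), h_quadratic(2.0, 2.0))   # both above 0.5: predicted positive

Adding many more high-order terms (x1^2 * x2, x1^3, and so on) gives the hypothesis enough freedom to contort that boundary around every single training example, which is the overfitting case just described.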
Later in this course, when we talk about debugging and diagnosing things that can go wrong with learning algorithms, we'll give you specific tools to recognize when overfitting, and also when underfitting, may be occurring. But for now, let's talk about what we can do to address overfitting if we think it is occurring. In the previous examples, we had one- or two-dimensional data, so we could just plot the hypothesis, see what was going on, and select the appropriate degree polynomial. So, earlier, for the housing prices example, we could just plot the hypothesis and maybe see that it was fitting the sort of very wiggly function that goes all over the place to predict housing prices, and we could then use figures like these to select an appropriate degree polynomial. So plotting the hypothesis could be one way to try to decide what degree polynomial to use.
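Here is a minimal sketch of that plot-and-eyeball approach, assuming matplotlib is available and using the same made-up housing numbers as before:

import numpy as np
import matplotlib.pyplot as plt

sizes = np.array([1.0, 1.5, 2.0, 2.5, 3.0])               # hypothetical training set
prices = np.array([200.0, 330.0, 380.0, 405.0, 415.0])

grid = np.linspace(0.9, 3.1, 200)
for degree in (1, 2, 4):
    coeffs = np.polyfit(sizes, prices, degree)             # fit each candidate degree
    plt.plot(grid, np.polyval(coeffs, grid), label=f"degree {degree}")

plt.scatter(sizes, prices, color="black", label="training examples")
plt.xlabel("size of house")
plt.ylabel("price")
plt.legend()
plt.show()   # eyeball which curve looks "just right"

This only works because there is a single input feature to put on the horizontal axis, which is the limitation discussed next.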
But that doesn't always work. In fact, more often we may have learning problems where we just have a lot of features, and then it's not just a matter of selecting what degree polynomial to use. And when we have so many features, it also becomes much harder to plot the data, and much harder to visualize it, to decide which features to keep and which to discard. So, concretely, if we're trying to predict housing prices, sometimes we just have a lot of different features, and all of these features seem, you know, maybe kind of useful. But if we have a lot of features and very little training data, then overfitting can become a problem.

In order to address overfitting, there are two main options for things that we can do. The first option is to try to reduce the number of features. Concretely, one thing we could do is manually look through the list of features and use that to decide which are the more important features, and therefore which features we should keep and which we should throw out. Later in this course, we'll also talk about model selection algorithms, which are algorithms for automatically deciding which features to keep and which features to throw out. This idea of reducing the number of features can work well and can reduce overfitting, and when we talk about model selection, we'll go into it in much greater depth. But the disadvantage is that, by throwing away some of the features, you are also throwing away some of the information you have about the problem. For example, maybe all of those features are actually useful for predicting the price of a house, so maybe we don't want to throw some of our information, or some of our features, away.

The second option, which we'll talk about in the next few videos, is regularization (there's a small preview sketch at the end of this section). Here, we're going to keep all the features, but we're going to reduce the magnitude, or the values, of the parameters theta j. And this method works well, we'll see, when we have a lot of features, each of which contributes a little bit to predicting the value of y.
That was the situation in the housing price prediction example, where we could have a lot of features, each of which is, you know, somewhat useful, so maybe we don't want to throw them away. So this describes the idea of regularization at a very high level. And I realize that all of these details probably don't make sense to you yet, but in the next video we'll start to formulate exactly how to apply regularization and exactly what regularization means. And then we'll start to figure out how to use this to make our learning algorithms work well and avoid overfitting.
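As a rough preview of that second option, keeping every feature but shrinking the parameters, here is a minimal sketch that adds an L2-style penalty to the degree-4 housing fit from earlier. The penalty form, the value of lambda, and the convention of leaving theta 0 unpenalized are assumptions made for this sketch; the actual formulation of regularization is developed in the next videos:

import numpy as np

sizes = np.array([1.0, 1.5, 2.0, 2.5, 3.0])               # hypothetical data again
prices = np.array([200.0, 330.0, 380.0, 405.0, 415.0])

# Degree-4 polynomial features [1, x, x^2, x^3, x^4]: every feature is kept.
X = np.vander(sizes, 5, increasing=True)

def fit(lmbda):
    penalty = lmbda * np.eye(5)
    penalty[0, 0] = 0.0                                    # assumption: theta_0 is not penalized
    # Normal-equation style solution of the penalized least-squares problem.
    return np.linalg.solve(X.T @ X + penalty, X.T @ prices)

print(np.round(fit(0.0), 1))   # lambda = 0: large, oscillating coefficients that thread every point
print(np.round(fit(1.0), 1))   # lambda = 1: same five features, but much smaller parameter values

The point of the sketch is only the comparison between the two printed parameter vectors: the regularized fit keeps all the features yet ends up with a much tamer, smoother hypothesis.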