1 00:00:00,025 --> 00:00:04,790 [MUSIC] 2 00:00:04,790 --> 00:00:14,790 [MUSIC] 3 00:00:16,660 --> 00:00:21,380 So, in particular, we can also face this issue of overfitting when we get lots and 4 00:00:21,380 --> 00:00:22,160 lots of inputs. 5 00:00:22,160 --> 00:00:31,370 [MUSIC] 6 00:00:31,370 --> 00:00:35,220 That represents a very flexible model that can run into the same issues that we saw 7 00:00:35,220 --> 00:00:37,790 in our demo for polynomial regression. 8 00:00:37,790 --> 00:00:40,930 Or, more generally, we can just say when we have lots of features. 9 00:00:40,930 --> 00:00:44,560 So we'll say that capital D is very large. 10 00:00:44,560 --> 00:00:47,390 And these could be different functions of our inputs. 11 00:00:47,390 --> 00:00:50,960 But when you include lots and lots of these functions of our inputs 12 00:00:50,960 --> 00:00:55,060 in our regression model, then again we're in this place where the model has 13 00:00:55,060 --> 00:00:59,455 a lot of flexibility to explain the data, and we're subject to becoming overfit. 14 00:01:01,060 --> 00:01:05,470 But this issue of overfitting with respect to increasing model complexity 15 00:01:05,470 --> 00:01:08,300 is really relative to how much data we have. 16 00:01:08,300 --> 00:01:11,480 So let's talk about overfitting as a function of the number of 17 00:01:11,480 --> 00:01:13,000 observations that we have, 18 00:01:13,000 --> 00:01:15,170 as well as a function of the number of inputs, 19 00:01:15,170 --> 00:01:16,830 or the complexity of the model. 20 00:01:18,210 --> 00:01:22,040 So, in particular, if we have very few observations and 21 00:01:22,040 --> 00:01:27,520 N is small, then our models can rapidly become overfit to the data. 
22 00:01:27,520 --> 00:01:30,560 Because we have only a few points, and as we increase 23 00:01:30,560 --> 00:01:33,090 our model complexity, like the order of the polynomial, 24 00:01:33,090 --> 00:01:36,170 it becomes very easy to hit all of our observations, but 25 00:01:36,170 --> 00:01:39,780 in between those observations, things can go very wild. 26 00:01:41,230 --> 00:01:46,540 On the other hand, if we have lots and lots and lots of observations, even with 27 00:01:46,540 --> 00:01:51,805 really, really complex models, we're not gonna become 28 00:01:51,805 --> 00:01:56,920 overfit as quickly, because we have dense observations across our input, 29 00:01:56,920 --> 00:02:01,920 so the function is pinned down basically everywhere, 30 00:02:01,920 --> 00:02:04,660 in this example as a function of square feet. 31 00:02:04,660 --> 00:02:07,430 And it's not able to hit every observation, 32 00:02:07,430 --> 00:02:10,070 so it's not able to do these really crazy wiggly things. 33 00:02:11,390 --> 00:02:12,180 Okay. 34 00:02:12,180 --> 00:02:16,920 So, when we have just one input, 35 00:02:18,050 --> 00:02:23,740 like number of square feet of a house, in order to avoid overfitting 36 00:02:23,740 --> 00:02:27,880 we need to have observations that are very dense across number of square feet. 37 00:02:27,880 --> 00:02:33,630 So we need to have lots of representative examples of square feet and 38 00:02:33,630 --> 00:02:34,880 house value pairs. 39 00:02:37,218 --> 00:02:40,080 So this is actually pretty hard to do, to have lots of 40 00:02:40,080 --> 00:02:44,450 examples of houses of every possible square footage that you might see. 41 00:02:44,450 --> 00:02:46,710 So this is already a hard problem, but 42 00:02:46,710 --> 00:02:51,610 it becomes even harder when I increase the number of inputs in my model. 
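As a rough sketch of the few-versus-many observations point, and not something taken from the lecture's own demo, the snippet below fits the same order-5 polynomial to 6 observations and to 200 observations. Everything in it is an assumption made for illustration: the linear "true" price function, the input rescaled to [0, 1], the deterministic alternating noise (used instead of random noise so the run is repeatable), and the helper names `fit_polynomial`, `evaluate`, `make_data`, and `experiment`.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (pure Python)."""
    m = degree + 1
    # Build A = V^T V and b = V^T y, where V[i][k] = xs[i] ** k.
    A = [[sum(x ** (r + c) for x in xs) for c in range(m)] for r in range(m)]
    b = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution; coeffs[k] multiplies x ** k.
    coeffs = [0.0] * m
    for r in reversed(range(m)):
        tail = sum(A[r][c] * coeffs[c] for c in range(r + 1, m))
        coeffs[r] = (b[r] - tail) / A[r][r]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


def true_price(x):
    # Hypothetical noise-free relationship; x is square feet rescaled to [0, 1].
    return 1.0 + 2.0 * x


def make_data(n, noise=0.3):
    # n evenly spaced inputs; the noise alternates +noise, -noise (an assumption).
    xs = [i / (n - 1) for i in range(n)]
    ys = [true_price(x) + noise * (-1) ** i for i, x in enumerate(xs)]
    return xs, ys


def experiment(n, degree=5):
    xs, ys = make_data(n)
    coeffs = fit_polynomial(xs, ys, degree)
    # Largest miss on the observations the model was fit to.
    train_err = max(abs(evaluate(coeffs, x) - y) for x, y in zip(xs, ys))
    # Largest miss against the noise-free function on a dense grid.
    grid = [k / 1000 for k in range(1001)]
    test_err = max(abs(evaluate(coeffs, x) - true_price(x)) for x in grid)
    return {"train_err": train_err, "test_err": test_err}


small = experiment(n=6)    # very few observations
large = experiment(n=200)  # dense observations
print("N=6:  ", small)
print("N=200:", large)
```

With 6 observations the order-5 polynomial can pass through every point, so its training error is essentially zero while it strays from the underlying line. With 200 observations the same model cannot hit every point, and the fit stays much closer to the underlying line.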
43 00:02:51,610 --> 00:02:55,050 So, for example, just think of a model where I have square feet and 44 00:02:55,050 --> 00:02:57,150 number of bathrooms. 45 00:02:57,150 --> 00:03:01,600 And I want to cover all possible combinations of those two inputs 46 00:03:01,600 --> 00:03:05,220 in order to provide representative examples and avoid overfitting. 47 00:03:05,220 --> 00:03:06,180 Well, that's really, really hard. 48 00:03:06,180 --> 00:03:10,940 [MUSIC] 49 00:03:10,940 --> 00:03:12,840 [MUSIC] 50 00:03:12,840 --> 00:03:20,530 [MUSIC] 51 00:03:20,530 --> 00:03:26,880 [MUSIC] 52 00:03:26,880 --> 00:03:31,669 [MUSIC]
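To put rough numbers on why covering all combinations of inputs gets hard, here is a small counting sketch. The bin counts and the data set size are hypothetical values chosen only for illustration, and `cells_to_cover` is a made-up helper name; none of these come from the lecture.

```python
from math import prod


def cells_to_cover(bins_per_input):
    """Number of distinct input combinations when each input is split into bins."""
    return prod(bins_per_input)


# Hypothetical choices: 20 square-feet ranges and 5 possible bathroom counts.
sqft_bins = 20
bathroom_bins = 5
n_observations = 200  # hypothetical data set size

one_input = cells_to_cover([sqft_bins])
two_inputs = cells_to_cover([sqft_bins, bathroom_bins])

# The same data set is spread much more thinly once a second input is added.
print("combinations with 1 input: ", one_input)
print("combinations with 2 inputs:", two_inputs)
print("avg observations per combination, 1 input: ", n_observations / one_input)
print("avg observations per combination, 2 inputs:", n_observations / two_inputs)
```

The number of combinations is the product of the bin counts, so each added input multiplies how many representative examples are needed to keep the same density of observations.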