This course consists of six different modules, and let's preview what these are about.

To start with, we're gonna talk about simple regression. Why is it called simple regression? Because we're gonna focus on having just a single input, for example the size of your house. When we think about defining the relationship between this single input and our output, the house sales price, we're just gonna look at fitting a very simple line to the data. That's why it's called simple regression.

One thing we're gonna need to do in order to discover the best-fitting line is define something called a goodness-of-fit metric that allows us to say that some lines fit better than other lines. According to this metric, we can then search over all possible lines to find the one that fits the data best.

In the context of this goodness-of-fit metric, we're then going to discuss some important optimization techniques that are much more general than regression itself. This is an example of one of those general concepts that we're going to see again, many times, in the specialization. The method we're going to explore here is something called gradient descent.

In the context of our regression problem, where we have just a simple line, the question is: what are the slope and intercept of the line that minimize this goodness-of-fit metric? The blue curve I'm showing here shows how good the fit is, where lower means a better fit, and we're gonna minimize over all possible combinations of intercept and slope to find the one that fits best. This gradient descent algorithm is an iterative algorithm that takes multiple steps and eventually converges to the optimal solution, the one that minimizes this metric measuring how well the line fits.

Then, once we've found this minimum, so we've found the best intercept and slope, we're gonna talk about how to interpret these estimated parameters, and then how to use them to form our predictions.
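The lecture stays at the level of pictures, but as a minimal sketch of the ideas above, here is gradient descent on the intercept and slope of a line, using residual sum of squares as the goodness-of-fit metric (one standard choice). The house data, step size, and iteration count are all made up for illustration.

```python
import numpy as np

# Toy data: house size (sq. ft.) vs. sale price. Made-up numbers.
size = np.array([1000., 1500., 1800., 2200., 2600., 3000.])
price = np.array([250e3, 320e3, 360e3, 430e3, 490e3, 540e3])

def rss(w0, w1):
    """Residual sum of squares: our goodness-of-fit metric (lower = better fit)."""
    return np.sum((price - (w0 + w1 * size)) ** 2)

w0, w1 = 0.0, 0.0   # intercept, slope
step = 1e-8         # step size: too large diverges, too small crawls
for _ in range(1000):
    resid = price - (w0 + w1 * size)
    w0 -= step * (-2 * np.sum(resid))          # d RSS / d intercept
    w1 -= step * (-2 * np.sum(resid * size))   # d RSS / d slope

print(f"intercept {w0:.2f}, slope {w1:.2f}, RSS {rss(w0, w1):.3e}")
print(f"predicted price for a 2000 sq. ft. house: {w0 + w1 * 2000:,.0f}")
```

With unscaled inputs like these, the slope converges quickly while the intercept moves very slowly; in practice you would rescale the features or run many more iterations.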
We're then gonna move from simple regression to multiple regression. Multiple regression allows us to fit more complicated relationships between our single input and an output, so more complicated than just a line, as well as to consider more inputs in our model. For our housing application, maybe in addition to a measure of the size of the house, we have attributes like how many bedrooms there are, how many bathrooms, what the lot size is, and what year the house was built. We might have a whole bunch of different attributes of the house that we wanna incorporate into forming our predictions, and that's what multiple regression allows us to do.

Having learned how to fit these models to data, we're then gonna turn to how to assess whether the fit is any good, for example for forming predictions. The fit shown here between house size and house price seems to be pretty good, but if we go to a much more complicated model, we see that it's doing really crazy things. If you look carefully, this fitted function is actually explaining our observed data, these circles, really well. But what it's not doing is generalizing to predictions for new houses that we haven't observed. At least, I wouldn't use this to form predictions of my house value, because I wouldn't really believe that this is the relationship between house size and price. So the question is, what's going wrong here? This is something that we call overfitting to the data, and we're gonna talk about it at great length in this module of the course.

In particular, we're gonna define three different measures of error: one called training error, one called test error, and one called our generalization or true error. The generalization error is actually the error we really care about, but we're gonna show that it's something we can't actually attain, so we have to use proxies instead. And we're gonna look at these measures of error as a function of model complexity to describe this notion of overfitting.
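Here's a small sketch of that training-versus-test picture, again with everything synthetic: polynomial fits of increasing degree stand in for "model complexity", numpy's polyfit stands in for the regression learner, and half the points are held out as a test set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a smooth true relationship plus noise.
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Hold out every other point as a test set.
train, test = np.arange(0, 30, 2), np.arange(1, 30, 2)

for degree in (1, 3, 9, 12):
    coefs = np.polyfit(x[train], y[train], degree)  # fit on training data only
    train_err = np.mean((np.polyval(coefs, x[train]) - y[train]) ** 2)
    test_err = np.mean((np.polyval(coefs, x[test]) - y[test]) ** 2)
    print(f"degree {degree:2d}: train error {train_err:.4f}, test error {test_err:.4f}")
```

Training error keeps falling as the model gets more complex, while test error, our attainable proxy for the true generalization error, eventually climbs back up. That U shape is the signature of overfitting.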
Then we're gonna discuss these ideas in the context of something called the bias-variance trade-off. What this describes is the fact that simple models tend to be really well behaved, but they're often just too simple to describe the relationship that we're interested in. On the other hand, really complex models are really flexible, so they have the potential to fit really intricate relationships that might describe pretty well what's happening with the data. Unfortunately, these really complex models can also have very strange behavior. That tension is what's called the bias-variance trade-off, and machine learning is all about exploring this trade-off.
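To make the trade-off concrete, here's one more synthetic sketch: refit a simple and a complex model on many freshly drawn datasets, so that the average fit's systematic error approximates the squared bias and the fit-to-fit spread approximates the variance.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    return np.sin(2 * np.pi * x)    # the "true" relationship we pretend not to know

x_grid = np.linspace(0.05, 0.95, 50)

for degree in (1, 9):               # a simple model class vs. a complex one
    preds = []
    for _ in range(200):            # 200 independently drawn training sets
        x = rng.uniform(0, 1, 20)
        y = true_f(x) + rng.normal(scale=0.3, size=x.shape)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_grid))
    preds = np.array(preds)

    bias2 = np.mean((preds.mean(axis=0) - true_f(x_grid)) ** 2)  # systematic error
    variance = np.mean(preds.var(axis=0))                        # spread across refits
    print(f"degree {degree}: bias^2 {bias2:.3f}, variance {variance:.3f}")
```

The degree-1 line misses the curvature (high bias, low variance), while the degree-9 fits track it on average but swing wildly from dataset to dataset (low bias, high variance).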