1 00:00:00,000 --> 00:00:04,267 [MUSIC] 2 00:00:04,267 --> 00:00:07,702 Okay so in the first course of this specialization, 3 00:00:07,702 --> 00:00:11,930 we introduced this block diagram or flow chart for regression. 4 00:00:11,930 --> 00:00:14,230 As well as a bunch of other machine learning tasks, but 5 00:00:14,230 --> 00:00:15,770 I just want to walk through this again. 6 00:00:17,290 --> 00:00:19,590 So we're going to assume that we have some training data. 7 00:00:19,590 --> 00:00:21,710 Okay, so all of a sudden I've said the word training data, 8 00:00:21,710 --> 00:00:23,800 and I hadn't said that to this point. 9 00:00:23,800 --> 00:00:26,950 That's a topic we discussed at a very high level 10 00:00:26,950 --> 00:00:28,560 in the first course of the specialization, 11 00:00:28,560 --> 00:00:33,660 and we're gonna discuss the concepts of training data, test data, validation sets, 12 00:00:33,660 --> 00:00:38,720 lots of other ideas about fitting our models and assessing our fits. 13 00:00:38,720 --> 00:00:42,000 And choosing between models and all these different topics. 14 00:00:42,000 --> 00:00:44,080 We're gonna discuss that in this course. 15 00:00:44,080 --> 00:00:48,800 But for right now, if you don't know what it means to have training data, go back to 16 00:00:48,800 --> 00:00:54,040 the first course of this specialization, watch that video or look it up online. 17 00:00:54,040 --> 00:00:57,770 For now, I'm gonna assume that you know that we have some training data and 18 00:00:57,770 --> 00:01:02,610 we're using our training data to fit a specific model. 19 00:01:02,610 --> 00:01:06,510 And for the rest of this module, that's gonna suffice, 20 00:01:06,510 --> 00:01:08,360 we're only looking at training data. 
21 00:01:08,360 --> 00:01:11,620 And all the other discussion about validation sets, test sets, 22 00:01:11,620 --> 00:01:13,780 and everything like this, choosing between models, 23 00:01:13,780 --> 00:01:17,930 assessing the fit of the models, that's going to come later in this course. 24 00:01:17,930 --> 00:01:21,040 Okay, so when I talk about data in the rest of this module, 25 00:01:21,040 --> 00:01:23,453 I'm assuming we're talking about just our training set, 26 00:01:23,453 --> 00:01:27,870 that we've already done some split and we have our training data, okay. 27 00:01:28,900 --> 00:01:31,740 Think I've said enough on that. 28 00:01:31,740 --> 00:01:37,109 So, we have our training data and in this case, 29 00:01:37,109 --> 00:01:41,671 that's gonna be some table of house id, 30 00:01:41,671 --> 00:01:47,329 house square feet and then the house sales price. 31 00:01:51,844 --> 00:01:56,210 And we have this big table of all these quantities. 32 00:01:56,210 --> 00:02:00,700 And then what we're going to do, is extract some features. 33 00:02:00,700 --> 00:02:04,510 So maybe, actually, at this point instead of specifically saying house square feet, 34 00:02:04,510 --> 00:02:07,150 I can just say that there's some set of house attributes. 35 00:02:07,150 --> 00:02:12,300 We might have things in addition to square feet. 36 00:02:12,300 --> 00:02:17,790 And so one step we're gonna have to do is figure out what input we're gonna use for 37 00:02:17,790 --> 00:02:19,570 our regression model. 38 00:02:19,570 --> 00:02:22,340 We're gonna talk about more general features later in the course, but for 39 00:02:22,340 --> 00:02:27,340 now we're just assuming a simple setup where we're gonna choose one of our house 40 00:02:27,340 --> 00:02:31,810 attributes, call that the input to this regression model, and work from there. 
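As a rough illustration of the feature-extraction step just described, here is a minimal Python sketch. The table contents, the column names (`house_id`, `sqft`, `bedrooms`, `price`), and the helper `select_input` are all made up for this example and do not come from the lecture; the only idea taken from it is choosing a single house attribute as the regression input.

```python
# Hypothetical training table: column names and values are invented for illustration.
training_data = [
    {"house_id": 1, "sqft": 1000.0, "bedrooms": 2, "price": 200000.0},
    {"house_id": 2, "sqft": 1500.0, "bedrooms": 3, "price": 300000.0},
    {"house_id": 3, "sqft": 2000.0, "bedrooms": 3, "price": 400000.0},
    {"house_id": 4, "sqft": 2500.0, "bedrooms": 4, "price": 500000.0},
]


def select_input(rows, attribute, target="price"):
    """Pick one house attribute as the input x and the sales price as the output y."""
    x = [float(row[attribute]) for row in rows]
    y = [float(row[target]) for row in rows]
    return x, y


# Simple setup: use square feet as the single input to the regression model.
x, y = select_input(training_data, "sqft")
```

Swapping `"sqft"` for `"bedrooms"` would select a different attribute as the input without touching the rest of the pipeline.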
41 00:02:31,810 --> 00:02:36,880 So, we've talked about using square feet as our selected input 42 00:02:38,140 --> 00:02:40,350 and then we're gonna use our machine learning model, 43 00:02:40,350 --> 00:02:46,700 which is regression, to predict our house sales prices. 44 00:02:46,700 --> 00:02:51,936 So y hat represents our predicted 45 00:02:53,654 --> 00:02:57,416 house sales price. 46 00:02:57,416 --> 00:02:59,630 And how are we predicting it? 47 00:02:59,630 --> 00:03:05,346 Well, based on some estimated line or curve, 48 00:03:05,346 --> 00:03:11,369 so this is our, f hat is our estimated function. 49 00:03:14,701 --> 00:03:21,040 That's fit from our data set, or, specifically our training data set. 50 00:03:21,040 --> 00:03:25,540 And how are we gonna determine f hat? 51 00:03:25,540 --> 00:03:28,540 Or how are we gonna estimate this function? 52 00:03:28,540 --> 00:03:33,770 Well, first we're gonna need to describe some quality metric that says 53 00:03:33,770 --> 00:03:35,690 I should describe what y is. 54 00:03:35,690 --> 00:03:38,510 I already mentioned this earlier, but let's write it down explicitly. 55 00:03:38,510 --> 00:03:41,520 This is the sales price, the actual 56 00:03:43,450 --> 00:03:48,290 sales price of our houses, and we're gonna compare the actual 57 00:03:48,290 --> 00:03:53,720 sales price to the predicted sales price using any given f hat. 58 00:03:53,720 --> 00:03:55,290 We're gonna say how well did we do? 59 00:03:56,770 --> 00:03:58,540 And that's what the quality metric is. 60 00:03:58,540 --> 00:04:03,828 So there's gonna be some error in our predicted values. 61 00:04:09,285 --> 00:04:13,966 And the machine learning algorithm we're gonna use to fit these 62 00:04:13,966 --> 00:04:18,140 regression models is gonna try to minimize that error. 63 00:04:18,140 --> 00:04:22,600 So it's gonna search over all these functions to reduce the error 64 00:04:24,480 --> 00:04:26,550 in these predicted values. 
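To make the loop of model, quality metric, and algorithm a bit more concrete, here is a minimal Python sketch under two assumptions that this part of the lecture does not state: f hat is a straight line, and the quality metric is the residual sum of squares (RSS). The data values and the function names `predict`, `rss`, and `fit_line` are invented for illustration.

```python
# Hypothetical training data: square feet (input x) and actual sales price (y).
sqft = [1000.0, 1500.0, 2000.0, 2500.0]
price = [210000.0, 290000.0, 410000.0, 490000.0]


def predict(w0, w1, x):
    """Predicted prices y hat from the line f hat(x) = w0 + w1 * x."""
    return [w0 + w1 * xi for xi in x]


def rss(y, y_hat):
    """Assumed quality metric: sum of squared errors between actual and predicted prices."""
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))


def fit_line(x, y):
    """Closed-form least-squares intercept and slope, i.e. the line that minimizes RSS."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


w0_hat, w1_hat = fit_line(sqft, price)
y_hat = predict(w0_hat, w1_hat, sqft)
error_fitted = rss(price, y_hat)

# Any other candidate line, for example 200 dollars per square foot with no
# intercept, should score no better on the same metric.
error_other = rss(price, predict(0.0, 200.0, sqft))
```

Here the search over functions is replaced by a closed-form solution, which is only possible because of the straight-line and RSS assumptions; a different model or metric would typically need an iterative search.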
65 00:04:26,550 --> 00:04:29,008 Okay, so this is our overall flow chart and 66 00:04:29,008 --> 00:04:33,246 we're gonna walk through the various components of this throughout this 67 00:04:33,246 --> 00:04:36,202 module and in more depth in the rest of this course. 68 00:04:36,202 --> 00:04:36,702 [MUSIC]