As one example of a way to handle this bias-variance tradeoff, we're gonna talk about something called ridge regression, which not only includes a term that measures the fit of the function to the data, which is what we talked about before, but also incorporates a term that encodes what the model complexity is. Not quite directly, pretty indirectly, as we're gonna describe in this module.

But a key question then is how we're gonna define the balance between how much we emphasize the fit to the data versus this model complexity term. For this, in ridge regression there's a parameter that balances between these two terms. And to define this parameter, we're gonna discuss choosing it using something called cross validation. And this again is a tool that's much more general than just for regression; it's an idea for how to choose these tuning parameters in any machine learning model that we might look at.

Next we're gonna discuss a feature selection task. So for example, I have my house that I wanna list for sale, and I might have a really, really long list of house attributes associated with this house. And I wanna figure out which subset of attributes is really informative for assessing the value of my house. So, for example, maybe it doesn't really matter that my house has a microwave when I'm going to predict the value of the house. So, for reasons of interpretability, it can be really useful to do this feature selection task. And in addition, we're gonna show that if we have just a small set of features in our model, after we've done this feature selection, then that can lead to significant gains in efficiency when forming our predictions.

And so, to do this feature selection task, the first thing we're gonna talk about is ways to explicitly search between models that include different sets of features. But then we're gonna turn to a method that's really, really similar in spirit to ridge regression that allows us to do this feature selection task implicitly. In particular, again we're gonna have this measure of fit of our function to our data and a measure of the model complexity, but it's gonna be a different measure than what we use for ridge.
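To make this concrete, here is a sketch of the two objectives being contrasted, written in one standard notation; the symbols (RSS for the fit term, w for the coefficients, lambda for the tuning parameter) are assumptions for illustration rather than the lecture's exact notation:

    % Ridge regression: measure of fit plus an L2 measure of model complexity,
    % with lambda balancing how much each term is emphasized.
    \[
      \hat{w}^{\text{ridge}} = \arg\min_{w} \; \mathrm{RSS}(w) + \lambda \, \lVert w \rVert_2^2
    \]
    % Lasso: the same structure, but with a different complexity measure (the L1 norm),
    % which is what ends up driving some coefficients exactly to zero.
    \[
      \hat{w}^{\text{lasso}} = \arg\min_{w} \; \mathrm{RSS}(w) + \lambda \, \lVert w \rVert_1
    \]

Here lambda is exactly the balance parameter described above, and cross validation is the tool mentioned for choosing it.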
And this measure in particular is what's gonna lead to these, what are called, sparse solutions, where only a few of the features are actually present in our estimated model. And we're gonna use this lasso regression task as an opportunity to teach about another optimization method that's called coordinate descent. So we talked about gradient descent earlier, and this is another one of these really important optimization methods that we're gonna see again later in this specialization.

And what coordinate descent does is, instead of solving a big, high-dimensional optimization objective, it's gonna go coordinate by coordinate, so variable by variable, optimizing each in turn. So we're gonna end up making these axis-aligned moves as we iterate in this algorithm. So again, just like gradient descent, it's an iterative procedure, but it's a fundamentally different formulation for how these iterates are defined.

Finally, we're gonna conclude by discussing something called nearest neighbor regression, which is a really simple, but very, very powerful technique. So in the simplest case that we're gonna describe, if I'm interested in predicting the value of my house, what I'm gonna do is go through my data set and find the most similar house to mine. Then, I'm simply gonna look at how much that house sold for, and I'm gonna say that's what I'm predicting my house's sale price to be.

Well, you can generalize this idea of just looking at the most similar house to looking at a set of similar houses and then taking the average value of those houses as your prediction. But what you can also do is something that's called kernel regression, where you actually include every observation in your data set in forming your predicted value. But when you go to compute this average, you're gonna weight the houses by how close they are to you. So houses that are very similar, which are quote, unquote, nearby to you in the space of similarity, are gonna be weighted very heavily in this weighted average, and houses that are very dissimilar are gonna be down-weighted a lot. And this leads to these really nice fits for regression, and they're very adaptive: as you get more data, you can describe more and more complicated relationships.
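As a rough sketch of the weighted-average idea just described, here is a minimal kernel regression predictor; the Gaussian kernel, the bandwidth value, and the toy feature set are illustrative assumptions, not choices taken from the lecture:

    import numpy as np

    def kernel_regression_predict(x_query, X_train, y_train, bandwidth=1.0):
        # Distance from the query house to every house in the data set.
        dists = np.linalg.norm(X_train - x_query, axis=1)
        # Similar ("nearby") houses get large weights; dissimilar houses are down-weighted.
        weights = np.exp(-(dists ** 2) / (2.0 * bandwidth ** 2))
        # The prediction is the weighted average of all observed sale prices.
        return np.sum(weights * y_train) / np.sum(weights)

    # Toy data: [square feet in thousands, number of bathrooms] -> sale price.
    X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 2.0], [3.0, 3.0]])
    y_train = np.array([300_000.0, 400_000.0, 450_000.0, 600_000.0])

    print(kernel_regression_predict(np.array([1.6, 2.0]), X_train, y_train, bandwidth=0.5))

Shrinking the bandwidth toward zero recovers the single-nearest-neighbor behavior described above, while a larger bandwidth averages over more of the data set.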
So these methods are useful when you have lots of data, and we're gonna discuss this data versus complexity tradeoff in this module.

So in summary, we're gonna cover a lot of ground in this course. We're gonna talk about all different kinds of models for regression, but we're also gonna talk about very general purpose optimization algorithms like gradient descent and coordinate descent, and a whole bunch of concepts that are really foundational to machine learning, including things like the bias-variance tradeoff, cross validation for selecting tuning parameters, ideas of sparsity and overfitting, and how to do model selection and feature selection. So this is gonna be a really, really important course in our specialization.