As one example of a way to handle this bias-variance tradeoff, we're gonna talk about something called ridge regression, which not only includes a term that measures the fit of the function to the data, which is what we talked about before, but also incorporates a term that encodes what the model complexity is. Not quite directly, pretty indirectly, as we're gonna describe in this module.

But a key question then is how we're gonna define the balance between how much we emphasize the fit to the data versus this model complexity term. For this, in ridge regression there's a parameter that balances between these two terms. And to define this parameter, we're gonna discuss choosing it using something called cross validation. And this again is a tool that's much more general than just for regression; it's an idea for how to choose these tuning parameters in any machine learning model that we might look at.

Next we're gonna discuss a feature selection task. So for example, I have my house that I wanna list for sale, and I might have a really, really long list of house attributes associated with this house. And I wanna figure out which subset of attributes is really informative for assessing the value of my house. So, for example, maybe it doesn't really matter that my house has a microwave when I'm going to predict the value of the house. So, for reasons of interpretability, it can be really useful to do this feature selection task. And in addition, we're gonna show that if we have just a small set of features in our model, after we've done this feature selection, then that can lead to significant gains in efficiency when forming our predictions.

And so, to do this feature selection task, the first thing we're gonna talk about is ways to explicitly search between models that include different sets of features. But then we're gonna turn to a method that's really, really similar in spirit to ridge regression that allows us to do this feature selection task implicitly. In particular, again we're gonna have this measure of fit of our function to our data and a measure of the model complexity, but it's gonna be a different measure than what we use for ridge.
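To make this concrete, here is a sketch of the two objectives being contrasted, written in one standard notation; the symbols (RSS for the fit term, w for the coefficients, lambda for the tuning parameter) are assumptions for illustration rather than the lecture's exact notation:

    % Ridge regression: measure of fit plus an L2 measure of model complexity,
    % with lambda balancing how much each term is emphasized.
    \[
      \hat{w}^{\text{ridge}} = \arg\min_{w} \; \mathrm{RSS}(w) + \lambda \, \lVert w \rVert_2^2
    \]
    % Lasso: the same structure, but with a different complexity measure (the L1 norm),
    % which is what ends up driving some coefficients exactly to zero.
    \[
      \hat{w}^{\text{lasso}} = \arg\min_{w} \; \mathrm{RSS}(w) + \lambda \, \lVert w \rVert_1
    \]

Here lambda is exactly the balance parameter described above, and cross validation is the tool mentioned for choosing it.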
And this measure in particular is what's gonna lead to these, what are called, sparse solutions, where only a few of the features are actually present in our estimated model. And we're gonna use this lasso regression task as an opportunity to teach about another optimization method that's called coordinate descent. So we talked about gradient descent earlier, and this is another one of these really important optimization methods that we're gonna see again later in this specialization.

And what coordinate descent does is, instead of solving a big, high-dimensional optimization objective, it's gonna go coordinate by coordinate, so variable by variable, optimizing each in turn. So we're gonna end up making these axis-aligned moves as we iterate in this algorithm. So again, just like gradient descent, it's an iterative procedure, but it's a fundamentally different formulation for how these iterates are defined.

Finally, we're gonna conclude by discussing something called nearest neighbor regression, which is a really simple, but very, very powerful technique. So in the simplest case that we're gonna describe, if I'm interested in predicting the value of my house, what I'm gonna do is go through my data set and find the most similar house to mine. Then, I'm simply gonna look at how much that house sold for, and I'm gonna say that's what I'm predicting my house's sale price to be.

Well, you can generalize this idea of just looking at the most similar house to looking at a set of similar houses and then taking the average value of those houses as your prediction. But what you can also do is something that's called kernel regression, where you actually include every observation in your data set in forming your predicted value. But when you go to compute this average, you're gonna weight the houses by how close they are to you. So houses that are very similar, which are quote, unquote, nearby to you in the space of similarity, are gonna be weighted very heavily in this weighted average, and houses that are very dissimilar are gonna be down-weighted a lot. And this leads to these really nice fits for regression, and they're very adaptive: as you get more data, you can describe more and more complicated relationships.
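As a rough sketch of the weighted-average idea just described, here is a minimal kernel regression predictor; the Gaussian kernel, the bandwidth value, and the toy feature set are illustrative assumptions, not choices taken from the lecture:

    import numpy as np

    def kernel_regression_predict(x_query, X_train, y_train, bandwidth=1.0):
        # Distance from the query house to every house in the data set.
        dists = np.linalg.norm(X_train - x_query, axis=1)
        # Similar ("nearby") houses get large weights; dissimilar houses are down-weighted.
        weights = np.exp(-(dists ** 2) / (2.0 * bandwidth ** 2))
        # The prediction is the weighted average of all observed sale prices.
        return np.sum(weights * y_train) / np.sum(weights)

    # Toy data: [square feet in thousands, number of bathrooms] -> sale price.
    X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 2.0], [3.0, 3.0]])
    y_train = np.array([300_000.0, 400_000.0, 450_000.0, 600_000.0])

    print(kernel_regression_predict(np.array([1.6, 2.0]), X_train, y_train, bandwidth=0.5))

Shrinking the bandwidth toward zero recovers the single-nearest-neighbor behavior described above, while a larger bandwidth averages over more of the data set.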
So these methods are useful when you have lots of data, and we're gonna discuss this data versus complexity tradeoff in this module.

So in summary, we're gonna cover a lot of ground in this course. We're gonna talk about all different kinds of models for regression, but we're also gonna talk about very general purpose optimization algorithms like gradient descent and coordinate descent, and a whole bunch of concepts that are really foundational to machine learning, including things like the bias-variance tradeoff, cross validation for selecting tuning parameters, ideas of sparsity and overfitting, and how to do model selection and feature selection. So this is gonna be a really, really important course in our specialization.