The fifth module was then all about feature selection. So, to motivate this, we talked about the fact that every house might have a really long list of attributes associated with it, and for reasons of both interpretability as well as efficiency in forming our predictions, we want to select a sparse subset of these features to include in our model.

So to perform this feature selection, the first thing that we talked about was a set of methods that explicitly searched over models with different numbers of features, and the exhaustive approach was something that's called all subsets selection. But then we also talked about greedy procedures like forward selection, and saw that these gave perhaps suboptimal solutions, but were much more efficient than the all subsets procedure.

But instead of explicitly searching over models with different sets of features, we talked about how to use lasso regression to implicitly do this feature selection, where the objective looks just like ridge, but instead of using the L2 norm, we're using the L1 norm of our coefficients. And we showed how that led to sparse solutions. So in particular, if we look at the coefficient path associated with lasso, we saw that for any value of lambda we ended up, typically, with a sparse solution, getting sparser and sparser as we increase lambda. And this was in contrast to what we saw for ridge, where the coefficients just got smaller and smaller; here we actually end up with the sparse solutions that lead to this idea of feature selection.

Then, to optimize this lasso objective, we talked about a coordinate descent algorithm, where we solved a collection of one-dimensional optimization problems, iterating through the different dimensions of our objective, so in particular the different features of our regression model. And what we saw was that for lasso we ended up setting our coefficients according to something that we called soft thresholding, where, in a certain range of the correlation term that we described in this module, we're gonna set our coefficient exactly to zero. And outside that range, relative to our least squares solution, we're gonna shrink the value of the estimated coefficient.

So lasso can lead to these sparse solutions, and has shown impact in a really, really large set of different applied domains.
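To make that coordinate descent update concrete, here is a minimal sketch in plain NumPy. It is not the course's reference implementation: the function names (soft_threshold, lasso_coordinate_descent), the tolerance, and the iteration cap are illustrative choices, the intercept is omitted, and the features are assumed to be normalized (unit-norm columns) so that the per-coordinate update reduces to the simple soft-thresholding rule described above.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft thresholding: exactly zero inside [-lam/2, lam/2], shrunk toward zero outside."""
    if rho < -lam / 2:
        return rho + lam / 2
    elif rho > lam / 2:
        return rho - lam / 2
    return 0.0

def lasso_coordinate_descent(X, y, lam, tol=1e-6, max_iter=1000):
    """Cyclic coordinate descent for the lasso, assuming columns of X are normalized to unit norm."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iter):
        max_step = 0.0
        for j in range(d):
            # Residual with feature j's current contribution removed.
            r_j = y - X @ w + X[:, j] * w[j]
            # Correlation between feature j and that partial residual (the "rho_j" of the module).
            rho_j = X[:, j] @ r_j
            w_new = soft_threshold(rho_j, lam)
            max_step = max(max_step, abs(w_new - w[j]))
            w[j] = w_new
        if max_step < tol:
            break
    return w
```

Increasing lam in this sketch drives more of the coefficients exactly to zero, which is the behavior shown in the lasso coefficient path and the source of the implicit feature selection.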
In our last module, we talked about a set of nonparametric techniques called nearest neighbor and kernel regression. And one nearest neighbor was a really, really simple procedure, the most basic procedure that you would imagine doing. But we showed that it actually could perform really well, especially when you have lots of data. And what this method does is, if you're going to estimate the value of your house, you just look for the most similar house, look at its value, and predict your value to be exactly the same.

Then we talked about making this a little bit more robust by looking at a set of k-nearest neighbors, and then said, well, you can also think about weighting these k-nearest neighbors, when you're going to compute your predicted value, by how similar they are to you, and then averaging across these values to form your estimated prediction.

And this led directly to an idea of kernel regression, where instead of just weighting a collection of neighbors, you actually weight every observation in your data set, but a lot of the kernels that we specify actually set those weights to zero outside a certain range and decay them within a given range. And so what this leads to is an idea of these very local fits, and we talked about how kernel regression was equivalent to forming these locally constant fits, which was in contrast to our parametric models that formed these global fits.

So here's a visualization of our kernel regression that we saw in this module, and we see how it leads to these really nice, smooth fits. And these fits are very adaptive to the complexity of the data that we see, and can increase in complexity as we get more and more data.
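As a concrete illustration of that kernel-weighted prediction, here is a minimal sketch of a locally constant (Nadaraya-Watson style) kernel regression fit in NumPy. The Epanechnikov kernel and the bandwidth value are illustrative assumptions rather than the course's exact choices, and the function names are hypothetical; weighted k-nearest neighbors can be seen as the special case where only the k closest points receive nonzero weight.

```python
import numpy as np

def epanechnikov_kernel(distances, bandwidth):
    """Weights decay within the bandwidth and are exactly zero outside it."""
    scaled = np.abs(distances) / bandwidth
    return np.where(scaled <= 1.0, 0.75 * (1.0 - scaled ** 2), 0.0)

def kernel_regression_predict(x_query, x_train, y_train, bandwidth):
    """Locally constant fit: a kernel-weighted average of all training targets."""
    weights = epanechnikov_kernel(x_train - x_query, bandwidth)
    if weights.sum() == 0.0:
        # No training point falls within the bandwidth of this query point.
        return np.nan
    return np.sum(weights * y_train) / np.sum(weights)

# Usage sketch: predict on a grid of query points from noisy 1-D data.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 10, size=200))
y_train = np.sin(x_train) + rng.normal(scale=0.3, size=200)
x_grid = np.linspace(0, 10, 50)
y_hat = np.array([kernel_regression_predict(x, x_train, y_train, bandwidth=0.5)
                  for x in x_grid])
```

Shrinking the bandwidth makes the fit more local and more wiggly, while growing it smooths the fit out, which is what makes these local fits so adaptive to the amount and complexity of the data.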