Well, for our third option for feature selection, we're gonna explore a completely different approach, which is using regularized regression to implicitly perform feature selection for us. And the algorithm we're gonna explore is called lasso. It's really fundamentally changed the fields of machine learning, statistics, and engineering. It's had a lot of impact in a number of applications, and it's a really interesting approach.

Let's recall regularized regression in the context of ridge regression first. Remember, we were balancing between the fit of our model on our training data and a measure of the magnitude of our coefficients, where we said that smaller coefficient magnitudes indicated that the model was not as overfit as one with crazy, large magnitudes. And we introduced a tuning parameter, lambda, which balanced between these two competing objectives.

So for our measure of fit, we looked at the residual sum of squares. And in the case of ridge regression, for our measure of the magnitude of the coefficients, we used what's called the L2 norm, in this case the 2-norm squared, which is the sum of each of our feature weights squared.

Okay, this ridge regression penalty, we said, encourages our weights to be small. But one thing I want to emphasize is that it encourages them to be small, but not exactly 0. We can see this if we look at the coefficient path that we described for ridge regression, where we see the magnitudes of our coefficients shrinking and shrinking towards 0 as we increase our lambda value. And we said that in the limit as lambda goes to infinity, the coefficients become exactly 0. But for any finite value of lambda, even a really, really large value of lambda, we're still just going to have very, very small coefficients, but they won't be exactly 0.

So why does it matter that they're not exactly 0? Why am I emphasizing this concept of the coefficients being 0 so much?
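To make that shrinking-but-never-exactly-zero behavior concrete, here is a minimal sketch, not from the lecture itself, that traces a ridge coefficient path on made-up synthetic data using scikit-learn's Ridge, whose alpha parameter plays the role of the lambda tuning parameter described above. The feature count, data, and lambda grid are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Made-up synthetic data: 8 features, only the first 3 actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_w = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

# Ridge objective: RSS(w) + lambda * ||w||_2^2  (the L2 norm squared).
# Trace the coefficient path: the weights shrink toward 0 as lambda grows,
# but for any finite lambda none of them becomes exactly 0.
for lam in [1e-2, 1.0, 10.0, 100.0, 1e4, 1e6]:
    model = Ridge(alpha=lam).fit(X, y)  # alpha is sklearn's name for lambda
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"lambda={lam:>8g}  exactly-zero coefficients: {n_zero}  "
          f"max |w|: {np.max(np.abs(model.coef_)):.4f}")
```

Even at the largest lambda in this grid, the count of coefficients that are exactly zero stays at zero; the weights just keep getting smaller.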
Well, this is the concept of sparsity that we talked about before. If we have coefficients that are exactly 0, that's really important for the efficiency of our predictions, because we can just completely remove all the features whose coefficients are 0 from our prediction operation and use only the other coefficients and the other features. And likewise, for interpretability, if we say that one of the coefficients is exactly 0, what we're saying is that that feature is not in our model. So that is doing our feature selection.

So a question, though, is whether we can use regularization to get at this idea of doing feature selection, instead of what we talked about before. Before, when we were talking about all subsets or greedy algorithms, we were searching over a discrete set of possible solutions: the solution that included the first and the fifth feature, or the second and the seventh, or any of that entire collection of discrete solutions.

But what we'd like to ask here is whether we can start with, for example, our full model, and then shrink some coefficients not just towards 0, but exactly to 0. Because if we shrink them exactly to 0, then we're knocking out those coefficients, we're knocking those features out of our model. And instead, the non-zero coefficients are going to indicate our selected features.
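As a preview of where this is headed, here is a minimal sketch, again on made-up synthetic data rather than anything from the lecture, using scikit-learn's Lasso, which replaces the L2 penalty with an L1 penalty on the weights. With the illustrative alpha chosen here, the irrelevant features come out with coefficients that are exactly 0, and the non-zero coefficients mark the selected features. Note that scikit-learn scales its fit term by 1/(2n), so its alpha is not numerically identical to the lambda in the objective written above, though it plays the same balancing role.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Same kind of made-up synthetic data: 8 features, only the first 3 relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_w = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

# Lasso uses an L1 penalty on the weights instead of ridge's L2 penalty.
# Unlike ridge, this penalty can drive some coefficients exactly to 0,
# so the non-zero coefficients indicate the selected features.
lasso = Lasso(alpha=0.5).fit(X, y)  # alpha plays the role of lambda
selected = np.flatnonzero(lasso.coef_ != 0)
print("coefficients:            ", np.round(lasso.coef_, 3))
print("selected feature indices:", selected)
```

Running a sketch like this, the five irrelevant features get coefficients that are exactly 0 and drop out of the model, which is precisely the implicit feature selection the lecture is about to develop.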