[MUSIC]

Okay, well, in place of our ridge regression objective, what if we took our measure of the magnitude of our coefficients to be what's called the L1 norm, where we sum over the absolute value of each one of our coefficients? We actually described this as a reasonable measure of the magnitude of the coefficients when we were discussing ridge regression last module. Well, the result is something that leads to sparse solutions, for reasons we're gonna go through in the remainder of this module. This objective is referred to as lasso regression, or L1 regularized regression.

Just like in ridge regression, lasso is governed by a tuning parameter, lambda, that controls how much we're favoring sparsity of our solutions relative to the fit on our training data. And just to be clear, when we're doing our feature selection task here, we're searching over a continuous space, this space of lambda values, with lambda governing the sparsity of the solution. That's in contrast to, for example, the all subsets or greedy approaches, where we talked about searching over a discrete set of possible solutions. So it's really a fundamentally different approach to doing feature selection.

Okay, but let's talk about what happens to our solution as we vary lambda. And again, just to emphasize, this lambda is a tuning parameter that in this case is balancing fit and sparsity.

So if lambda is equal to zero, what's gonna happen? Well, the penalty term completely disappears, and our objective is simply to minimize the residual sum of squares. That was our old least squares objective. So we're going to get that w hat lasso, the solution to our lasso problem, is exactly equal to w hat least squares. That is, it's equal to our unregularized solution.

In contrast, if we set lambda equal to infinity, this is where we're completely favoring this magnitude penalty and completely ignoring the residual sum of squares fit. In this case, what's the thing that minimizes the L1 norm? That is, what value of our regression coefficients has the smallest sum of absolute values?
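Before answering that, here is a minimal sketch of the objective itself. This is my own illustration using scikit-learn, not the course's code; its `alpha` plays the role of lambda (up to the library's internal 1/(2N) scaling of the fit term), and the data is synthetic.

```python
# Minimal sketch of the lasso objective: RSS plus lambda times the L1 norm.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def lasso_cost(X, y, w, lam):
    rss = np.sum((y - X @ w) ** 2)     # fit term: residual sum of squares
    l1_norm = np.sum(np.abs(w))        # magnitude term: sum of |coefficients|
    return rss + lam * l1_norm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=200)

# lambda = 0: the penalty term disappears and lasso reduces to least squares.
w_ls = LinearRegression(fit_intercept=False).fit(X, y).coef_
w_lasso = Lasso(alpha=1e-6, fit_intercept=False).fit(X, y).coef_  # alpha ~ 0
print(np.allclose(w_ls, w_lasso, atol=1e-3))       # True
print(lasso_cost(X, y, w_ls, lam=0.0))             # just the RSS
print(lasso_cost(X, y, w_ls, lam=10.0))            # RSS + 10 * ||w||_1
```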
Well, again, just like in ridge, when lambda is equal to infinity we get w hat lasso equal to the zero vector.

And if lambda is in between, we get that the 1-norm of our lasso solution is less than or equal to the 1-norm of our least squares solution, and greater than or equal to zero. I said the zero vector, sorry, I mean the number zero, since here it's just a number once we've taken the norm.

Okay. So as of yet, it's not clear why this L1 norm is leading to sparsity, and we're going to get to that, but let's first just explore this visually. One way we can see this is from the coefficient path. But first, let's just remember the coefficient path for ridge regression, where we saw that even for a large value of lambda, everything was in our model, just with small coefficients. So every w hat j is nonzero, but all the w hat j are small, for large values of our tuning parameter lambda.

In contrast, when we look at the coefficient path for lasso, we see a very different pattern. What we see is that at certain critical values of this tuning parameter lambda, certain ones of our features jump out of our model. So, for example, here square feet of the lot size disappears from the model; here number of bedrooms drops out almost simultaneously with number of floors and number of bathrooms, followed by the year the house was built. And let me just be clear that for, let's say, a value of lambda like this, we have a sparse set of features included in our model: the ones I've circled are the only features in our model, and all the other ones have dropped completely, exactly to zero.

And one thing that we see is that when lambda is very large, like the large value I showed on the previous plot, the only thing left in our model is square feet living. Note that square feet living still has a significantly large weight on it. So I'll say: large weight on square feet living when everything else is out of the model, meaning not included in the model.
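This contrast between the two coefficient paths can be reproduced in a few lines. A rough sketch on synthetic data (not the course's housing example), again using scikit-learn with `alpha` standing in for lambda:

```python
# Sketch of the ridge vs. lasso coefficient paths on synthetic data.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))                                   # 6 candidate features
y = X @ np.array([5.0, 2.0, 1.0, 0.5, 0.0, 0.0]) + rng.normal(size=300)

for lam in [0.01, 0.1, 1.0, 3.0]:
    # The factor len(y) on the ridge penalty is only a crude attempt to put the
    # two penalties on a comparable scale; it is an assumption for illustration.
    ridge_w = Ridge(alpha=lam * len(y)).fit(X, y).coef_
    lasso_w = Lasso(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam:5}: ridge nonzeros={np.count_nonzero(ridge_w)}, "
          f"lasso nonzeros={np.count_nonzero(np.abs(lasso_w) > 1e-10)}")

# Ridge keeps all 6 features at every lambda, just with increasingly small
# weights, while lasso drops features one after another as lambda grows,
# until only the strongest feature is left in the model.
```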
So square feet living is still very valuable to our predictions, and it would take quite a large lambda value to say that even square feet living was not relevant. Eventually, square feet living would be shrunk exactly to zero, but only for a much larger value of lambda. But if I go back to my ridge regression solution, I see that I had a much smaller weight on square feet living, because I was distributing weight across many other features in the model, so the individual impact of square feet living wasn't as clear.

[MUSIC]
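As a hypothetical illustration of that last point (a synthetic stand-in, not the course's housing data): with several nearly identical "size" features, ridge tends to spread the weight across all of them, while lasso tends to concentrate it on one. The variable names and penalty values below are my own choices for the sketch.

```python
# Synthetic stand-in: three noisy copies of one underlying "size" signal.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
size = rng.normal(size=500)                        # common underlying signal
X = np.column_stack([size,
                     size + 0.1 * rng.normal(size=500),
                     size + 0.1 * rng.normal(size=500)])
y = 4.0 * size + 0.5 * rng.normal(size=500)

print("ridge:", np.round(Ridge(alpha=100).fit(X, y).coef_, 2))
# -> three comparable weights, each well below 4: the effect is spread out,
#    so no single column shows its full individual impact.
print("lasso:", np.round(Lasso(alpha=0.5).fit(X, y).coef_, 2))
# -> most of the weight lands on a single column, close to the full effect,
#    while the redundant copies are driven to (roughly) zero.
```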