So we've gone through the coordinate descent algorithm for solving our lasso objective for a specific value of lambda, and that begs the question: how do we choose the lambda tuning parameter value?

Well, it's exactly the same as in ridge regression. If we have enough data, we can think about holding out a validation set and using that to choose amongst these different model complexities lambda. Or, if we don't have enough data, we talked about doing cross validation. So these are two very reasonable options for choosing this tuning parameter lambda.

But in the case of lasso, I just want to mention that using these types of procedures, assessing the error on a validation set or doing cross validation, means choosing the lambda that provides the best predictive accuracy. What that tends to do is choose a lambda value that's a bit smaller than might be optimal for doing model selection, because for predictive accuracy, having a slightly less sparse solution can actually lead to a little bit better predictions on any finite data set than the true model with the sparsest possible set of features. So instead, there are other ways you can choose this tuning parameter lambda, and I'll refer you to other texts, like the textbook by Kevin Murphy, Machine Learning: A Probabilistic Perspective, for further discussion of this issue.

So let's conclude by discussing a few practical issues with lasso. The first is the fact that, as we've seen in multiple ways throughout this module, lasso shrinks the coefficients relative to the least squares solution. What it's doing is increasing the bias of the solution in exchange for lower variance. So it performs this automatic bias-variance tradeoff, but we might still want a low-bias solution, and we can reduce the bias of our solution in the following way. This is called debiasing the lasso solution: we run our lasso solver and get out a set of selected features, those being the features whose weights were not set exactly to zero, and then we take that reduced model, the model with just these selected features, and run standard least squares regression on that reduced model.
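To make these two steps concrete, here is a minimal sketch, not part of the course materials, of choosing lambda by cross validation and then debiasing. It assumes scikit-learn and a made-up feature matrix X and target y; LassoCV (where lambda is called alpha) picks the penalty by cross validation, and the debiasing step refits plain least squares on only the features the lasso kept.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

# Made-up data: X is an (n_samples, n_features) matrix, y the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_w = np.zeros(50)
true_w[:5] = [3.0, -2.0, 1.5, 2.5, 4.0]   # only a few truly relevant features
y = X @ true_w + rng.normal(scale=0.5, size=200)

# Step 1: choose lambda (called alpha in scikit-learn) by cross validation.
lasso = LassoCV(cv=5).fit(X, y)
print("chosen lambda:", lasso.alpha_)

# Step 2 (debiasing): keep only the features whose lasso weights are nonzero ...
selected = np.flatnonzero(lasso.coef_ != 0.0)

# ... and rerun plain least squares on that reduced model.
ols_debiased = LinearRegression().fit(X[:, selected], y)
print("selected features:", selected)
print("debiased weights:", ols_debiased.coef_)
```

The weights the final least squares fit reports for the selected features are no longer shrunk toward zero by the L1 penalty.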
And in this case, what happens is that the features that were deemed relevant to our task have weights, after this debiasing procedure, that are not shrunk relative to the weights of the least squares solution we would have gotten had we started with exactly that reduced model. But, of course, that was the whole point: we didn't know which model to use, so the lasso is allowing us to choose that model, and then we just run least squares on it.

These plots show a little illustration of the benefits of debiasing. The top figure shows the true coefficients for the data, which were generated with 4,096 different coefficients, or different features, in the model, but only 160 of these had nonzero coefficients associated with them. So it's a very sparse setup. If you look at the L1 reconstruction, the second row of the plot, you see that it has discovered 1,024 features with nonzero weights and has a mean squared error of 0.0072. But if you take those 1,024 nonzero-weight features and just run least squares regression on them, you get the third row, which has a significantly lower mean squared error. In contrast, if you were to run least squares on the full model with all 4,096 features, you would get a really poor estimate of what's going on and a very large mean squared error. So this shows the importance of doing both lasso and, possibly, this debiasing on top of it.

Another issue with lasso is that, if you have a collection of strongly correlated features, lasso will tend to select amongst them pretty much arbitrarily. What I mean is that a small tweak in the data might lead to one variable being included, whereas a different tweak of the data would have a different one of these variables included. So in our housing application, you could imagine that square feet and lot size are very correlated, and we might just arbitrarily choose between them, but in a lot of cases you actually want to include the whole set of correlated variables. And another issue is the fact that it's been shown empirically that, in many cases, ridge regression actually outperforms lasso in terms of predictive performance.
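To see that arbitrariness with correlated features in a small, hypothetical example (again just a sketch with scikit-learn and synthetic data): two nearly identical copies of the same underlying feature are generated, and refitting the lasso on different bootstrap resamples of the same data can move the nonzero weight from one copy to the other.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n = 100
base = rng.normal(size=n)
# Two strongly correlated features (think: square feet and lot size).
x1 = base + 0.01 * rng.normal(size=n)
x2 = base + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 2.0 * base + rng.normal(scale=0.3, size=n)

# Refit the lasso on several bootstrap resamples and watch which copy it keeps.
for seed in range(3):
    idx = np.random.default_rng(seed).integers(0, n, size=n)  # bootstrap indices
    w = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    print(f"resample {seed}: weights = {np.round(w, 3)}")
# Often only one of the two weights is nonzero, and which one it is
# can change from resample to resample.
```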
So there are other variants of lasso, such as the elastic net, that try to address this set of issues. What the elastic net does is fuse the objectives of ridge and lasso, including both an L1 and an L2 penalty. And you can see this paper for further discussion of these and other issues with the original lasso objective, and how the elastic net addresses them.
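For reference, one common way to write the elastic net objective, in notation consistent with the lasso and ridge costs used in this course and with a separate tuning parameter for each penalty, is:

```latex
\hat{w}^{\text{EN}}
  = \arg\min_{w} \; \mathrm{RSS}(w)
    + \lambda_1 \lVert w \rVert_1
    + \lambda_2 \lVert w \rVert_2^2
```

Setting lambda_2 = 0 recovers the lasso and setting lambda_1 = 0 recovers ridge regression; with both penalties active, strongly correlated features tend to be kept or dropped together rather than arbitrarily singled out.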