[MUSIC]
Okay, well, maybe we can just take our ridge regression solution, take all the little coefficients, and just say they're 0, just get rid of those. We're gonna call that thresholding them away. We're gonna choose some value where, if the magnitude of the coefficient is below the threshold that we choose, we're just going to say it's not in the model. So, let's explore this idea a little bit.

So here I'm just showing an illustration, a little cartoon, of what the weights might look like on a set of features in our housing application. And I'm choosing some threshold, which is this dashed black line. And if the magnitude exceeds that threshold, then I'm gonna say that feature is in my model. So here, in pink, or fuchsia. Carlos, what color is this?
>> Fuchsia.
>> Fuchsia. This is Carlos's color scheme. He's very attached to it, fuchsia. So in fuchsia, I'm showing the features that have been selected to be in my model after doing this thresholding of my ridge regression coefficients.

It might seem like a reasonable approach, but let's dig into this a little bit more. And in particular, let's look at two very related features. So if you look at this list of features, you see, in green, I've highlighted number of bathrooms and number of showers. These numbers tend to be very, very close to one another, because lots of bathrooms have showers, and as the number of showers grows, the number of bathrooms grows, because you're very unlikely to have a shower that's not in a bathroom.

But what's happened here? Well, our model has included nothing having to do with bathrooms, or showers, or anything of that concept. So that doesn't really make a lot of sense. To me, it seems like something having to do with how many bathrooms are in the house should be a valuable feature to include when I'm assessing the value of the house.

So what's going wrong? Well, what if I hadn't included number of showers? Let's, just for simplicity's sake, treat the number of showers as exactly equivalent to the number of bathrooms. It might not be exactly equivalent, but they're very strongly related. But like I said, for simplicity, let's say they're exactly the same.
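[To make the thresholding idea concrete, here is a minimal sketch in Python using scikit-learn's Ridge. The toy data, the feature count, the alpha, and the threshold value are all made up for illustration and are not part of the course materials.]

```python
# Minimal sketch: fit ridge regression, then threshold small coefficients away.
# The data, alpha, and threshold below are illustrative assumptions only.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))                    # 5 made-up housing features
true_w = np.array([3.0, 0.0, 1.5, 0.0, 0.2])   # only some features matter
y = X @ true_w + rng.normal(scale=0.5, size=n)

model = Ridge(alpha=1.0).fit(X, y)

threshold = 0.5                                 # chosen by eye, like the dashed line in the slide
selected = np.abs(model.coef_) > threshold      # keep features whose weight magnitude exceeds it
print("coefficients:", np.round(model.coef_, 3))
print("kept feature indices:", np.where(selected)[0])
```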
So if I hadn't included number of showers in my model to begin with, in the full model, then when I did my ridge regression, it would have placed the weight that had been on number of showers onto number of bathrooms. Because remember, it's a linear model: we're summing weight times number of bathrooms plus weight times number of showers. So if number of bathrooms equals number of showers, that's equivalent to the sum of those two weights just times number of bathrooms, excluding number of showers from the model.

Okay, so the point here is that if I hadn't included this redundant feature, number of showers, what I see now visually is that number of bathrooms would have been included in my selected model when doing the thresholding.

So, the issue that I'm getting at here is not specific to number of bathrooms and number of showers. It's an issue that arises whenever you have a whole collection, maybe not just two, but a whole set of strongly related features. More formally, statistically, I'll call these strongly correlated features. Then ridge regression is gonna prefer a solution that places a bunch of smaller weights on all the features, rather than one large weight on one of the features. Because remember, the cost under the ridge regression model is the size of each weight squared. And so if you have one really big weight, that's really gonna blow up that cost, that L2 penalty term: a single weight w costs w squared, while splitting it into two weights of w/2 costs only w squared over 2. Whereas the fit of the model is gonna be basically the same whether I distribute the weights over redundant features or put a big weight on just one of them and zeros elsewhere.

So what's gonna happen is I'm gonna get a bunch of these small weights over the redundant features. And if I think about simply thresholding, I'm gonna discard all of these redundant features, whereas one of them, or potentially the whole set, really was relevant to my prediction task.

So hopefully it's clear from this illustration that just taking ridge regression and thresholding out these small weights is not a solution to our feature selection problem. So instead we're left with this question: can we use regularization to directly optimize for sparsity?
[MUSIC]
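[As a rough check on this argument, here is a small sketch, again with made-up data and an arbitrary alpha and threshold rather than the course's own code. It duplicates the bathrooms feature as a stand-in for showers and shows ridge splitting the weight across the two copies, so both copies fall under a threshold that the single feature on its own would have cleared.]

```python
# Sketch of the redundant-feature problem: duplicate one feature (showers == bathrooms)
# and watch ridge split the weight across the copies, pushing both under the threshold.
# Data, alpha, and the threshold are assumptions made up for illustration.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n = 500
bathrooms = rng.integers(1, 5, size=n).astype(float)
showers = bathrooms.copy()             # treat showers as exactly equal to bathrooms
price = 10.0 * bathrooms + rng.normal(scale=1.0, size=n)

# Without the redundant feature: all the weight lands on bathrooms.
w_single = Ridge(alpha=10.0).fit(bathrooms[:, None], price).coef_

# With both copies: ridge prefers two half-sized weights (smaller L2 penalty, same fit).
w_both = Ridge(alpha=10.0).fit(np.column_stack([bathrooms, showers]), price).coef_

print("bathrooms alone:", np.round(w_single, 2))       # roughly [10]
print("bathrooms + showers:", np.round(w_both, 2))      # roughly [5, 5]
print("kept after threshold 7:", np.abs(w_both) > 7)    # both copies get discarded
```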