[MUSIC] Now we have these two terms that we're trying to balance against each other. And there's a parameter, just like in regression, that lets us explore how much emphasis we put on fitting the data versus how much emphasis we put on keeping the magnitude of the coefficients small. We call this parameter lambda, or the tuning parameter, or the magic parameter, or the magic constant.

If you think about it, there are three regimes here for us to explore. When lambda is equal to zero, let's see what happens. When lambda is equal to zero, this problem reduces to just maximizing over w the likelihood only, so only the likelihood term. That means we get the standard maximum likelihood solution, an unpenalized MLE solution. So it's probably not a good idea to set lambda to zero, because then I have these really bad overfitting problems; it does nothing to prevent overfitting.

Now, if I set lambda too large, for example if I set it to infinity, what happens? Well, the optimization becomes the maximum over w of l(w) minus infinity times the norm of the parameters, which means l(w) gets drowned out. All I care about is that infinity term, and so that pushes me to only care about penalizing the parameters: penalizing the coefficients, penalizing w, penalizing large coefficients. That leads to just setting all of the w's equal to zero, everything to zero. That's also not a good idea, because I'm not fitting the data at all; setting all the parameters to zero doesn't do anything useful, it ignores the data.

So the regime we care about is somewhere in between: a lambda between zero and infinity that balances the data fit against the magnitude of the coefficients.

Very good. So we're going to try to find a lambda between zero and infinity that fits our data well. And this process, where we try to find a lambda and fit the data with this L2 penalty, is called L2 regularized logistic regression. In the regression case we called this ridge regression; here it doesn't have a fancy name, it's just L2 regularized logistic regression.
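To make the objective concrete, here is a minimal sketch in plain NumPy of the quantity being described: the log likelihood l(w) minus lambda times the squared L2 norm of the coefficients. This is an illustration, not the course's own code; the function names and the use of +1/-1 labels are assumptions.

import numpy as np

def log_likelihood(w, X, y):
    # Log likelihood of the logistic model; y is assumed to be +1/-1,
    # X has one row per example.
    scores = X @ w
    return np.sum(-np.log(1.0 + np.exp(-y * scores)))

def penalized_objective(w, X, y, lam):
    # The quantity we maximize over w: l(w) - lambda * ||w||^2.
    # lam = 0 recovers plain (unpenalized) MLE; as lam grows, the
    # penalty dominates and pushes the coefficients toward zero.
    return log_likelihood(w, X, y) - lam * np.sum(w ** 2)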
Now, you might ask at this point, how do I pick lambda? Well, if you took the regression course, you should know the answer already. You can't use your training data, because as lambda goes to zero, you're going to fit the training data better, so you're not going to be able to pick lambda that way. And never, ever use your test data. So you either use a validation set, if you have lots of data, or use cross-validation for smaller data sets. In the regression course we covered picking the parameter lambda in the regression setting, and it's the same idea here: use a validation set or use cross-validation, always.

Lambda can be viewed as a parameter that helps us move between a high-variance model and a high-bias model, and to balance the two in terms of the bias-variance tradeoff. When lambda is very large, w goes to zero, so we have large bias and we're not fitting the data very well, but we have low variance: no matter what your data set is, you get the same kind of parameters. In the extreme, when lambda is extremely large, you get zero no matter what data set you have. If lambda is very small, you get a very good fit to the training data, so you have low bias, but you can have very high variance: if the data changes a little bit, you get a completely different decision boundary. So in that sense, lambda controls the bias-variance tradeoff for this regularization setting in logistic regression, or in classification, just like it did in regular regression. [MUSIC]
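As a small illustration of the selection procedure described above, here is a sketch that picks lambda by cross-validation on the training data only, assuming scikit-learn is available. Note that scikit-learn parameterizes the L2 penalty as C = 1/lambda, and the grid of lambda values here is an arbitrary choice for the example, not something prescribed by the course.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def pick_lambda_by_cv(X_train, y_train, lambdas=(1e-4, 1e-2, 1.0, 1e2, 1e4)):
    # Choose lambda using 5-fold cross-validation accuracy on the
    # training set; the test set is never touched.
    best_lam, best_score = None, -np.inf
    for lam in lambdas:
        # Small C means strong regularization (large lambda).
        model = LogisticRegression(penalty='l2', C=1.0 / lam, max_iter=1000)
        score = cross_val_score(model, X_train, y_train, cv=5).mean()
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam

With a large validation set you could instead fit each candidate lambda on the training split and compare accuracy on the held-out validation split; the principle is the same either way.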