[MUSIC] Now we have these two terms that we're trying to balance against each other, and there's going to be a parameter, just like in regression, that lets us explore how much emphasis we put on fitting the data versus how much emphasis we put on keeping the magnitude of the coefficients small. We call this parameter Lambda, the tuning parameter, or the magic parameter, or the magic constant. And if you think about it, there are three regimes here for us to explore.

When Lambda is equal to zero, let's see what happens. The problem reduces to maximizing over w the likelihood term only, which means we get the standard maximum likelihood solution, the unpenalized MLE. That's probably not a good idea, because it leaves us with those really bad overfitting problems; it does nothing to prevent overfitting.

Now, if I set Lambda to be too large, for example infinity, what happens? The optimization becomes the maximum over w of the likelihood l(w) minus infinity times the norm of the coefficients, which means the likelihood term gets drowned out. All I care about is that infinity term, so all the pressure goes into penalizing large coefficients, which leads to setting all of the w's equal to zero. That's not a good idea either: if I set all the parameters to zero, I'm not fitting the data at all, I'm just ignoring it.

So the regime we care about is somewhere in between, a Lambda between zero and infinity that balances the data fit against the magnitude of the coefficients. We're going to try to find that Lambda. This process, fitting the data with this L2 penalty, is called L2-regularized logistic regression. In the regression case we called it ridge regression; here it doesn't have a fancy name, it's just L2-regularized logistic regression. (See the short code sketch further below.)

Now, you might ask at this point, how do I pick Lambda? If you took the regression course, you know the answer already. Don't use your training data: as Lambda goes to zero you fit the training data better and better, so you can't pick Lambda that way. And never, ever use your test data. Instead, use a validation set if you have lots of data, or use cross-validation for smaller data sets. In the regression course we covered picking the parameter Lambda in the regression setting, and the same idea applies here: always use a validation set or cross-validation.

Lambda can also be viewed as a parameter that moves us between a high-variance model and a high-bias model, so we're trying to balance the two in terms of the bias-variance tradeoff. When Lambda is very large, w goes to zero, so we have large bias; we're not fitting the data well, but we have low variance, because no matter what data set you observe you get about the same coefficients. In the extreme, when Lambda is infinite, you get all zeros no matter what data set you have. When Lambda is very small, you get a very good fit to the training data, so you have low bias, but you can have very high variance.
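To make that objective concrete, here is a minimal numpy sketch of the quantity being maximized, assuming the penalty is the squared L2 norm of the coefficients and the labels are coded as 0/1; the function name and the intercept handling are illustrative choices, not part of the lecture.

```python
import numpy as np

def l2_regularized_log_likelihood(w, X, y, lam):
    """Total quality to maximize: log-likelihood minus lam * ||w||^2.

    w   : (d,) coefficient vector
    X   : (n, d) feature matrix (add a constant column yourself if you want an intercept)
    y   : (n,) labels in {0, 1}
    lam : the tuning parameter Lambda (>= 0)
    """
    scores = X @ w
    # Logistic regression log-likelihood in a numerically stable form:
    # sum_i [ y_i * score_i - log(1 + exp(score_i)) ]
    log_lik = np.sum(y * scores - np.logaddexp(0.0, scores))
    # lam = 0        -> plain unpenalized MLE, prone to overfitting
    # lam -> infinity -> penalty dominates, pushing all coefficients to zero
    # 0 < lam < inf   -> balances data fit against coefficient magnitude
    return log_lik - lam * np.sum(w ** 2)
```

Maximizing this over w for a fixed Lambda gives the L2-regularized fit; the three comments mark the three regimes discussed above.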
With a very small Lambda, if the data changes even a little bit, you can get a completely different decision boundary. So in that sense, Lambda controls the bias-variance tradeoff in this regularized setting for logistic regression, and for classification in general, just like it did in regression. [MUSIC]
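As a closing illustration of how Lambda might actually be chosen, here is a hedged sketch using scikit-learn with 5-fold cross-validation; the library, the synthetic data, and the grid of Lambda values are all assumptions for illustration (scikit-learn's C parameter is roughly the inverse of the Lambda used in this lecture), not the course's own tooling.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data for illustration; use your own training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Candidate values of the tuning parameter Lambda.
lambdas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

cv_scores = []
for lam in lambdas:
    # scikit-learn parameterizes the L2 penalty via C, the inverse regularization strength.
    model = LogisticRegression(penalty="l2", C=1.0 / lam)
    # 5-fold cross-validation on the training data only;
    # never touch the test set when picking Lambda.
    cv_scores.append(cross_val_score(model, X, y, cv=5).mean())

best_lambda = lambdas[int(np.argmax(cv_scores))]
print("Lambda chosen by cross-validation:", best_lambda)
```

The same loop works with a held-out validation set instead of cross-validation when you have plenty of data: score each candidate Lambda on the validation set and keep the best one.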