[MUSIC] Now we've seen multiple ways that overfitting can be bad for classification, especially for logistic regression, and how very large parameters can be a really bad thing. So, what we're going to do next is introduce a notion of regularization, just like we did for regression, to penalize these really large parameters in order to get a more reasonable outcome. We're still talking about the same logistic regression model, where we take data, do some feature extraction, and fit the model 1/(1 + e^(-w^T x)). But the quality metric for this machine learning algorithm is going to change to push us away from really large coefficients. In particular, we're going to balance how well we fit the data against the magnitude of the coefficients, so as to avoid these massive coefficients.

In the context of logistic regression, we're balancing two things to measure total quality: the measure of fit, which is the data likelihood, where bigger is better (how well I fit the data), and the magnitude of the coefficients, where coefficients that are too big are problematic. So we have one thing that we want to be big, the likelihood, and another thing we want to be small, the magnitude of the coefficients, and we're going to optimize the measure of fit minus this complexity metric. We want to balance between the two.

So what do those mean? Let's spell that out more clearly in the context of logistic regression. The measure of fit in logistic regression is the data likelihood, and we talked about it quite a bit in the previous module. One little side note that we're going to use in this module: we don't typically optimize the data likelihood directly. We optimize the log of the data likelihood, because that makes the math a lot simpler, and it makes the gradients behave a lot better. In the optional section of the previous module, we talked about this quite a bit and explored it in detail. If you skipped that section, just think of the log as a way to make those numbers less extreme. So the measure of fit is going to be the log of the data likelihood, and we're going to make that log as big as possible.

So the likelihood is the thing we're trying to make big. But at the same time, we're trying to make something small, which is the magnitude of the coefficients. There are different metrics for the magnitude of the coefficients, just like we explored in regression, and there are two that we're going to use in this module. One is the sum of the squares, also called the square of the L2 norm. It's denoted by ||w||_2^2, and it's very simple: the square of the first coefficient, plus the square of the second coefficient, plus the square of the third coefficient, and so on, up to the square of the last coefficient, w_D^2. That's if you use the L2 norm. We can also use the sum of the absolute values, also called the L1 norm, denoted by ||w||_1. Instead of the squares, it's the absolute value of w_0, plus the absolute value of w_1, plus the absolute value of w_2, all the way to the absolute value of the last coefficient. Now, in the regression course we explored these notions quite a bit, but the main reason we take the square or the absolute value is that we want to make sure to penalize highly positive and highly negative coefficients in the same way, so squaring a value, or taking its absolute value, makes the output positive.
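To make these pieces concrete, here is a minimal NumPy sketch (my own illustration, not code from the course) of the model, the log data likelihood for +1/-1 labels, and the two coefficient-magnitude metrics. The function names and the layout of the feature matrix (one row per data point) are assumptions for this sketch.

```python
import numpy as np

def sigmoid(score):
    # P(y = +1 | x, w) = 1 / (1 + e^(-w^T x))
    return 1.0 / (1.0 + np.exp(-score))

def log_likelihood(X, y, w):
    # Log of the data likelihood for labels y in {+1, -1}.
    # Bigger is better: it measures how well the coefficients w fit the data.
    scores = X @ w                       # w^T x for every data point
    return np.sum(np.log(sigmoid(y * scores)))

def l2_penalty(w):
    # Sum of squares: ||w||_2^2 = w_0^2 + w_1^2 + ... + w_D^2
    return np.sum(w ** 2)

def l1_penalty(w):
    # Sum of absolute values: ||w||_1 = |w_0| + |w_1| + ... + |w_D|
    return np.sum(np.abs(w))
```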
We want to make these norms as small as possible. Both of these approaches penalize large coefficients. However, as we saw in the regression course, by using the L1 norm I also get what's called a sparse solution. Sparsity doesn't only play a role in regression; it also plays a role in classification. In this module we're going to explore a little bit of both of these concepts, and we're going to start with the L2 norm, the sum of the squares. So now that we've reviewed these concepts, we can formalize the problem, the quality that we're trying to maximize. I want to maximize, over my choice of parameters w, a trade-off between two things: the log of the data likelihood and some notion of penalty for the magnitude of the coefficients, and we'll start with this L2 penalty. [MUSIC]
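Putting the two pieces together, the quantity we maximize might look like the following sketch, again my own illustration rather than the course's code. The balancing parameter name `l2_weight` (the tuning constant that trades off fit against coefficient magnitude) and the tiny synthetic data are hypothetical.

```python
import numpy as np

def regularized_quality(X, y, w, l2_weight):
    # Total quality = log data likelihood - l2_weight * ||w||_2^2.
    # Labels y are +1/-1; l2_weight balances fit against coefficient magnitude.
    scores = X @ w
    log_lik = np.sum(np.log(1.0 / (1.0 + np.exp(-y * scores))))
    return log_lik - l2_weight * np.sum(w ** 2)

# Tiny made-up example: a larger l2_weight pushes the optimum toward smaller coefficients.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])  # one row per data point
y = np.array([+1, -1, +1])
w = np.array([0.5, 1.0])
print(regularized_quality(X, y, w, l2_weight=0.1))
```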