Up to this point, we've derived our coordinate descent algorithm for least squares, and we've discussed what the variant looks like for lasso. But we didn't derive explicitly why our coordinate descent algorithm for lasso has that soft-thresholding form. For completeness, we're going to include the derivation, but as an optional video. And I really want to emphasize that this video is optional. The material is quite advanced; it's material that's typically taught at the level of a Ph.D. or graduate course. So, watch it at your own risk. We think it's still very interesting if you want to see what's under the hood in the lasso solver, but that's our little word of warning before we delve into this content, because this is an example where what we're going to present goes much more technical than what we're assuming for the rest of this specialization.

Okay, so let's get into optimizing our lasso objective, which we've written here, and we're doing it again one coordinate at a time to derive our coordinate descent algorithm. In this derivation we're going to use unnormalized features, so we're not doing the normalization; this is just our typical h_j appearing here. We're doing that because it's the most general case we can derive, and the normalized case follows directly from this derivation.

To start, let's look at the partial of our residual sum of squares term with respect to w_j. We did this derivation when we derived coordinate descent for least squares, but we did it for normalized features, so let's do it now for unnormalized features. Remember, what we're going to do is pull out this w_j term, so we get a sum from i = 1 to N of h_j(x_i), and I'm going to skip a step, assuming you remember what we did in that previous video, where I separate out the sum over k not equal to j of w_k h_k(x_i) from the w_j h_j(x_i) term. When we do this multiplication, we get -2 sum_{i=1}^N h_j(x_i) (y_i - sum_{k≠j} w_k h_k(x_i)), and here we're going to get plus 2; again, the w_j comes out, and we get w_j times sum_{i=1}^N h_j(x_i)^2.

When we were talking about normalized features, we said the sum of the h_j(x_i)^2 was equal to one. But now we're looking at unnormalized features, so let's define some terms. Even though in this case we're looking at unnormalized features, we're still going to call the first term rho_j. And now we have to define what we're calling this normalizer, and we're going to call it z_j.

Okay, so the result, going back to my blue colors: we get -2 rho_j just like before, but now we get +2 w_j z_j. So, that completes our residual sum of squares term. Now let's turn to our L1 penalty, and this is where things become more complicated. In particular, what's the partial with respect to w_j of the absolute value of w_j? Remember, all the other terms are held fixed, so we're just looking at this w_j component. Well, this is where derivatives of absolute values become problematic. Here, for w_j less than 0, the derivative is equal to -1; here, for w_j greater than 0, the derivative is equal to +1; and here, at w_j = 0, we have this problem point. And we mentioned that instead of thinking about gradients, or these partials, we think about subgradients. Okay, so now let's talk about these subgradients and formalize this a bit.
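As a reference alongside the derivation, here is a minimal sketch in Python of how rho_j and z_j could be computed for unnormalized features. The function name, the feature matrix H (column j holding h_j(x_i)), and the argument layout are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def rho_and_z(H, y, w, j):
    """Compute rho_j and z_j for coordinate j with unnormalized features.

    H : (N, D) array, column j holds h_j(x_i) for each observation i
    y : (N,) array of targets
    w : (D,) array of current coefficients
    """
    # Prediction using every feature except feature j
    prediction_without_j = H @ w - H[:, j] * w[j]
    # rho_j = sum_i h_j(x_i) * (y_i - prediction without feature j)
    rho_j = H[:, j] @ (y - prediction_without_j)
    # z_j = sum_i h_j(x_i)^2, the normalizer (equals 1 for normalized features)
    z_j = H[:, j] @ H[:, j]
    return rho_j, z_j
```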
But let's first mention that gradients can be thought of as lower bounding convex functions. So, if I look at my convex function at some points a and b, and I take the gradient at my point a, that's this tangent plane; this is the gradient of g at a. Then what I have is that, just to be clear with g(a) and g(b) marked here, g(b) is greater than or equal to g(a) plus the gradient of g at a times (b - a). And importantly, gradients are unique at x if the function is differentiable at x, while subgradients generalize this idea of gradients to non-differentiable points. A subgradient is any plane that lower bounds this function. So a subgradient is really a set: the set of all the planes that lower bound the function. We're going to say that v is in our set, which we denote by this curly d of g at a point x, so this is our subgradient of g at x, if we have that g(b) is greater than or equal to g(a) plus, in place of the gradient here, v, times (b - a). So, before, this was the one plane that lower bounded our function, and here this is one of the planes that lower bounds our function. So, this is the definition of a subgradient.

Let's look at it in the context of this absolute value function. What are all the planes that lower bound this absolute value function? Well, all the lines that I'm going to draw lower bound it: there's this line, there's this line, there's this line with positive slope, and I could fill this space here with all these lines. And what are the slopes of these lines? Here we know that the slope is equal to -1, and here the slope is equal to +1. And we see that any line with a slope in the range -1 to 1 is going to lower bound this absolute value function. So, for the absolute value of x, we have that v is in [-1, 1]; the interval -1 to 1 represents our subgradient of the absolute value of x at zero.

So, now that we've had our detour into the world of subgradients, instead of the partial of this L1 term, we can talk about the subgradient of this L1 term, where we get lambda times the subgradient. We already know the subgradient of the absolute value at zero is the range -1 to 1, so lambda times the subgradient of the absolute value is going to be the interval -lambda to lambda when w_j is equal to 0. Remember, the subgradient is defined at that problem point w_j = 0, but in the case where the derivative, or that partial, exists, it's just lambda times the derivative: in the case of w_j less than 0 we had -1, so this is lambda times -1; here we have lambda times the interval -1 to 1; and in the case where we're in the positive half plane we get lambda times +1. Okay, so this is our complete lambda times the subgradient of our L1 term with respect to w_j.

And now that we have the partial of our residual sum of squares term and the subgradient of our L1 term, we can put it all together and get the subgradient of our entire lasso cost with respect to w_j. Here, this part is from our residual sum of squares, whereas this part is from the L1 penalty, or really lambda times the L1 penalty. When we put these things together, we get three different cases, because of the three cases for the L1 term. So, in the first case we get 2 z_j, that normalizer, times w_j, minus 2 rho_j, minus lambda. I won't read all of the different cases.
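For reference, here is one way to write out in full the three cases that the slide shows, using the rho_j and z_j defined earlier; this is just the RSS partial plus lambda times the subgradient of |w_j|:

```latex
\partial_{w_j}\Big[\mathrm{RSS}(\mathbf{w}) + \lambda\|\mathbf{w}\|_1\Big] =
\begin{cases}
2 z_j w_j - 2\rho_j - \lambda & \text{if } w_j < 0,\\[4pt]
[-2\rho_j - \lambda,\; -2\rho_j + \lambda] & \text{if } w_j = 0,\\[4pt]
2 z_j w_j - 2\rho_j + \lambda & \text{if } w_j > 0.
\end{cases}
```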
But this now is our subgradient of the entire lasso cost. Okay, well, let's not get lost in the weeds here. Let's pop up a level and remember that before, we would take the gradient and set it equal to 0, or in the coordinate descent algorithm, we talked about taking the partial with respect to one coordinate and setting that equal to zero to get the update for that dimension. Here, instead of that partial derivative, we take the subgradient and set it equal to zero to get the update for this specific jth coordinate.

So let's do that: we've taken our subgradient and we're setting it equal to 0. Again, we have three different cases to consider. In case 1, where w_j is less than 0, let's solve for w hat j. You get w hat j = (2 rho_j + lambda) / (2 z_j), which I'm just going to rewrite as (rho_j + lambda/2) / z_j; I've multiplied the top and bottom by one-half here. Okay, but to be in this case, to have w hat j less than zero, we need a constraint on rho_j: rho_j has to be less than -lambda/2. Remember, rho_j is that correlation term, and it's something we can compute because it's a function of all the other variables except w_j. So, if rho_j is less than -lambda/2, then we know that w hat j will be less than 0 according to this formula.

Then we get to the second case, which is w_j = 0. In that case we've already solved for w hat j; there's only one possible solution. But in order for that to be the optimum, we know that the subgradient at w_j = 0 has to contain zero; otherwise this case, where w hat j equals zero, would never be an optimum. So, we need this range to contain zero, so that w hat j = 0 is an optimum of our objective. For that to be true, we need the upper end of the interval, -2 rho_j + lambda, to be greater than 0, which is equivalent to saying that rho_j is less than lambda/2, and we need the bottom end of the interval, -2 rho_j - lambda, to be less than zero, which is equivalent to saying rho_j is greater than -lambda/2. If we put these together, this is saying that rho_j is less than lambda/2 and greater than -lambda/2. And actually we can include the endpoints, so it's -lambda/2 less than or equal to rho_j less than or equal to lambda/2.

And our final case, let's just quickly work through it: we get w hat j = (rho_j - lambda/2) / z_j, and in order to have w hat j greater than 0, we need rho_j to be greater than lambda/2.

So, let me just talk through this in the other direction, now that we've done the derivation. If rho_j is less than -lambda/2, then we set w hat j as in the first case. If rho_j is in the interval [-lambda/2, lambda/2], we set w hat j equal to 0. And if rho_j is greater than lambda/2, we set w hat j as in the third case. Okay, so this slide just summarizes what I just said: this is our optimal 1-D optimization for the lasso objective.

So, let's talk about this more general form of the soft-thresholding rule for lasso in the case of unnormalized features. Remember, for normalized features there wasn't this z_j here, and what we ended up with for our least squares solution when lambda was equal to 0 was just this line, w hat j = rho_j. But now, what does our least squares line look like?
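The three cases above fold naturally into a single soft-thresholding update per coordinate. Here is a minimal Python sketch of that rule; the function name is a hypothetical choice, and rho_j, z_j are assumed to come from the helper sketched earlier.

```python
def lasso_coordinate_update(rho_j, z_j, lam):
    """Closed-form 1-D minimizer for coordinate j (soft thresholding)."""
    if rho_j < -lam / 2.0:
        # Case 1: optimum is negative
        return (rho_j + lam / 2.0) / z_j
    elif rho_j > lam / 2.0:
        # Case 3: optimum is positive
        return (rho_j - lam / 2.0) / z_j
    else:
        # Case 2: -lambda/2 <= rho_j <= lambda/2, coefficient set exactly to zero
        return 0.0
```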
Well, again we can just set lambda equal to 0, and we see that this least squares line, w hat least squares, is equal to rho_j / z_j. Remember, z_j is that normalizer, the sum of the squares of all of our feature values, so that number will be positive and it's typically going to be larger than one, potentially much larger than one. So, relative to a line with slope 1, a 45-degree line, this least squares line is typically shrunk more this way in the case of unnormalized features.

And then, when I look at my lasso solution, w hat lasso in this case: in the range from -lambda/2 to lambda/2, I get this thresholding of the coefficients exactly to zero, relative to my least squares solution. And outside that range, the difference between the least squares solution and my lasso solution is that my coefficients are each shrunk by an amount lambda / (2 z_j). Okay. But remember that rho_j here, as compared to when we talked about normalized features, was defined differently: it was defined in terms of our unnormalized features. So, for the same value of lambda that you would use with normalized features, you're getting a different relationship here, a different range where things are set to zero.

So, in summary, we've derived this coordinate descent algorithm for lasso in the case of unnormalized features. In particular, the key insight we had was that instead of just taking the partial of our objective with respect to w_j, we had to take the subgradient of our objective with respect to w_j, and that's what leads to these three different cases, because the gradient itself is defined for every value of w_j except for the one critical point, w_j = 0. But in addition, we also got a lot of insight into how this soft thresholding gives us the sparsity in our lasso solutions.
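To see the whole procedure end to end, here is a minimal sketch of a cyclic coordinate descent loop built from the helpers sketched above. It assumes numpy is imported and those earlier functions are defined, uses a simple convergence check on the largest single-coordinate change, and leaves out intercept handling and any feature preprocessing.

```python
def lasso_coordinate_descent(H, y, lam, tol=1e-6, max_iter=1000):
    """Cycle over coordinates, applying the soft-thresholding update,
    until the largest single-coordinate change falls below tol."""
    N, D = H.shape
    w = np.zeros(D)  # start from all-zero coefficients
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(D):
            rho_j, z_j = rho_and_z(H, y, w, j)
            w_new = lasso_coordinate_update(rho_j, z_j, lam)
            max_change = max(max_change, abs(w_new - w[j]))
            w[j] = w_new
        if max_change < tol:
            break
    return w
```

Running this with a larger lambda drives more rho_j values into the [-lambda/2, lambda/2] interval, so more coefficients land exactly at zero, which is the sparsity effect discussed above.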