Up to this point, we've derived our coordinate descent algorithm for least squares, and we've discussed what the variant looks like for lasso. But we didn't derive explicitly why our coordinate descent algorithm for lasso has that soft-thresholding form. For completeness, we're going to include the derivation, but as an optional video. And I really want to emphasize that this video is optional. The material is quite advanced; it's material that's typically taught at the level of a Ph.D. or graduate course. So, watch it at your own risk. We think it's still very interesting if you want to see what's under the hood in the lasso solver, but that's our little word of warning before we delve into this content, because this is an example where what we're going to present goes much more technical than what we're assuming for the rest of this specialization.

Okay, so let's get into optimizing our lasso objective, which we've written here, and we're doing it again one coordinate at a time to derive our coordinate descent algorithm. In this derivation we're going to use unnormalized features, so we're not doing the normalization; this is just our typical h_j appearing here. We're doing that because it's the most general case we can derive, and the normalized case follows directly from this derivation.

To start, let's look at the partial of our residual sum of squares term with respect to w_j. We did this derivation when we derived coordinate descent for least squares, but we did it for normalized features, so let's do it now for unnormalized features. Remember, what we're going to do is pull out this w_j term, so we get a sum from i = 1 to N of h_j(x_i), and I'm going to skip a step, assuming you remember what we did in that previous video, where I separate out the sum over k not equal to j of w_k h_k(x_i) from the w_j h_j(x_i) term. When we do this multiplication, we get -2 sum_{i=1}^N h_j(x_i) (y_i - sum_{k≠j} w_k h_k(x_i)), and here we're going to get plus 2; again, the w_j comes out, and we get w_j times sum_{i=1}^N h_j(x_i)^2.

When we were talking about normalized features, we said the sum of the h_j(x_i)^2 was equal to one. But now we're looking at unnormalized features, so let's define some terms. Even though in this case we're looking at unnormalized features, we're still going to call the first term rho_j. And now we have to define what we're calling this normalizer, and we're going to call it z_j.

Okay, so the result, going back to my blue colors: we get -2 rho_j just like before, but now we get +2 w_j z_j. So, that completes our residual sum of squares term. Now let's turn to our L1 penalty, and this is where things become more complicated. In particular, what's the partial with respect to w_j of the absolute value of w_j? Remember, all the other terms are held fixed, so we're just looking at this w_j component. Well, this is where derivatives of absolute values become problematic. Here, for w_j less than 0, the derivative is equal to -1; here, for w_j greater than 0, the derivative is equal to +1; and here, at w_j = 0, we have this problem point. And we mentioned that instead of thinking about gradients, or these partials, we think about subgradients. Okay, so now let's talk about these subgradients and formalize this a bit.
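As a reference alongside the derivation, here is a minimal sketch in Python of how rho_j and z_j could be computed for unnormalized features. The function name, the feature matrix H (column j holding h_j(x_i)), and the argument layout are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def rho_and_z(H, y, w, j):
    """Compute rho_j and z_j for coordinate j with unnormalized features.

    H : (N, D) array, column j holds h_j(x_i) for each observation i
    y : (N,) array of targets
    w : (D,) array of current coefficients
    """
    # Prediction using every feature except feature j
    prediction_without_j = H @ w - H[:, j] * w[j]
    # rho_j = sum_i h_j(x_i) * (y_i - prediction without feature j)
    rho_j = H[:, j] @ (y - prediction_without_j)
    # z_j = sum_i h_j(x_i)^2, the normalizer (equals 1 for normalized features)
    z_j = H[:, j] @ H[:, j]
    return rho_j, z_j
```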
But let's first mention that gradients can be thought of as lower bounding convex functions. So, if I look at my convex function at some points a and b, and I take the gradient at my point a, that's this tangent plane; this is the gradient of g at a. Then what I have is that, just to be clear with g(a) and g(b) marked here, g(b) is greater than or equal to g(a) plus the gradient of g at a times (b - a). And importantly, gradients are unique at x if the function is differentiable at x, while subgradients generalize this idea of gradients to non-differentiable points. A subgradient is any plane that lower bounds this function. So a subgradient is really a set: the set of all the planes that lower bound the function. We're going to say that v is in our set, which we denote by this curly d of g at a point x, so this is our subgradient of g at x, if we have that g(b) is greater than or equal to g(a) plus, in place of the gradient here, v, times (b - a). So, before, this was the one plane that lower bounded our function, and here this is one of the planes that lower bounds our function. So, this is the definition of a subgradient.

Let's look at it in the context of this absolute value function. What are all the planes that lower bound this absolute value function? Well, all the lines that I'm going to draw lower bound it: there's this line, there's this line, there's this line with positive slope, and I could fill this space here with all these lines. And what are the slopes of these lines? Here we know that the slope is equal to -1, and here the slope is equal to +1. And we see that any line with a slope in the range -1 to 1 is going to lower bound this absolute value function. So, for the absolute value of x, we have that v is in [-1, 1]; the interval -1 to 1 represents our subgradient of the absolute value of x at zero.

So, now that we've had our detour into the world of subgradients, instead of the partial of this L1 term, we can talk about the subgradient of this L1 term, where we get lambda times the subgradient. We already know the subgradient of the absolute value at zero is the range -1 to 1, so lambda times the subgradient of the absolute value is going to be the interval -lambda to lambda when w_j is equal to 0. Remember, the subgradient is defined at that problem point w_j = 0, but in the case where the derivative, or that partial, exists, it's just lambda times the derivative: in the case of w_j less than 0 we had -1, so this is lambda times -1; here we have lambda times the interval -1 to 1; and in the case where we're in the positive half plane we get lambda times +1. Okay, so this is our complete lambda times the subgradient of our L1 term with respect to w_j.

And now that we have the partial of our residual sum of squares term and the subgradient of our L1 term, we can put it all together and get the subgradient of our entire lasso cost with respect to w_j. Here, this part is from our residual sum of squares, whereas this part is from the L1 penalty, or really lambda times the L1 penalty. When we put these things together, we get three different cases, because of the three cases for the L1 term. So, in the first case we get 2 z_j, that normalizer, times w_j, minus 2 rho_j, minus lambda. I won't read all of the different cases.
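For reference, here is one way to write out in full the three cases that the slide shows, using the rho_j and z_j defined earlier; this is just the RSS partial plus lambda times the subgradient of |w_j|:

```latex
\partial_{w_j}\Big[\mathrm{RSS}(\mathbf{w}) + \lambda\|\mathbf{w}\|_1\Big] =
\begin{cases}
2 z_j w_j - 2\rho_j - \lambda & \text{if } w_j < 0,\\[4pt]
[-2\rho_j - \lambda,\; -2\rho_j + \lambda] & \text{if } w_j = 0,\\[4pt]
2 z_j w_j - 2\rho_j + \lambda & \text{if } w_j > 0.
\end{cases}
```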
But this now is our subgradient of the entire lasso cost. Okay, well, let's not get lost in the weeds here. Let's pop up a level and remember that before, we would take the gradient and set it equal to 0, or in the coordinate descent algorithm, we talked about taking the partial with respect to one coordinate and setting that equal to zero to get the update for that dimension. Here, instead of that partial derivative, we take the subgradient and set it equal to zero to get the update for this specific jth coordinate.

So let's do that: we've taken our subgradient and we're setting it equal to 0. Again, we have three different cases to consider. In case 1, where w_j is less than 0, let's solve for w hat j. You get w hat j = (2 rho_j + lambda) / (2 z_j), which I'm just going to rewrite as (rho_j + lambda/2) / z_j; I've multiplied the top and bottom by one-half here. Okay, but to be in this case, to have w hat j less than zero, we need a constraint on rho_j: rho_j has to be less than -lambda/2. Remember, rho_j is that correlation term, and it's something we can compute because it's a function of all the other variables except w_j. So, if rho_j is less than -lambda/2, then we know that w hat j will be less than 0 according to this formula.

Then we get to the second case, which is w_j = 0. In that case we've already solved for w hat j; there's only one possible solution. But in order for that to be the optimum, we know that the subgradient at w_j = 0 has to contain zero; otherwise this case, where w hat j equals zero, would never be an optimum. So, we need this range to contain zero, so that w hat j = 0 is an optimum of our objective. For that to be true, we need the upper end of the interval, -2 rho_j + lambda, to be greater than 0, which is equivalent to saying that rho_j is less than lambda/2, and we need the bottom end of the interval, -2 rho_j - lambda, to be less than zero, which is equivalent to saying rho_j is greater than -lambda/2. If we put these together, this is saying that rho_j is less than lambda/2 and greater than -lambda/2. And actually we can include the endpoints, so it's -lambda/2 less than or equal to rho_j less than or equal to lambda/2.

And our final case, let's just quickly work through it: we get w hat j = (rho_j - lambda/2) / z_j, and in order to have w hat j greater than 0, we need rho_j to be greater than lambda/2.

So, let me just talk through this in the other direction, now that we've done the derivation. If rho_j is less than -lambda/2, then we set w hat j as in the first case. If rho_j is in the interval [-lambda/2, lambda/2], we set w hat j equal to 0. And if rho_j is greater than lambda/2, we set w hat j as in the third case. Okay, so this slide just summarizes what I just said: this is our optimal 1-D optimization for the lasso objective.

So, let's talk about this more general form of the soft-thresholding rule for lasso in the case of unnormalized features. Remember, for normalized features there wasn't this z_j here, and what we ended up with for our least squares solution when lambda was equal to 0 was just this line, w hat j = rho_j. But now, what does our least squares line look like?
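The three cases above fold naturally into a single soft-thresholding update per coordinate. Here is a minimal Python sketch of that rule; the function name is a hypothetical choice, and rho_j, z_j are assumed to come from the helper sketched earlier.

```python
def lasso_coordinate_update(rho_j, z_j, lam):
    """Closed-form 1-D minimizer for coordinate j (soft thresholding)."""
    if rho_j < -lam / 2.0:
        # Case 1: optimum is negative
        return (rho_j + lam / 2.0) / z_j
    elif rho_j > lam / 2.0:
        # Case 3: optimum is positive
        return (rho_j - lam / 2.0) / z_j
    else:
        # Case 2: -lambda/2 <= rho_j <= lambda/2, coefficient set exactly to zero
        return 0.0
```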
Well, again we can just set lambda equal to 0, and we see that this least squares line, w hat least squares, is equal to rho_j / z_j. Remember, z_j is that normalizer, the sum of the squares of all of our feature values, so that number will be positive and it's typically going to be larger than one, potentially much larger than one. So, relative to a line with slope 1, a 45-degree line, this least squares line is typically shrunk more this way in the case of unnormalized features.

And then, when I look at my lasso solution, w hat lasso in this case: in the range from -lambda/2 to lambda/2, I get this thresholding of the coefficients exactly to zero, relative to my least squares solution. And outside that range, the difference between the least squares solution and my lasso solution is that my coefficients are each shrunk by an amount lambda / (2 z_j). Okay. But remember that rho_j here, as compared to when we talked about normalized features, was defined differently: it was defined in terms of our unnormalized features. So, for the same value of lambda that you would use with normalized features, you're getting a different relationship here, a different range where things are set to zero.

So, in summary, we've derived this coordinate descent algorithm for lasso in the case of unnormalized features. In particular, the key insight we had was that instead of just taking the partial of our objective with respect to w_j, we had to take the subgradient of our objective with respect to w_j, and that's what leads to these three different cases, because the gradient itself is defined for every value of w_j except for the one critical point, w_j = 0. But in addition, we also got a lot of insight into how this soft thresholding gives us the sparsity in our lasso solutions.
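To see the whole procedure end to end, here is a minimal sketch of a cyclic coordinate descent loop built from the helpers sketched above. It assumes numpy is imported and those earlier functions are defined, uses a simple convergence check on the largest single-coordinate change, and leaves out intercept handling and any feature preprocessing.

```python
def lasso_coordinate_descent(H, y, lam, tol=1e-6, max_iter=1000):
    """Cycle over coordinates, applying the soft-thresholding update,
    until the largest single-coordinate change falls below tol."""
    N, D = H.shape
    w = np.zeros(D)  # start from all-zero coefficients
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(D):
            rho_j, z_j = rho_and_z(H, y, w, j)
            w_new = lasso_coordinate_update(rho_j, z_j, lam)
            max_change = max(max_change, abs(w_new - w[j]))
            w[j] = w_new
        if max_change < tol:
            break
    return w
```

Running this with a larger lambda drives more rho_j values into the [-lambda/2, lambda/2] interval, so more coefficients land exactly at zero, which is the sparsity effect discussed above.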