[MUSIC] Now let's see why we get
sparsity in our lasso solutions. And, to do this, let's interpret
the solution geometrically. But first, to set the stage, let's interpret the ridge
regression solution geometrically. And, then we'll get to the lasso. Well, since visualizations
are easier in 2D, let's just look at an example where
we have two features, h0 and h1. So, let me just write this, two features for visualization sake. And what I'm writing in
this green box is my ridge objective simplified just for
having two features, and in this pink box, I'm showing just
the residual sum of squares term. And what we're going to do to start with,
we're gonna make a contour plot for our residual sum of squares
in these two dimensions, w0 by w1, and let's look at this
residual sum of squares term. Where inside this sum
over n observations we're gonna get terms that
look like y squared plus w0 squared, h0 squared plus w1 squared, h1 squared plus all the cross terms. When I finish completing this square here,
and if I think about this. And if I sum over all my observations,
these sums will pass in here. So, if I think of this as a function of
w0 and w1, well, what's this defining? If this is equal to some constant,
which is what a contour plot is doing it's looking at
an objective equal to different values. Well this is an equation of an ellipse. Because I have my two parameters,
w0 and w1, each squared. They're multiplied by some weighting and
then there's some other terms having to do with w0 and w1, no power greater than
squared, that are coming in here, setting it equal to constant. That by definition is an ellipse. Okay, so what I see is that, for my residual sum of square's contour plot,
I'm gonna get a series of ellipses. So I'm highlighting here one ellipse. And what this ellipse is is this is
residual sum of squares of w0 w1 equal to what I'll call constant 1. This next plot is residual sum of squares w0 w1 = sum constant two, which is greater than constant one and
so on. That's what all these curves are,
increasing residual sum of squares. And when I walk around this curve,
it's a level set, it's a set of all things
being equal value. So, if I look at sum w0 w1 pair and I look at some other point,
w0 prime, w1 prime, while both of these points,
here and here, have the same residual sum of squares which is
what I called constant one before. So as I'm walking around this circle, every solution w0 w1 has the exactly
the same residule sum of squares. Okay so hopefully what this
plot is showing is now clear. And now let's talk about, what if I
just minimize residual sum of squares? Well if I just minimize
residual sum of squares, everytime I jump from one of these curves
to the next curve to the next curve, all the way into the smallest curve,
just this dot in the middle, this is going to smaller and
smaller residual sum of squares. So this x here marks the minimum over all possible w0 w1 of
residual sum of squares w0 w1 and what is that? That's our lee square solution so
this is w hat lee squares. Okay, I'm gonna, because this is so
important, I'm gonna highlight it in red. That's what this point is. I don't want to draw
a circle cuz the circle is exactly what all these other
ellipses look like here. So I'm gonna put a little box around this,
and I'll highlight in red this
is w hat lee squares. Okay, so that would be my solution
if I were just minimizing residual sum of squares,
but when I'm looking at my ridge objective,
there's also this L2 penalty. This sum of w0 squared + w1 squared. And what does that look like? So I have w0 squared + w1 squared and
when I'm looking at my contour plot I'm looking at setting
this equal to some constant. And changing the value of that constant
to get these different contours, the different colors that I'm seeing here. Well, what shape is w0 squared
plus w1 squared equal to constant? That is exactly the equation of a circle. So we see that the circle
is centered about zero and so each one of these curves here,
just to be clear, is my two norm of w squared, equal to, let's say constant one. This next circle is the two norm of
w squared equal to sum constant two, which is greater than constant one,
and so on. And again,
if I look at any point w0 w1 and some other point w0 prime w1 prime, these things have the same
norm of the w vector squared. And that's true for
all points around the circle. So let's say I'm just trying
to minimize my two norm. What's the solution to that? Well I'm going to jump down
these contours to my minimum. And my minimum,
let me do it directly in red this time. My minimum, oops,
that didn't switch colors. Directly in red. My minimum is setting w0 and
w1 equal to zero. So this is min/w0 w1, my two norm, which I'll write explicitly
in this 2D case, is w0 squared + w1 squared, and the solution is 0. Okay, so
this would be our ridge regression solution if lambda,
the waiting on this 2 norm, were infinity. We talked about that before. [MUSIC]