[MUSIC] For now, let's assume that somebody gives you the lambda value that we wanna use in our ridge regression objective, and let's discuss algorithmically how we fit this model. So in particular, in this part, we're describing this gray box, our machine learning algorithm. And to start with, just like we've done many times before, we're first gonna rewrite our objective in matrix notation, where we assume we have some N observations and we wanna write jointly what the objective is for all N of these observations.

So let's just recall what we did for our residual sum of squares term, where, for our model, we thought about stacking up all N observations and all the features associated with those N observations in this green matrix. That matrix got multiplied by a vector of our regression coefficients, and then we had this additive noise per observation. So we wrote this matrix notation as y = Hw + epsilon. Then, when we went to form our residual sum of squares, we showed that it was equivalent to the following, where we have (y - Hw) transpose (y - Hw). Okay, so this is our matrix notation for our residual sum of squares term.

But now we wanna do a similar thing for our model complexity penalty, the one we added to our original objective to get our resulting ridge regression objective. So in particular, we want to write this two norm of w squared in vector notation. This two norm of our w vector squared, we said, was equal to w0 squared + w1 squared + w2 squared + all the way up to our Dth feature squared. And this is equivalent to taking our w vector transpose, meaning putting it as a row, and multiplying by the w vector itself, because if we think about doing this multiplication, we're gonna get w0 * w0 + w1 * w1 + w2 * w2, etc., and that's exactly equivalent. And so we can write this as w transpose w, where I'm writing a thick w for the vector. Okay, so this is our vector notation for this model complexity term.

So putting it all together, our ridge regression total cost, for some N observations, can be written as follows, where we have (y - Hw) transpose (y - Hw) + lambda * w transpose w. Okay, now that we have this, we can start doing what we've done in the past, which is take the gradient, and we can think about either setting the gradient to zero to get a closed-form solution, or doing our gradient descent algorithm. And we're gonna walk through both of these approaches now.

So the first step is computing the gradient of this objective. Here I'm just writing exactly what we had on the previous slide, but now with these gradient signs in front. And as we've seen before, the gradient distributes across a sum, so we have the gradient of this first term plus the gradient of our model complexity term, where the first term is our measure of fit, that residual sum of squares term. And we know that the gradient of the residual sum of squares has the following form: -2 H transpose (y - Hw). The question is, what's the gradient of this model complexity term? What we see is that the gradient of this is 2 * w. Why is it 2 * w? Well, instead of deriving it, I'll leave that as a little mini challenge for you guys. It's fairly straightforward to derive, just taking partials with respect to each w component: just write w transpose w as w0 squared + w1 squared + ... all the way up to wD squared, and then take the derivative with respect to just one of the w's.
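To make that concrete, here's a minimal sketch in code, assuming numpy arrays and hypothetical names: H for the N x (D+1) feature matrix, y for the N outputs, w for the coefficient vector, and lam for the lambda we were handed. It just evaluates the total cost (y - Hw) transpose (y - Hw) + lambda * w transpose w and the gradient we just wrote down, -2 H transpose (y - Hw) + lambda * 2w; it isn't the closed-form or gradient descent solver itself, just the two quantities both of those approaches build on.

```python
import numpy as np

def ridge_cost(H, y, w, lam):
    """Ridge total cost: (y - Hw)^T (y - Hw) + lam * w^T w."""
    residual = y - H.dot(w)
    return residual.dot(residual) + lam * w.dot(w)

def ridge_gradient(H, y, w, lam):
    """Gradient of the ridge cost: -2 H^T (y - Hw) + 2 * lam * w."""
    residual = y - H.dot(w)
    return -2 * H.T.dot(residual) + 2 * lam * w

# Tiny usage example with made-up numbers: N = 3 observations,
# an intercept column plus one feature (so D + 1 = 2 coefficients).
H = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 2.5])
w = np.zeros(2)
lam = 0.1
print(ridge_cost(H, y, w, lam), ridge_gradient(H, y, w, lam))
```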
But for now, what I'm gonna do is just draw an analogy to the 1D case, where w transpose w is analogous to just w squared, if w weren't a vector and were instead just a scalar. And what's the derivative of w squared? It's 2w. Okay, so, proof by analogy here. [MUSIC]
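And for reference, the component-wise derivation behind that analogy, the mini challenge from a moment ago, really is just one line in the same w0 through wD notation used above:

```latex
\frac{\partial}{\partial w_j}\, w^\top w
  = \frac{\partial}{\partial w_j}\left(w_0^2 + w_1^2 + \cdots + w_D^2\right)
  = 2 w_j
  \quad\Rightarrow\quad
  \nabla_w \left( w^\top w \right) = 2w .
```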