Okay, so now that all of you are optimization experts, we can think about applying these optimization notions and algorithms that we described to our specific scenario of interest, which is searching over all possible lines and finding the line that best fits our data. The first thing that's important to mention is that our objective is convex. You can go and show that this is a convex function, and what this implies is that the solution to this minimization problem is unique; we know there's a unique minimum to this function. And likewise, we know that our gradient descent algorithm will converge to this minimum. Okay, so this is good news for trying to find the line that best fits our data, because this sounds like a very complicated problem: we have to search over all possible lines, and of course we couldn't possibly go and test each one of them. So it's very, very helpful that this property holds, because it means we can use the very straightforward algorithms we described to go and solve this problem.

Okay, so let's return to the definition of our cost, which is the residual sum of squares of our two parameters, (w0, w1), which I've written again right here. But before we get to taking the gradient, which is gonna be so crucial for our gradient descent algorithm, let's just note the following fact about derivatives: if we take the derivative of a sum of functions with respect to some variable w, then we can write this equivalently. This sigma notation is just shorthand for the full sum over the N terms, so N different functions g1 to gN, and the derivative distributes across the sum, so we can rewrite this as the sum of the derivatives, okay? So the derivative of a sum of functions is the same as the sum of the derivatives of the individual functions. In our case, the gi(w) that I'm writing here is equal to (yi - [w0 + w1 xi]) squared, and we see that the residual sum of squares is indeed a sum over N different functions of w0 and w1. And so when we're thinking about taking the partial of the residual sum of squares with respect to, for example, w0, this is going to be equal to the sum, i equals 1 to N, of the partial with respect to w0 of (yi - [w0 + w1 xi]) squared, and the same holds for w1.

Okay, so now let's go ahead and actually compute this gradient. The first thing we're gonna do is take the derivative, or the partial, with respect to w0. I'm gonna use the fact that I showed on the previous slide to take the sum to the outside, and then take the partial of what's inside. So I have a function raised to a power, so I'm gonna bring that power down: I get a 2 here, I rewrite the function, yi - [w0 + w1 xi], and now the power is just gonna be 1. But then I have to take the derivative of the inside. And what's the derivative of the inside when I'm taking this derivative with respect to w0? Well, what I have is a -1 multiplying w0, and everything else I'm just treating as a constant, so I need to multiply by -1. So I'll just rewrite this: I'm gonna multiply the 2 by this -1 and pull it outside the sum. So I'm gonna get -2 times the sum, i equals 1 to N, of yi - [w0 + w1 xi]. I promise I'll try and minimize how many derivations we're doing in the slides like this, but I think this is really instructive.
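To pin down the steps just described, here is the w0 partial written out in one line; this simply restates the on-slide algebra in the lecture's own notation (yi, xi, w0, w1):

```latex
\frac{\partial}{\partial w_0}\,\mathrm{RSS}(w_0,w_1)
  = \sum_{i=1}^{N} \frac{\partial}{\partial w_0}\bigl(y_i - [w_0 + w_1 x_i]\bigr)^2
  = \sum_{i=1}^{N} 2\bigl(y_i - [w_0 + w_1 x_i]\bigr)\cdot(-1)
  = -2\sum_{i=1}^{N}\bigl(y_i - [w_0 + w_1 x_i]\bigr)
```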
So, let's go ahead now and take the derivative, or the partial, with respect to w1. In this case, again I'm pulling the sum out, and the same thing happens where I bring the 2 down and rewrite the function, the inside part, raised just to the 1 power. And then, when I take the derivative of this inside part with respect to w1, what do I have? Well, all of these things are constants with respect to w1, but what's multiplying w1? I have a -xi, so I'm going to need to multiply by -xi. Okay, so again rewriting this: I take the -2 to the outside of the sum, and I have yi - [w0 + w1 xi]. It's a good thing you guys can put me on 2x speed and just get through all of this very quickly. But here we multiply by xi, and that's the extra term in the derivative with respect to w1.

Okay, well, let's put this all together. And look, I went ahead and typed it out, so I don't need to write it here, that's good. So this is the gradient of our residual sum of squares, and it's a vector of two dimensions because we have two variables, w0 and w1. Now what can we think about doing? Well, of course we can think about doing the gradient descent algorithm. But let's hold off on that, because what do we know is another way to solve for the minimum of this function? Well, just like we talked about in 1D, we can take the derivative and set it equal to zero; that was the first approach for solving for the minimum. Here, we can take the gradient and set it equal to zero.
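And here is the w1 partial together with the full two-dimensional gradient the slide shows, again just restating the algebra from the lecture:

```latex
\frac{\partial}{\partial w_1}\,\mathrm{RSS}(w_0,w_1)
  = -2\sum_{i=1}^{N} x_i\bigl(y_i - [w_0 + w_1 x_i]\bigr),
\qquad
\nabla\,\mathrm{RSS}(w_0,w_1) =
\begin{bmatrix}
  -2\sum_{i=1}^{N} \bigl(y_i - [w_0 + w_1 x_i]\bigr) \\[4pt]
  -2\sum_{i=1}^{N} x_i\bigl(y_i - [w_0 + w_1 x_i]\bigr)
\end{bmatrix}
```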
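The lecture doesn't give code, but a minimal sketch of both approaches may help: compute this gradient and either run gradient descent on it, or set it to zero and solve. The NumPy setup, the function names (rss_gradient, fit_by_gradient_descent, fit_by_setting_gradient_to_zero), the step size, and the synthetic data are all illustrative assumptions, not the course's implementation, and the closed-form algebra is the standard result of carrying out the "set the gradient to zero" step, which the lecture hasn't worked through yet.

```python
import numpy as np

def rss_gradient(w0, w1, x, y):
    """Gradient of RSS(w0, w1) = sum_i (y_i - [w0 + w1*x_i])^2, as derived above."""
    residuals = y - (w0 + w1 * x)
    return np.array([-2.0 * residuals.sum(),           # partial w.r.t. w0
                     -2.0 * (x * residuals).sum()])    # partial w.r.t. w1

def fit_by_gradient_descent(x, y, step_size=5e-4, tolerance=1e-6, max_iters=100_000):
    """Approach 2: repeatedly step opposite the gradient until it is nearly zero."""
    w0, w1 = 0.0, 0.0                                   # arbitrary starting line
    for _ in range(max_iters):
        grad = rss_gradient(w0, w1, x, y)
        if np.linalg.norm(grad) < tolerance:            # near-zero gradient => at the minimum (RSS is convex)
            break
        w0 -= step_size * grad[0]
        w1 -= step_size * grad[1]
    return w0, w1

def fit_by_setting_gradient_to_zero(x, y):
    """Approach 1: set both partials to zero and solve the two equations for w0, w1."""
    w1 = ((x * y).mean() - x.mean() * y.mean()) / ((x ** 2).mean() - x.mean() ** 2)
    w0 = y.mean() - w1 * x.mean()
    return w0, w1

# Tiny check on synthetic data scattered around the line y = 1 + 2x:
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=100)
print(fit_by_gradient_descent(x, y))          # both should come out close to (1.0, 2.0)
print(fit_by_setting_gradient_to_zero(x, y))
```

Because the objective is convex, both routes land on the same unique minimum; the step size just has to be small enough for gradient descent to converge.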