In the previous video, we talked about the form of the hypothesis for linear regression with multiple features, or with multiple variables. In this video, let's talk about how to fit the parameters of that hypothesis. In particular, let's talk about how to use gradient descent for linear regression with multiple features. To quickly summarize our notation, this is our formal hypothesis in multivariable linear regression, where we've adopted the convention that x0 = 1. The parameters of this model are theta0 through theta n, but instead of thinking of this as n+1 separate parameters, which is valid, I'm instead going to think of the parameters as theta, where theta here is an (n+1)-dimensional vector. So I'm just going to think of the parameters of this model as themselves being a vector.

Our cost function is J of theta0 through theta n, which is given by the usual sum of squared errors term. But again, instead of thinking of J as a function of these n+1 numbers, I'm going to more commonly write J as just a function of the parameter vector theta, so that theta here is a vector. Here's what gradient descent looks like. We're going to repeatedly update each parameter theta j according to theta j minus alpha times this derivative term. And once again we just write this as J of theta, so theta j is updated as theta j minus the learning rate alpha times the partial derivative of the cost function with respect to the parameter theta j. Let's see what this looks like when we implement gradient descent and, in particular, let's go see what that partial derivative term looks like.

Here's what we have for gradient descent for the case when we had n = 1 feature. We had two separate update rules for the parameters theta0 and theta1, and hopefully these look familiar to you. And this term here was of course the partial derivative of the cost function with respect to the parameter theta0, and similarly we had a different update rule for the parameter theta1. There's one little difference, which is that when we previously had only one feature, we would call that feature x(i), but now in our new notation we would of course call this x(i)1 to denote our one feature.

So that was for when we had only one feature. Let's look at the new algorithm for when we have more than one feature, where the number of features n may be much larger than one. We get this update rule for gradient descent and, maybe for those of you that know calculus, if you take the definition of the cost function and take the partial derivative of the cost function J with respect to the parameter theta j, you'll find that that partial derivative is exactly the term that I've drawn the blue box around. And if you implement this, you will get a working implementation of gradient descent for multivariate linear regression.
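For readers following the transcript without the slide, the multivariate update rule being described can be written out explicitly (this is just the standard form the narration refers to; the term in the blue box on the slide is the summation):

$$
\text{repeat until convergence: } \quad \theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr) \, x_j^{(i)} \qquad \text{(simultaneously for } j = 0, 1, \dots, n\text{)}
$$

where $h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$ with $x_0 = 1$, the summation is exactly the partial derivative $\frac{\partial}{\partial \theta_j} J(\theta)$, and $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)^2$ is the sum-of-squared-errors cost mentioned above.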
The last thing I want to do on this slide is give you a sense of why these new and old algorithms are sort of the same thing, or why they're both similar algorithms, or why they're both gradient descent algorithms. Let's consider a case where we have two features, or maybe more than two features, so we have three update rules for the parameters theta0, theta1, theta2, and maybe other values of theta as well. If you look at the update rule for theta0, what you find is that this update rule here is the same as the update rule that we had previously for the case of n = 1. And the reason that they are equivalent is, of course, because in our notational convention we had this x(i)0 = 1 convention, which is why these two terms that I've drawn the magenta boxes around are equivalent. Similarly, if you look at the update rule for theta1, you find that this term here is equivalent to the update rule we previously had for theta1, where of course we're just using this new notation x(i)1 to denote our first feature; and now that we have more than one feature, we can have similar update rules for the other parameters like theta2 and so on.

There's a lot going on on this slide, so I definitely encourage you, if you need to, to pause the video and look at all the math on this slide slowly, to make sure you understand everything that's going on here. But if you implement the algorithm written up here, then you have a working implementation of linear regression with multiple features.
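To make "implement the algorithm" a bit more concrete, here is a minimal sketch of batch gradient descent for multivariate linear regression in Python. This is not the course's own code: it assumes the m training examples are stacked into a matrix X whose first column is all ones (the x0 = 1 convention) and the targets into a vector y, and the names gradient_descent, alpha, and num_iters are just illustrative choices.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multivariate linear regression.

    X : (m, n+1) array whose first column is all ones (x0 = 1 convention).
    y : (m,) array of targets.
    Returns the fitted parameter vector theta and the history of J(theta).
    """
    m = len(y)
    theta = np.zeros(X.shape[1])      # theta is an (n+1)-dimensional vector
    cost_history = []

    for _ in range(num_iters):
        errors = X @ theta - y        # h_theta(x^(i)) - y^(i) for every example
        # Simultaneous update of every theta_j:
        #   theta_j := theta_j - alpha * (1/m) * sum_i errors_i * x_j^(i)
        theta = theta - (alpha / m) * (X.T @ errors)
        cost_history.append((1.0 / (2 * m)) * np.sum(errors ** 2))   # J(theta)

    return theta, cost_history

# Example usage on synthetic data with two features plus the x0 = 1 column.
m = 100
rng = np.random.default_rng(0)
features = rng.random((m, 2))
X = np.c_[np.ones(m), features]
y = 4 + 3 * features[:, 0] + 5 * features[:, 1]
theta, costs = gradient_descent(X, y, alpha=0.1, num_iters=2000)
print(theta)   # should approach [4, 3, 5]; the costs should decrease on every iteration
```

Watching the returned cost history decrease on every iteration is a quick sanity check that the learning rate alpha is small enough, which is the same debugging idea used throughout these videos.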