In the previous video, we talked about
the form of the hypothesis for linear
regression with multiple features
or with multiple variables.
In this video, let's talk about how to
fit the parameters of that hypothesis.
In particular, let's talk about how
to use gradient descent for linear
regression with multiple features.
To quickly summarize our notation,
this is the form of the hypothesis for
multivariable linear regression, where
we've adopted the convention that x0=1.
The parameters of this model are theta0
through theta n, but instead of thinking
of these as n+1 separate parameters,
which is valid, I'm going to think of
the parameters as theta, where theta
here is an (n+1)-dimensional vector.
So I'm just going to think of the
parameters of this model
as themselves being a single vector.
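Written out, the hypothesis and the parameter vector I've just described are:

```latex
h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n = \theta^T x,
\qquad
\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} \in \mathbb{R}^{n+1},
\qquad x_0 = 1.
```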
Our cost function is J of theta0 through
theta n, which is given by the usual
sum of squared errors term. But again,
instead of thinking of J as a function
of these n+1 numbers, I'm going to
more commonly write J as just a
function of the parameter vector theta,
so that theta here is a vector.
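Written as a formula, with m denoting the number of training examples as before, that cost function is:

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2
```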
Here's what gradient descent looks like.
We're going to repeatedly update each
parameter theta j according to theta j
minus alpha times this derivative term.
And once again we just write this as
J of theta, so theta j is updated as
theta j minus the learning rate
alpha times the partial derivative
of the cost function with respect
to the parameter theta j.
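In symbols, that update rule is the following, applied simultaneously for every j = 0, 1, ..., n:

```latex
\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta)
```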
Let's see what this looks like when
we implement gradient descent and,
in particular, let's go see what that
partial derivative term looks like.
Here's what we have for gradient descent
for the case when we had n=1 feature.
We had two separate update rules for
the parameters theta0 and theta1, and
hopefully these look familiar to you.
And this term here was of course the
partial derivative of the cost function
with respect to the parameter theta0,
and similarly we had a different
update rule for the parameter theta1.
There's one little difference, which is
that when we previously had only one
feature, we would call that feature x(i),
but now in our new notation
we would of course call this
x(i)1 to denote our one feature.
So that was for when
we had only one feature.
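For reference, those two update rules for the single-feature case were:

```latex
\begin{aligned}
\theta_0 &:= \theta_0 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) \\
\theta_1 &:= \theta_1 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x^{(i)}
\end{aligned}
```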
Let's look at the new algorithm for when
we have more than one feature,
where the number of features n
may be much larger than one.
We get this update rule for gradient
descent and, for those of you who
know calculus, if you take the
definition of the cost function and take
the partial derivative of the cost
function J with respect to the parameter
theta j, you'll find that that partial
derivative is exactly that term that
I've drawn the blue box around.
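Concretely, that partial derivative works out to the blue-boxed sum, giving the update rule (again applied simultaneously for every j = 0, 1, ..., n):

```latex
\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
```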
And if you implement this you will
get a working implementation of
gradient descent for
multivariate linear regression.
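As a concrete illustration, here is a minimal NumPy sketch of that update rule; the function and variable names here are my own choices, and X is assumed to already include the leading column of ones for x0:

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent for multivariate linear regression.

    X         -- (m, n+1) matrix of examples, first column all ones (x0 = 1)
    y         -- (m,) vector of target values
    theta     -- (n+1,) initial parameter vector
    alpha     -- learning rate
    num_iters -- number of gradient descent steps
    """
    m = len(y)
    for _ in range(num_iters):
        errors = X @ theta - y              # h_theta(x^(i)) - y^(i) for every example
        gradient = (X.T @ errors) / m       # partial derivative of J for every theta_j
        theta = theta - alpha * gradient    # simultaneous update of all parameters
    return theta
```

The matrix products X @ theta and X.T @ errors are just the sums over the m training examples in the update rule, computed for all parameters at once, so every theta j is updated simultaneously as required.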
The last thing I want to do on
this slide is give you a sense of
why these new and old algorithms are
essentially the same thing, or why
they're both gradient descent algorithms.
Let's consider a case
where we have two features
or maybe more than two features,
so we have three update rules for
the parameters theta0, theta1, theta2
and maybe other values of theta as well.
If you look at the update rule for
theta0, what you find is that this
update rule here is the same as
the update rule that we had previously
for the case of n = 1.
And the reason that they are
equivalent is, of course,
because in our notation we adopted the
convention that x(i)0 = 1, which is
why these two terms that I've drawn the
magenta boxes around are equivalent.
Similarly, if you look at the update
rule for theta1, you find that
this term here is equivalent to
the update rule we previously had
for theta1,
where of course we're just using
this new notation x(i)1 to denote
our first feature, and now that we have
more than one feature we can have
similar update rules for the other
parameters like theta2 and so on.
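Writing out the first two of those update rules makes the correspondence explicit; the only difference from the single-feature case is the extra factor x0^(i), which is always 1:

```latex
\begin{aligned}
\theta_0 &:= \theta_0 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_0^{(i)}
&&\text{(with } x_0^{(i)} = 1\text{)} \\
\theta_1 &:= \theta_1 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_1^{(i)}
\end{aligned}
```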
There's a lot going on on this slide
so I definitely encourage you,
if you need to, to pause the video
and look at all the math on this slide
slowly to make sure you understand
everything that's going on here.
But if you implement the algorithm
written up here then you have
a working implementation of linear
regression with multiple features.