In the previous video, we gave a mathematical definition of gradient descent. In this video, let's delve deeper and get better intuition about what the algorithm is doing, and why the steps of the gradient descent algorithm might make sense. Here's the gradient descent algorithm that we saw last time. Just to remind you, this parameter alpha is called the learning rate, and it controls how big a step we take when updating the parameter theta j. And this second term here is the derivative term. What I want to do in this video is give you better intuition about what each of these two terms is doing and why, when put together, this entire update makes sense.

In order to convey these intuitions, I want to use a slightly simpler example, where we minimize a function of just one parameter. So say we have a cost function J of just one parameter, theta one, like we did a few videos back, where theta one is a real number. That way we get 1D plots, which are a little bit simpler to look at. And let's try to understand what gradient descent would do on this function.

So let's say here's my function J of theta one, where theta one is a real number. Now let's say I've initialized gradient descent with theta one at this location, so imagine that we start off at that point on the function. What gradient descent will do is update theta one as theta one minus alpha times d/d theta one of J of theta one. And just as an aside about this derivative term: if you're wondering why I changed the notation from the partial derivative symbol, or if you don't know the difference between the partial derivative symbol and d/d theta, don't worry about it. Technically in mathematics we call one a partial derivative and the other a derivative, depending on the number of parameters in the function J, but that's a mathematical technicality. For the purposes of this lecture, think of the partial symbol and d/d theta one as exactly the same thing, and don't worry about whether there are any differences. I'm going to try to use the mathematically precise notation, but for our purposes these notations are really the same thing.

So let's see what this equation will do. We're going to compute this derivative. I'm not sure if you've seen derivatives in calculus before, but what the derivative at this point does is basically this: take the tangent to the function at that point, that straight red line just touching the function, and look at the slope of that line. That's what the derivative is. It says: what's the slope of the line that is tangent to the function at that point? And the slope of that line is, of course, just the height divided by the horizontal distance. Now, this line has a positive slope, so it has a positive derivative. And so my update to theta one is going to be theta one minus alpha times some positive number. Alpha, the learning rate, is always a positive number. So I'm going to update theta one as theta one minus something, which means I end up moving theta one to the left; I'm going to decrease theta one. And we can see this is the right thing to do, because moving in that direction gets me closer to the minimum over there.
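To make this concrete, here is a minimal Python sketch of a single gradient descent step on a one-parameter cost function. The quadratic J, the starting point, and the learning rate are illustrative choices of mine, not values fixed by the lecture:

```python
# A minimal sketch of one gradient descent step on a one-parameter cost
# function. The quadratic J, the start point, and alpha are illustrative
# choices, not values from the lecture.

def J(theta1):
    return (theta1 - 2.0) ** 2        # example cost, minimized at theta1 = 2

def dJ(theta1):
    return 2.0 * (theta1 - 2.0)       # derivative: slope of the tangent line

alpha = 0.1                           # learning rate, always a positive number
theta1 = 5.0                          # initialize to the right of the minimum

slope = dJ(theta1)                    # positive slope here (6.0)
theta1 = theta1 - alpha * slope       # subtract a positive number: move left
print(theta1, J(theta1))              # 4.4, and J(4.4) = 5.76 < J(5.0) = 9.0
```

One step moves theta one to the left and lowers the cost, exactly the behavior described above.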
So gradient descent so far seems to be doing the right thing. Let's look at another example, with the same function J of theta one, and now let's say I had instead initialized my parameter over there on the left, so theta one is here, at that point on the curve. Now my derivative term, d/d theta one of J of theta one, when evaluated at this point, is the slope of the tangent line at that point. But this line is slanting down, so it has negative slope; or alternatively, the function has a negative derivative, meaning a negative slope, at that point. So the derivative is less than zero, and when I update, theta one gets updated as theta one minus alpha times a negative number. Subtracting a negative number means I'm adding something to theta one, so I'm actually going to increase theta one. We'll start here and increase theta one, which again seems like the thing I want to do to get closer to the minimum. So this hopefully explains the intuition behind what the derivative term is doing.

Let's next take a look at the learning rate alpha, and try to figure out what that's doing. So here's my gradient descent update rule, and let's look at what can happen if alpha is either too small or too large. First, what happens if alpha is too small? Here's my function J of theta; let's just start here. If alpha is too small, then I'm multiplying the derivative by some small number, so I end up taking a little baby step like that. Then from this new point I take another step, but since alpha is too small, it's another little baby step. So if my learning rate is too small, I end up taking these tiny, tiny baby steps to try to get to the minimum, and I'm going to need a lot of steps. If alpha is too small, gradient descent can be slow, because it takes these tiny baby steps and needs a lot of them before it gets anywhere close to the global minimum.

Now, what if alpha is too large? Here's my function J of theta again. It turns out that if alpha is too large, gradient descent can overshoot the minimum, and may fail to converge, or even diverge. Here's what I mean. Let's say I start gradient descent at a point that's actually close to the minimum, so the update should move theta only slightly to the right. But if alpha is too big, I'm going to take a huge step, maybe a huge step like that, so I land way over there. Now my cost function has gotten worse: it started off at this value, and now my value is higher. Now my derivative points to the left, telling me to decrease theta. But with a learning rate that's too big, I may take another huge step, going from here all the way out there. And if my learning rate stays too big, I take another huge step on the next iteration, and kind of overshoot and overshoot and so on, until you notice I'm actually getting further and further away from the minimum. So if alpha is too large, gradient descent can fail to converge, or even diverge.

Now, I have another question for you, and this is a tricky one.
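Here is a small numerical sketch of those two failure modes, using J(theta) = theta squared as an illustrative cost of my own choosing, not one from the video. Since dJ/dtheta = 2 * theta, the update shrinks theta when alpha is small enough, and blows it up when alpha is too big:

```python
# Illustrative example (not from the lecture): effect of the learning
# rate on gradient descent applied to J(theta) = theta^2, minimum at 0.

def run(alpha, theta=1.0, steps=10):
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta   # dJ/dtheta = 2 * theta
    return theta

print(run(alpha=0.01))   # too small: ~0.82 after 10 steps, painfully slow
print(run(alpha=0.5))    # well chosen: lands on the minimum at 0
print(run(alpha=1.5))    # too large: (-2)^10 = 1024, diverging
```

With alpha = 1.5, each update flips theta's sign and doubles its magnitude, which is exactly the overshoot-and-diverge behavior just described.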
And when I was first learning this stuff, it actually took me a long time to figure it out. What if your parameter theta one is already at a local minimum? What do you think one step of gradient descent will do? So let's suppose you initialize theta one at a local minimum; suppose this is your initial value of theta one over here, and it's already at a local optimum, the local minimum. It turns out that at a local optimum, your derivative is equal to zero, since the tangent line at that point is horizontal; the slope of that line, and thus this derivative term, is equal to zero. And so in your gradient descent update, theta one gets updated as theta one minus alpha times zero. What this means is that if you're already at a local optimum, the update leaves theta one unchanged: it just sets theta one equal to theta one. So if your parameter is already at a local minimum, one step of gradient descent does absolutely nothing. It doesn't change the parameter, which is what you want, because it keeps your solution at the local optimum.

This also explains why gradient descent can converge to a local minimum even with the learning rate alpha held fixed. Here's what I mean; let's look at an example. So here's a cost function J of theta that I want to minimize, and let's say I initialize my gradient descent algorithm out there at that magenta point. If I take one step of gradient descent, maybe it takes me to that green point, because my derivative is pretty steep out there. Now I'm at the green point, and if I take another step of gradient descent, you notice that my derivative, meaning the slope, is less steep at the green point than at the magenta point, because as I approach the minimum, my derivative gets closer and closer to zero. So after one step of gradient descent, my new derivative is a little bit smaller, and when I take another step, I naturally take a somewhat smaller step from the green point than I did from the magenta point. Now I'm at the new point, the red point, even closer to the global minimum, so the derivative here is even smaller than it was at the green point. When I take another step of gradient descent, my derivative term is even smaller, so the magnitude of the update to theta one is even smaller, and I take a small step like so. As gradient descent runs, you automatically take smaller and smaller steps, until eventually you're taking very small steps and you converge to the local minimum.

So, just to recap: in gradient descent, as we approach a local minimum, gradient descent will automatically take smaller steps. That's because, by definition, a local minimum is where the derivative equals zero, so as we approach the local minimum, this derivative term automatically gets smaller, and gradient descent automatically takes smaller steps. That's what gradient descent looks like, and so there's actually no need to decrease alpha over time.
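The sketch below illustrates that shrinking-step behavior numerically, again on my own illustrative example J(theta) = theta squared; nothing here beyond the fixed-alpha idea comes from the lecture itself:

```python
# Illustrative example (not from the lecture): with a FIXED learning rate,
# the steps shrink on their own, because the derivative of J(theta) = theta^2
# shrinks as theta approaches the minimum at 0.

alpha = 0.2
theta = 4.0                       # start far from the minimum

for i in range(6):
    step = alpha * 2.0 * theta    # dJ/dtheta = 2 * theta, so the step scales with theta
    theta = theta - step
    print(f"step {i}: moved {step:+.4f}, theta = {theta:.4f}")
# Printed steps: 1.6, 0.96, 0.576, ... smaller every time, though alpha never changes.
```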
So that's the gradient descent algorithm, and you can use it to try to minimize any cost function J, not just the cost function J we defined for linear regression. In the next video, we're going to take the function J and set it back to be exactly linear regression's cost function, the squared error cost function that we came up with earlier. Taking gradient descent and the squared error cost function and putting them together will give us our first learning algorithm: our linear regression algorithm.