In the last video, we talked about gradient descent for minimizing the cost function J of theta for logistic regression. In this video, I'd like to tell you about some advanced optimization algorithms and some advanced optimization concepts. Using some of these ideas, we'll be able to get logistic regression to run much more quickly than is possible with gradient descent. And this will also let the algorithms scale much better to very large machine learning problems, such as problems with a very large number of features.

Here's an alternative view of what gradient descent is doing. We have some cost function J, and we want to minimize it. So what we need to do is write code that can take as input the parameters theta and compute two things: J of theta, and the partial derivative terms of J with respect to theta j, for j equals 0, 1, up to n. Given code that can do these two things, what gradient descent does is repeatedly perform the following update: theta j gets updated to theta j minus alpha times the partial derivative of J with respect to theta j, simultaneously for every j. So, given the code that we wrote to compute these partial derivatives, gradient descent plugs those derivative terms into this update and uses them to update our parameters theta. Another way of thinking about gradient descent, then, is that we supply code to compute J of theta and these derivatives, and these get plugged into gradient descent, which then tries to minimize the function for us. For gradient descent, I guess technically you don't actually need code to compute the cost function J of theta; you only need code to compute the derivative terms. But if you think of your code as also monitoring convergence or some such, we'll just think of ourselves as providing code to compute both the cost function and the derivative terms.

So, having written code to compute these two things, one algorithm we can use is gradient descent. But gradient descent isn't the only algorithm we can use. There are other, more advanced, more sophisticated algorithms that, if we only provide them a way to compute these two things, will use different approaches to optimize the cost function for us. Conjugate gradient, BFGS, and L-BFGS are examples of more sophisticated optimization algorithms that need a way to compute J of theta, need a way to compute the derivatives, and can then use more sophisticated strategies than gradient descent to minimize the cost function. The details of exactly what these three algorithms do are well beyond the scope of this course; in fact, you often end up spending many days, or a small number of weeks, studying these algorithms if you take a class in advanced numerical computing. But let me just tell you about some of their properties.

These three algorithms have a number of advantages. One is that, with any of these algorithms, you usually do not need to manually pick the learning rate alpha. So one way to think of these algorithms is that, given a way to compute the derivatives and the cost function, you can think of them as having a clever inner loop. And, in fact, they have a clever inner loop called a line search algorithm that automatically tries out different values for the learning rate alpha and automatically picks a good one; it can even pick a different learning rate for every iteration, and so you don't need to choose it yourself.
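To make this division of labor concrete: you supply code that computes J of theta and the gradient, and gradient descent's outer loop just applies the update above, while the more advanced methods replace that loop with something smarter, like a line search. Here is a minimal sketch of the gradient descent version in Octave; the function and variable names are mine, for illustration, and not from the video.

```octave
% Minimal gradient descent, given user-supplied code that returns
% both J(theta) and the vector of partial derivatives.
% costFunction, alpha, and numIters are illustrative names.
function theta = gradientDescent(costFunction, theta, alpha, numIters)
  for i = 1:numIters
    [jVal, gradient] = costFunction(theta);  % user code computes both terms
    theta = theta - alpha * gradient;        % simultaneous update for every theta_j
    % jVal could be used here to monitor convergence
  end
end
```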
These algorithms actually do more sophisticated things than just pick a good learning rate, and so they often end up converging much faster than gradient descent, but a detailed discussion of exactly what they do is beyond the scope of this course. In fact, I've actually used these algorithms for a long time, maybe over a decade, quite frequently, and it was only a few years ago that I actually figured out for myself the details of what conjugate gradient, BFGS, and L-BFGS do. So it is actually entirely possible to use these algorithms successfully and apply them to lots of different learning problems without actually understanding the inner loop of what they do.

If these algorithms have a disadvantage, I'd say that the main disadvantage is that they're quite a lot more complex than gradient descent. And in particular, you probably should not implement these algorithms - conjugate gradient, BFGS, L-BFGS - yourself unless you're an expert in numerical computing. Instead, just as I wouldn't recommend that you write your own code to compute square roots of numbers or to compute inverses of matrices, what I would recommend you do for these algorithms as well is just use a software library. You know, to take a square root, what all of us do is use some function that someone else has written to compute square roots of our numbers. And fortunately, Octave and the closely related language MATLAB, which we'll be using, have a pretty reasonable library implementing some of these advanced optimization algorithms. And so if you just use the built-in library, you get pretty good results. I should say that there is a difference between good and bad implementations of these algorithms. So, if you're using a different language for your machine learning application - C, C++, Java, and so on - you might want to try out a couple of different libraries to make sure that you find a good one, because there is a difference in performance between a good implementation of conjugate gradient or L-BFGS and a less good implementation of conjugate gradient or L-BFGS.

So now let me explain how to use these algorithms, and I'm going to do so with an example. Let's say that you have a problem with two parameters, theta 1 and theta 2, and let's say your cost function is J of theta equals (theta 1 minus 5) squared plus (theta 2 minus 5) squared. So with this cost function, if you want to minimize J of theta as a function of theta, the values that minimize it are going to be theta 1 equals 5 and theta 2 equals 5. Now, again, I know some of you know more calculus than others, but the derivatives of the cost function J turn out to be these two expressions: the partial derivative with respect to theta 1 is 2 times (theta 1 minus 5), and the partial derivative with respect to theta 2 is 2 times (theta 2 minus 5). I've done the calculus. So if you want to apply one of the advanced optimization algorithms to minimize this cost function J - say we didn't know the minimum was at (5, 5), but we wanted an algorithm to find the minimum numerically, using something like gradient descent but preferably more advanced than gradient descent - what you would do is implement an Octave function like this. We implement a function costFunction(theta), and what it does is return two arguments. The first, jVal, is how we compute the cost function J. And so this says jVal equals (theta 1 minus 5) squared plus (theta 2 minus 5) squared.
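Written out as Octave code, the function described here looks roughly like this, matching the cost and the two derivative expressions above (the gradient part is discussed next):

```octave
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;  % J(theta)
  gradient = zeros(2, 1);                      % 2 by 1 vector
  gradient(1) = 2 * (theta(1) - 5);            % partial derivative w.r.t. theta 1
  gradient(2) = 2 * (theta(2) - 5);            % partial derivative w.r.t. theta 2
end
```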
So it's just computing this cost function over here. And the second argument that this function returns is gradient. Gradient is going to be a 2 by 1 vector, and the two elements of the gradient vector correspond to the two partial derivative terms over here. Having implemented this cost function, you can then call the advanced optimization function fminunc - it stands for "function minimization unconstrained" in Octave - and the way you call it is as follows. You set a few options: options is a data structure that stores the options you want. Setting GradObj to on sets the gradient objective parameter to on; it just means you are indeed going to provide a gradient to this algorithm. I'm going to set the maximum number of iterations to, let's say, one hundred. We're going to give it an initial guess for theta, a 2 by 1 vector. And then this command calls fminunc, where the @ symbol represents a pointer to the costFunction that we just defined up there. And if you call this, it will use one of the more advanced optimization algorithms - if you want, you can think of it as just like gradient descent, but automatically choosing the learning rate alpha so you don't have to do so yourself; like gradient descent on steroids - to try to find the optimal value of theta for you.

Let me actually show you what this looks like in Octave. So I've written this costFunction(theta) function exactly as we had it on the previous slide. It computes jVal, which is the cost function, and it computes the gradient, with the two elements being the partial derivatives of the cost function with respect to the two parameters, theta 1 and theta 2. Now let's switch to my Octave window, and I'm going to type in those commands I had just now. So, options equals optimset - this is the notation for setting the options for my optimization algorithm - with GradObj on and MaxIter 100. So that says 100 iterations, and I am going to provide the gradient to my algorithm. Then initialTheta equals zeros(2, 1); that's my initial guess for theta. And now I can call [optTheta, functionVal, exitFlag] equals fminunc, with the @ pointer to the cost function, my initial guess, and the options, like so. And if I hit enter, this will run the optimization algorithm, and it returns pretty quickly. This funny formatting is just because my command line had wrapped around. But what this says is that, numerically - think of it as gradient descent on steroids - it found that the optimal value of theta is theta 1 equals 5, theta 2 equals 5, exactly as we were hoping for. The function value at the optimum is essentially 10 to the minus 30, so that's essentially zero, which is also what we were hoping for. And the exit flag is 1; this shows the convergence status, and if you want, you can do help fminunc to read the documentation for how to interpret the exit flag. The exit flag lets you verify whether or not the algorithm has converged. So that's how you run these algorithms in Octave.
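Collected in one place, the commands from that Octave session look roughly like this, with costFunction as defined above:

```octave
options = optimset('GradObj', 'on', 'MaxIter', 100);  % provide gradient; at most 100 iterations
initialTheta = zeros(2, 1);                           % initial guess for theta
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
% optTheta comes back as [5; 5], functionVal is essentially zero,
% and an exitFlag of 1 indicates convergence.
```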
I should mention, by the way, that for this Octave implementation, your parameter vector theta must be in R^d for d greater than or equal to 2. So if theta is just a real number - if it is not at least a two-dimensional vector, or some higher-dimensional vector - this fminunc may not work; in case you have a one-dimensional function to optimize, you can look in the Octave documentation for fminunc for additional details.

So, that's how we optimized our trial example, this simple quadratic cost function. How do we apply this to logistic regression? In logistic regression, we have a parameter vector theta, and I'm going to use a mix of Octave notation and sort of math notation, but I hope this explanation will be clear. Our parameter vector theta comprises the parameters theta 0 through theta n, but because Octave indexes vectors starting from 1 rather than from 0, theta 0 is actually written theta(1) in Octave, theta 1 is written theta(2), and so on, up to theta n, which is written theta(n+1). So what we need to do then is write a cost function that captures the cost function for logistic regression. Concretely, the cost function needs to return jVal, so you need some code to compute J of theta, and we also need to give it the gradient: gradient(1) is going to be some code to compute the partial derivative with respect to theta 0, the next element is the partial derivative with respect to theta 1, and so on. Once again, this is gradient(1), gradient(2), and so on, rather than gradient(0), gradient(1), because Octave indexes vectors starting from one rather than from zero. But the main concept I hope you take away from this slide is that what you need to do is write a function that returns the cost function value and returns the gradient. And so in order to apply this to logistic regression, or even to linear regression if you want to use these optimization algorithms for linear regression, what you need to do is plug in the appropriate code to compute these things over here; a sketch of what that might look like appears at the end of this video.

So, now you know how to use these advanced optimization algorithms. Because for these algorithms you're using a sophisticated optimization library, it makes the code just a little bit more opaque, and so maybe a little bit harder to debug. But because these algorithms often run much faster than gradient descent, quite typically, whenever I have a large machine learning problem, I will use these algorithms instead of gradient descent. And with these ideas, hopefully, you'll be able to get logistic regression and also linear regression to work on much larger problems. So, that's it for advanced optimization concepts. And in the next and final video on logistic regression, I want to tell you how to take the logistic regression algorithm that you already know about and make it work also on multi-class classification problems.
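Finally, as promised above, here is a minimal sketch of what that logistic regression cost function might look like in Octave. The names X (an m by (n+1) design matrix with a leading column of ones) and y (an m by 1 vector of 0/1 labels) are my assumptions, not anything defined in this video; the vectorized formulas are the standard logistic regression cost and gradient.

```octave
% A sketch, assuming X is the m x (n+1) design matrix (first column all ones)
% and y is the m x 1 vector of 0/1 labels.
function [jVal, gradient] = costFunction(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                          % sigmoid hypothesis
  jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));   % logistic regression cost J(theta)
  gradient = (1/m) * X' * (h - y);                         % (n+1) x 1 vector of partial derivatives
end

% Hooked up to fminunc via an anonymous function that fixes X and y:
% options = optimset('GradObj', 'on', 'MaxIter', 100);
% initialTheta = zeros(size(X, 2), 1);
% [optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(t, X, y), initialTheta, options);
```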