[MUSIC] So in these algorithms, I said that there's a stepsize, which we denoted as eta, and you might be wondering, well, how do you choose that stepsize? There are lots of different choices you can make here, and clearly they're very important: the stepsize determines how much you're changing your W at every iteration of this algorithm.

One choice is something called a fixed or constant stepsize, where, as the name implies, you just set eta equal to some constant, for example 0.1. But what can happen is that you jump around quite a lot. Let's say I start far away, where there's a really big gradient. I take a very big step, and I keep taking these big steps, and then I end up jumping over the optimum, and then I jump back, and I keep going back and forth, and I converge very slowly to the optimum itself. An analogy here is a snowboarder in a half-pipe: they keep riding up one side and back down the other, and slowly the motion decays until they settle into the groove of the half-pipe. But you can imagine going back and forth quite a number of times. I should mention, though, that this fixed stepsize setting will work for a lot of the cases we're going to look at in this module, because in most of these cases the functions are going to be what's called strongly convex. That's just a really well-behaved convex function: it's very clearly convex, and it never flattens out very much. In those settings you will converge using a fixed stepsize, it just might take a little while.

On the other hand, a common choice is to decrease the stepsize as the number of iterations increases. This means there's a stepsize schedule that you need to set, and common choices are to make the stepsize at iteration t equal to alpha over t, or alpha over the square root of t. If you plot either of these as a function of the iteration t, you see a curve where the stepsize decreases with the number of iterations. So why does this make sense? Well, when I start out, I'm typically assuming that I'm far, or at least decently far, from the optimum, so I want to take large steps. But as I run the algorithm, hopefully I'm getting closer, and I want to home in more rapidly on the optimum. So that's why I start decreasing the stepsize: initially maybe I still take a really large jump, and maybe I'll jump across the minimum once, but then I'm gonna home in on it much more rapidly. One thing you have to be careful about, though, is not decreasing the stepsize too rapidly, because then you'll again take a while to converge, since you're just taking very, very small steps. So in summary, choosing your stepsize is a bit of an art, but here we've gone through a couple of examples of choices that are commonly used out there.
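As a concrete illustration of these two options, here is a minimal sketch in Python (not from the lecture itself) of gradient descent on a toy one-dimensional objective, run once with a fixed stepsize and once with a decreasing schedule eta_t = alpha / sqrt(t). The objective g(w) = (w - 3)^2, the starting point, and the constants 0.1 and 0.3 are made-up values chosen purely for illustration.

```python
import math

def gradient_descent_1d(dg, w0, eta_schedule, num_iters=100):
    """Minimize a 1-D function given its derivative dg, starting from w0,
    taking stepsize eta_schedule(t) at iteration t."""
    w = w0
    for t in range(1, num_iters + 1):
        w = w - eta_schedule(t) * dg(w)
    return w

# Toy objective: g(w) = (w - 3)^2, whose minimum is at w = 3.
dg = lambda w: 2 * (w - 3)

# Fixed stepsize: eta_t = 0.1 for every iteration.
w_fixed = gradient_descent_1d(dg, w0=10.0, eta_schedule=lambda t: 0.1)

# Decreasing schedule: eta_t = alpha / sqrt(t), with alpha = 0.3.
w_decay = gradient_descent_1d(dg, w0=10.0, eta_schedule=lambda t: 0.3 / math.sqrt(t))

print(w_fixed, w_decay)  # both end up close to the minimizer w = 3
```

Because this toy objective is strongly convex, both runs settle near the minimizer; the difference between the schedules shows up mainly in how quickly the iterates stop bouncing back and forth.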
Okay, so the other part of the algorithm that we left unspecified is the part where we say, while not converged. How are we going to assess convergence? Well, we know that the optimum occurs when the derivative of the function is equal to 0. But what we're going to see in practice is that the derivative gets smaller and smaller and smaller, it gets very, very small, but it won't be exactly 0. At some point we're going to want to say, okay, that's good enough, we're close enough to the optimum, I'm gonna terminate this algorithm and say that this is my solution. So what we need to specify is a condition on the absolute value of the derivative, because I don't care whether I'm a little bit to the right or a little bit to the left of the optimum, just how far the derivative is from 0. If that absolute value is less than some epsilon, a threshold I'm setting, then I'm gonna terminate the algorithm and return whatever solution I have, W at iteration t. In practice, we're just gonna choose epsilon to be very small, and I wanna emphasize that what very small means depends on the data you're looking at, the form of this function, and the range of gradients we might expect. It also helps to look at a plot of the value of the function over iterations: you'll tend to see the value decrease and then basically stop changing very much, and at that point you know that you've converged. [MUSIC]
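To make that stopping rule concrete, here is a minimal sketch, again on the same toy quadratic as above, of gradient descent that terminates once the magnitude of the derivative drops below a tolerance epsilon. The particular values of eta, epsilon, and max_iters are illustrative assumptions, not recommendations; as noted above, a sensible epsilon depends on your data and the scale of the gradients you expect.

```python
def gradient_descent_until_converged(dg, w0, eta=0.1, epsilon=1e-6, max_iters=10000):
    """Run fixed-stepsize gradient descent until |dg(w)| < epsilon,
    or until max_iters iterations have been used."""
    w = w0
    for t in range(max_iters):
        grad = dg(w)
        if abs(grad) < epsilon:  # convergence test: |derivative| < epsilon
            print(f"converged after {t} iterations")
            break
        w = w - eta * grad
    return w

# Same toy objective as before: g(w) = (w - 3)^2.
dg = lambda w: 2 * (w - 3)
w_hat = gradient_descent_until_converged(dg, w0=10.0)
print(w_hat)  # approximately 3
```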