[MUSIC] Let's take a moment to compare gradient ascent to stochastic gradient ascent, to really understand how the two approaches relate to each other. We'll see that this very, very small change in your algorithm and your implementation is going to make a big difference.

So we're going to build out a table comparing the two approaches, gradient in blue and stochastic gradient in green. We saw already that gradient is slow at computing an update for large datasets, while stochastic gradient is fast: its cost per update doesn't depend on the dataset size, so it's always fast. But the question is, what's the total time? Each iteration is cheaper for stochastic gradient, but is it cheaper overall? There are two answers to this question. In theory, stochastic gradient for large datasets is always faster, always. In practice, it's a little more nuanced, but it's often faster, so it's often a good thing to do. However, stochastic gradient has a significant problem: it's much more sensitive to the choice of parameters, like the choice of step size, and it has lots of practical issues. So a lot of the focus of today's module is to talk about those practical challenges of stochastic gradient, how to overcome them, and how to get the most benefit out of this small change to your algorithm.

We'll see a lot of pictures like this one, so I'm going to take a few minutes to explain it. This picture compares gradient to stochastic gradient. The red line here is the behavior of gradient as you iterate through the data, so as you make passes over the data. On the y-axis we see the data likelihood, so higher is better: we're fitting the data better. The blue line here is stochastic gradient. To be able to compare the two approaches, on the x-axis I'm not showing exact running time; I'm showing how many data points you need to touch. Stochastic gradient makes an update every time it sees a data point, while gradient makes an update every time it makes a full pass over the data. So we're showing how many passes we're making over the data. The full x-axis here is ten passes over the data, and you see that after ten passes, gradient reaches a likelihood that's much lower than that of stochastic gradient; stochastic gradient gets to a point that's much higher. Even if you look at longer time scales, you'll see that stochastic gradient converges faster. However, it doesn't converge smoothly to the optimal solution. It oscillates around the optimal solution, and we will understand today why that happens. That oscillation is one of the challenges introduced by stochastic gradient. Now I've extended the plot from 10 passes over the data to 100 passes over the data, and now we see that gradient is getting to solutions very close to those of stochastic gradient. But again, you see a lot of noise and a lot of oscillation from stochastic gradient: sometimes it's good, sometimes it's bad, sometimes it's good, sometimes it's bad. So that's the challenge there.

So here's a summary of what we've learned. Make a tiny change to our algorithm: instead of using the whole dataset to compute the gradient, use a single data point, and call that stochastic gradient. We're going to get better quality faster. However, it's going to be tricky: there will be some oscillations, and you have to learn about the practical issues you need to address in order to make this really effective. But this change is going to allow you to scale to billions of data points.
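To make that tiny change concrete, here is a minimal sketch in Python/NumPy of the two update rules for logistic regression with labels in {0, 1}, assuming we're maximizing the data log-likelihood by gradient ascent. The function names, step sizes, and the synthetic data are illustrative assumptions, not code from this course: the point is only that gradient ascent sums contributions from all N points before making a single update, while stochastic gradient updates after every individual point.

```python
import numpy as np

def sigmoid(scores):
    """Probability that y = 1 under a logistic regression model."""
    return 1.0 / (1.0 + np.exp(-scores))

def full_gradient_step(w, X, y, step_size):
    """One gradient ascent update: touches ALL N data points."""
    errors = y - sigmoid(X.dot(w))      # (N,) residuals, one per data point
    gradient = X.T.dot(errors)          # sum of per-point contributions
    return w + step_size * gradient

def stochastic_gradient_step(w, x_i, y_i, step_size):
    """One stochastic gradient ascent update: touches a SINGLE data point."""
    error = y_i - sigmoid(x_i.dot(w))   # scalar residual for one point
    gradient = error * x_i              # that one point's contribution
    return w + step_size * gradient

# Toy comparison: one "pass over the data" for each method (illustrative values).
rng = np.random.default_rng(0)
N, D = 1000, 5
X = rng.normal(size=(N, D))
y = (X.dot(rng.normal(size=D)) > 0).astype(float)   # synthetic 0/1 labels

w_full = full_gradient_step(np.zeros(D), X, y, step_size=1e-3)

w_sgd = np.zeros(D)
for i in rng.permutation(N):            # shuffle, then one update per point
    w_sgd = stochastic_gradient_step(w_sgd, X[i], y[i], step_size=1e-3)
```

After one pass over the data, `full_gradient_step` has made exactly one update while the stochastic loop has made N of them, which is why the plots above count passes over the data on the x-axis rather than iterations.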
With a change that small, even on your desktop you'll be able to deal with a ton of data, which is really super exciting. [MUSIC]