[MUSIC] So we've made a small change to the gradient algorithm, but it has a big impact. Instead of looking at all the data points at once, we look at them one at a time. Why should this even work? Why is this a good idea at all? Let's spend a little bit of time getting intuition for why it works. And this intuition is going to help us understand the behavior of the algorithm in practice.

In this picture, I'm showing you the landscape, the contour plot, for our sentiment analysis problem, which is the data that we'll be using throughout our module today. But just for a subset of the data, where we're only looking at two possible features: the coefficient for awful and the coefficient for awesome. And if we were to start here at this point, this would be the exact gradient that we would compute. And that exact gradient gives you the best possible improvement in going from wt to wt+1. That's the best thing you could possibly do. Now, there are many other directions that also improve the value, improve the likelihood, the quality. So for example, if we were to take this other direction over here and we reached some parameter w prime, that's still okay, because we're still going uphill; we're increasing the likelihood. In other words, the likelihood of w prime is going to be greater than the likelihood of wt. So in fact, any direction we take uphill will be good; the gradient is just the best direction.

The gradient direction is the sum of the contributions from each one of the data points. So if I look at that gradient direction, it is the sum over my data points of the contributions from each one of these xi's. So each one of these red lines here is the contribution from a data point, from each xi, yi, which is this part of the equation here. Interestingly, most contributions point upwards. So all of these over here are pointing in an upward direction. So if I were to pick any of those, I would make some progress. If I picked any of the other ones, like the ones back here, I would not make progress; those would be bad directions. But on average, most of them are good directions. And this is why stochastic gradient works. In stochastic gradient, we're just going to pick one of these directions, and most of them are okay. So most of the time, we're going to make progress. Sometimes, when we pick a bad direction, we won't make progress; we'll make negative progress. But on average, we're going to be making positive progress. And that's exactly why stochastic gradient works.

So if you think about the stochastic gradient algorithm, we're going one data point at a time, picking the direction associated with that data point, and taking a little step. So in the first iteration, we might pick this data point and make some progress; in the second one, I pick data point number two here and make negative progress. But in the third one, I pick another one that makes positive progress, and I pick a fourth one that makes positive progress. And I pick a fifth one that doesn't, and a sixth one that does, and so on. And so, most of the time we're making positive progress, and overall the likelihood is improving. [MUSIC]
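
To make the idea concrete, here is a minimal Python sketch of stochastic gradient ascent for logistic regression: each update uses the contribution of a single data point rather than the full-data gradient. This is an illustrative sketch, not the course's reference implementation; it assumes labels in {0, 1}, and the function names, step size, and number of passes are made up for the example.

```python
import numpy as np

def sigmoid(score):
    """Logistic function: P(y = 1 | x, w) for score = x . w."""
    return 1.0 / (1.0 + np.exp(-score))

def stochastic_gradient_ascent(X, y, step_size=0.1, n_passes=10, seed=0):
    """Fit logistic regression coefficients by stochastic gradient ascent.

    X : (N, D) feature matrix, y : (N,) labels in {0, 1}.
    Each step uses only one data point's contribution to the gradient,
    so a single step may go "downhill", but on average the steps move
    the coefficients toward higher likelihood.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_passes):
        for i in rng.permutation(len(y)):           # one data point at a time
            prediction = sigmoid(X[i] @ w)           # P(y = 1 | x_i, w)
            gradient_i = X[i] * (y[i] - prediction)  # contribution of point i
            w += step_size * gradient_i              # small step in that direction
    return w

# Hypothetical toy data with two features, e.g. counts of "awesome" and "awful"
X = np.array([[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]])
y = np.array([1, 0, 0, 0])  # 1 = positive review, 0 = negative review
w = stochastic_gradient_ascent(X, y)
```

With data like this toy example, the learned coefficient for the "awesome" count tends to come out larger than the one for the "awful" count, even though individual updates sometimes move in a bad direction, matching the intuition above that most single-point directions still go uphill.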