You've heard me hint about this a lot today, and the practical issues of stochastic gradient are pretty significant. So for most of the remainder of the module, we'll talk about how to address some of those practical issues.

Let's take a moment to review the stochastic gradient algorithm. We initialize our parameters, our coefficients, to some value, let's say all 0. And then, until convergence, we go one data point at a time, and one feature at a time, and we update the coefficient of that feature by computing the gradient at just that single data point. So I use the contribution of each data point one at a time. We're scanning down the data, updating the parameters one data point at a time. For example, I see my first data point here: 0 "awesome", 2 "awful", sentiment -1. I make an update which pushes me towards predicting -1 for this data point. Now if the next data point is also negative, I'm going to make another negative push. If the third one is negative, I make another negative push, and so on.

So one worry, one bad thing that can happen with stochastic gradient, is if your data is implicitly sorted in a certain way. For example, all the negative data points come before all the positives. That can introduce bad behaviors into the algorithm, bad practical performance. We worry about this a lot, and it's really significant. So, because of that, before you start running stochastic gradient you should always shuffle the rows of the data, mix them up, so that you don't have these long regions of, say, negatives before positives, or young people before older people, or people who live in one country before people who live in another country. You want to mix it all up. What that means, in the context of the stochastic gradient algorithm we just saw, is just adding a line at the beginning where you shuffle the data. So before doing anything, you should start by shuffling the data.
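To make the idea concrete, here's a minimal sketch of that loop with the shuffling line added at the start. It assumes a logistic regression model for the sentiment example (labels +1/-1 and word-count features), and the function name, step size, and number of passes are placeholders I chose for illustration, not values from the lecture.

```python
import numpy as np

def stochastic_gradient_ascent(feature_matrix, sentiment, step_size=0.01, max_passes=10):
    """One-data-point-at-a-time updates for logistic regression (illustrative sketch).

    feature_matrix: (N, D) array of feature values, e.g. word counts.
    sentiment: (N,) array of labels in {+1, -1}.
    """
    N, D = feature_matrix.shape
    coefficients = np.zeros(D)              # initialize all coefficients to 0

    # Shuffle the rows before doing anything else, so the algorithm never
    # sees long runs of all-negative (or all-positive) data points.
    order = np.random.permutation(N)

    for _ in range(max_passes):             # repeat until "convergence" (fixed passes here)
        for i in order:                      # one data point at a time
            x_i, y_i = feature_matrix[i], sentiment[i]

            # Predicted probability that this data point is positive.
            prob_positive = 1.0 / (1.0 + np.exp(-coefficients @ x_i))
            indicator = 1.0 if y_i == +1 else 0.0

            for j in range(D):               # one feature at a time
                # Gradient contribution of this single data point for feature j.
                derivative = x_i[j] * (indicator - prob_positive)
                coefficients[j] += step_size * derivative

    return coefficients
```

In this sketch the shuffle happens once, up front, which matches the "add a line at the beginning" description; in practice some implementations also re-shuffle before every pass over the data, which is a reasonable variation of the same idea.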