You've heard me hint about this a lot today, and the practical issues of stochastic gradient are pretty significant. So for most of the remainder of the module, we'll talk about how to address some of those practical issues.

Let's take a moment to review the stochastic gradient algorithm. We initialize our parameters, our coefficients, to some value, let's say all 0. And then, until convergence, we go one data point at a time, and one feature at a time, and we update the coefficient of that feature by computing the gradient at just that single data point. So I use the contribution of each data point one at a time. We're scanning down the data, updating the parameters one data point at a time. For example, I see my first data point here: 0 "awesome", 2 "awful", sentiment -1. I make an update which pushes me towards predicting -1 for this data point. Now if the next data point is also negative, I'm going to make another negative push. If the third one is negative, I make another negative push, and so on.

So one worry, one bad thing that can happen with stochastic gradient, is if your data is implicitly sorted in a certain way. For example, all the negative data points come before all the positives. That can introduce bad behaviors into the algorithm, bad practical performance. We worry about this a lot, and it's really significant. So, because of that, before you start running stochastic gradient you should always shuffle the rows of the data, mix them up, so that you don't have these long regions of, say, negatives before positives, or young people before older people, or people who live in one country before people who live in another country. You want to mix it all up. What that means, in the context of the stochastic gradient algorithm we just saw, is just adding a line at the beginning where you shuffle the data. So before doing anything, you should start by shuffling the data.
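To make the idea concrete, here's a minimal sketch of that loop with the shuffling line added at the start. It assumes a logistic regression model for the sentiment example (labels +1/-1 and word-count features), and the function name, step size, and number of passes are placeholders I chose for illustration, not values from the lecture.

```python
import numpy as np

def stochastic_gradient_ascent(feature_matrix, sentiment, step_size=0.01, max_passes=10):
    """One-data-point-at-a-time updates for logistic regression (illustrative sketch).

    feature_matrix: (N, D) array of feature values, e.g. word counts.
    sentiment: (N,) array of labels in {+1, -1}.
    """
    N, D = feature_matrix.shape
    coefficients = np.zeros(D)              # initialize all coefficients to 0

    # Shuffle the rows before doing anything else, so the algorithm never
    # sees long runs of all-negative (or all-positive) data points.
    order = np.random.permutation(N)

    for _ in range(max_passes):             # repeat until "convergence" (fixed passes here)
        for i in order:                      # one data point at a time
            x_i, y_i = feature_matrix[i], sentiment[i]

            # Predicted probability that this data point is positive.
            prob_positive = 1.0 / (1.0 + np.exp(-coefficients @ x_i))
            indicator = 1.0 if y_i == +1 else 0.0

            for j in range(D):               # one feature at a time
                # Gradient contribution of this single data point for feature j.
                derivative = x_i[j] * (indicator - prob_positive)
                coefficients[j] += step_size * derivative

    return coefficients
```

In this sketch the shuffle happens once, up front, which matches the "add a line at the beginning" description; in practice some implementations also re-shuffle before every pass over the data, which is a reasonable variation of the same idea.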