[MUSIC] The next few sections of this module are going to talk about more practical issues with stochastic gradient that you need to address when implementing these algorithms. I've made the next few sections optional because it can get really tedious to go through all these practical details, and you've already gotten a sense of how finicky it can be.

The first issue is that learning from one data point at a time is just way too noisy, so usually you use a few data points, and this is called mini-batches. We've really illustrated two extremes so far. We illustrated gradient ascent, where you make a full pass through the data and use all N data points for every update of your coefficients. And then we talked about stochastic gradient, where you look at just one data point when you're making an update to the coefficients. The question is, is there something in between, where you look at B data points, say 100? That's called mini-batches. It reduces noise, increases stability, and it's the right thing to do. And 100 is a really good number to use, by the way.

Here I'm showing the convergence paths for the same problem we've been looking at, but comparing a batch size of 1 with a batch size of 25. You observe two things. First, the batch size of 25 makes the convergence path smoother, which is a good thing. The second thing to observe, which is even more interesting, is that when you get near the optimum, the batch size of 1 really oscillates around the optimum, while the batch size of 25 oscillates much less, so it has better behavior around the optimum. And that better behavior is going to make this approach much easier to use in practice. So mini-batches are a great thing to do.

We've now introduced one more parameter to be tuned in the stochastic gradient algorithm, the batch size B. If it's too large, the algorithm behaves just like gradient ascent; for example, if you use a batch size of N, it's exactly the gradient ascent algorithm, so in this case the red line here is a batch size that's too large. If the batch size is too small, you get bad oscillation, or bad behavior; so B too small, in this case, doesn't converge very well. But if you pick the best batch size B, you get very nice behavior: you quickly reach a great solution and you stay around it. So picking the right batch size makes a big difference.

So let's go back to the simple stochastic gradient algorithm and modify it to introduce the notion of batch sizes. Instead of looking at one data point at a time, we're going to look at one batch at a time. If each batch has size B, we have N/B batches for a data set of size N. So if we have 1 billion data points and a batch size of 100, that's 1 billion divided by 100 batches. And now we go batch by batch, but instead of considering one data point at a time in the computation of the gradient, we consider the B data points in that mini-batch. The equation here just shows you the math behind basically the obvious thing of taking 100 data points and using them to estimate the gradient, instead of 1 data point or all 1 billion. [MUSIC]
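To make that batch-by-batch loop concrete, here is a minimal Python sketch of mini-batch gradient ascent for logistic regression, the kind of model we've been working with in this module. This is just an illustration under my own assumptions: the function and parameter names (minibatch_gradient_ascent, batch_size, step_size, max_passes) are made up for this sketch, the labels are assumed to be in {-1, +1}, and the constant step size and default values are illustrative rather than tuned. The update for each mini-batch follows the idea described above: add the step size times the sum of the per-data-point gradient contributions over the B points in the batch.

```python
import numpy as np

def minibatch_gradient_ascent(X, y, batch_size=25, step_size=1e-4, max_passes=10):
    """Mini-batch stochastic gradient ascent for logistic regression.

    Sketch only: names, defaults, and the constant step size are illustrative
    assumptions, not taken from the lecture. Labels y are assumed to be in {-1, +1}.
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(max_passes):
        # Shuffle once per pass so the batches change between passes.
        order = np.random.permutation(N)
        # Walk batch by batch: roughly N/B batches for a data set of size N
        # (the last batch may be smaller if B does not divide N evenly).
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # P(y = +1 | x, w) for every point in the mini-batch.
            prob = 1.0 / (1.0 + np.exp(-Xb.dot(w)))
            # Sum of per-point likelihood gradients over the B points:
            #   sum_i x_i * (1[y_i = +1] - P(y = +1 | x_i, w))
            indicator = (yb == +1).astype(float)
            gradient = Xb.T.dot(indicator - prob)
            # Coefficient update using the mini-batch estimate of the gradient.
            w = w + step_size * gradient
    return w
```

With batch_size=1 this reduces to the one-data-point stochastic gradient updates from the earlier sections, and with batch_size=N each update makes a full pass over the data, which is exactly gradient ascent; values like 25 or 100 sit in between, which is the whole point of mini-batches.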