[MUSIC] What we're going to do about it is introduce an algorithm called stochastic gradient instead of the standard gradient. The standard gradient says: update the parameters by computing the contribution of every data point and summing them up. Stochastic gradient, in its simplest possible form, says you don't have to use every single data point, just use one data point. That's not the exact gradient; it's an approximation to the gradient, as we will see. And instead of always using the same single data point, every time we do an update we'll use a different data point. So we're still only using one data point per update, but we use a different one every time. And this simple change, as we will see, is extremely powerful.

Let's now dig in and understand the change we hinted at, where you use a little bit of data to compute the gradient instead of using the entire data set. In fact, we're just going to use one data point to compute the gradient instead of everything. This is the gradient ascent algorithm for logistic regression, the one that we've seen earlier in the course, and I'm showing the gradient explicitly over here. It requires a sum over data points, which is the thing that we're trying to avoid. We're not going to do a sum over data points at every update, at every iteration, so let's throw out that sum. But each time we're going to pick a different data point. So we're going to introduce an outer loop here where we loop over the data points, 1 through N, compute the gradient with respect to that one data point, and then update the parameters. And we do that one data point at a time. So in gradient ascent we did a sum over all data points; in stochastic gradient, we're just going to approximate the gradient by the contribution of one data point. And we'll see why that works.

But let's go back to our table and investigate the total time for one step of stochastic gradient ascent. If it takes 1 millisecond to compute the contribution of data point x_i and you have 1,000 data points, the full gradient takes 1 second, but stochastic gradient is going to take you 1 millisecond to make the update, because it only looks at one data point. If it takes 1 second to compute the contribution of a data point, it's going to take you 1 second per update. If it goes back to 1 millisecond but you have 10 million data points, it's still going to cost you only 1 millisecond per update. And if it takes 1 millisecond to compute the contribution of a data point and you have 10 billion data points, it's still going to take you only 1 millisecond to make an update. So this looks too good to be true, and in a way it is. The thing to remember is that each update is going to be much cheaper than with the full gradient, but you might need more updates than you did with gradient in the first place. [MUSIC]
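
To make the difference concrete, here is a minimal NumPy sketch of the two update rules described above for logistic regression: full gradient ascent sums the contribution of all N data points for each update, while stochastic gradient ascent updates the coefficients using one data point at a time. The function names, the step size, the number of passes, and the use of a shuffled order over the data points are illustrative assumptions, not details from the lecture.

```python
import numpy as np

def sigmoid(score):
    # P(y = +1 | x, w) for the logistic regression model
    return 1.0 / (1.0 + np.exp(-score))

def full_gradient_ascent(X, y, step_size=0.1, n_iters=100):
    # Each update sums the contribution of ALL N data points.
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        errors = (y == +1) - sigmoid(X @ w)   # indicator minus predicted probability
        w += step_size * (X.T @ errors)       # sum over all N points: cost grows with N
    return w

def stochastic_gradient_ascent(X, y, step_size=0.1, n_passes=10):
    # Each update uses the contribution of a SINGLE data point.
    w = np.zeros(X.shape[1])
    N = X.shape[0]
    for _ in range(n_passes):
        for i in np.random.permutation(N):    # a different data point every update
            error_i = (y[i] == +1) - sigmoid(X[i] @ w)
            w += step_size * error_i * X[i]   # approximate gradient from one point
    return w

# Tiny illustrative usage on synthetic data (hypothetical, just to show the call)
X = np.hstack([np.ones((200, 1)), np.random.randn(200, 2)])   # intercept + 2 features
y = np.where(X[:, 1] + X[:, 2] > 0, +1, -1)
print(stochastic_gradient_ascent(X, y))
```

Notice that each stochastic update touches a single row of X, so its cost does not depend on N. That is exactly what the timing table is saying: whether you have 1,000 or 10 billion data points, one stochastic update still costs about 1 millisecond, but you will typically need more updates to converge than with the full gradient.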