[MUSIC] The second practical issue we're going to talk about is how do you measure convergence for stochastic gradient. If you have to look at all the data points to figure out whether you've converged, then the whole process is going to be pointless and meaningless. So we need to think about new techniques for measuring convergence. This, again, is going to be an optional section. It's very practical for those who actually want to implement a really practical stochastic gradient algorithm.

One way to think about it is, how do we actually make this plot? Here's a plot where stochastic gradient gets to the optimum before a single pass over the data, while gradient is taking 100 or more passes over the data. If, to get one point in this plot, I had to compute the likelihood over the entire data set, that would require me, for every little blue dot, to make a pass over the entire data set, which would make the whole thing really slow. If I had to make a pass over the data set to compute the likelihood, I might as well just use the full gradient and not have these noisy problems. And so we need to rethink how we're going to compute convergence, how we're going to plot that we're making progress. And here there's a really, really simple trick, really easy, really beautiful, that we can do.

So I'm showing here the stochastic gradient ascent algorithm for logistic regression, the one that we've been using so far. I go data point by data point, and I compute the contribution to the gradient, which is this part, the thing I'm calling partial j. Now, if I wanted to compute the log likelihood of the data to see what my quality was, how well I'm doing, I'd have to compute the following quantity for data point i. If the data point is a positive data point, I take the log of the probability that y = +1 given xi and the current coefficients. And if the data point is a negative one, yi = -1, then I need to take the log of the probability that yi = -1, which turns out to be the log of 1 - P(y = +1 | xi). And here's the beautiful thing: the quantity I need to compute the likelihood for a data point is exactly the same as the quantity I needed to compute the gradient. So I've already computed it. I can compute the contribution to the likelihood of this one data point. Which is great. I can't do it for every data point, but I can do it for one.

So at every iteration t, I can compute the likelihood of a particular data point. I can't use that by itself to measure convergence, because I might do well on one data point, classify it perfectly, but not on the others, so it would be a very noisy measure. But if I want to compute how well I'm doing, say, after 75 iterations, what I can do is look at the last few data points, see how well I did on them, take the likelihood for those data points, average it out, and create a smoother curve. So for every time step I keep an average, called a moving average, of the last few likelihoods in order to measure convergence. A minimal sketch of this idea in code follows below.

And so in the plot here, the plot that I've been showing you, now I can tell you, truth in advertising, what it actually was. The blue line here was not straight-up stochastic gradient; it was mini-batches of size 100, which is a great number to use. It still converges much faster than gradient. And to draw that blue line, I averaged the likelihood over the last 30 data points. So that's how I built the plot, and this is how you would have to build a plot if you're going to go through this whole process of stochastic gradient. [MUSIC]
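Here is a minimal sketch of the trick described above: stochastic gradient ascent for logistic regression that, at each update, reuses the predicted probability to record that one data point's log likelihood, and smooths it with a moving average over the last few points. This is not the course's reference implementation; names like `step_size`, `window`, and `n_passes` are illustrative choices.

```python
import numpy as np
from collections import deque

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

def stochastic_gradient_ascent(X, y, step_size=1e-4, window=30, n_passes=1):
    """X: (N, D) feature matrix; y: length-N array of +1/-1 labels."""
    N, D = X.shape
    w = np.zeros(D)                      # coefficients
    recent_ll = deque(maxlen=window)     # last `window` per-example log likelihoods
    avg_ll_history = []                  # one smoothed value per iteration, for plotting

    for _ in range(n_passes):
        for i in np.random.permutation(N):
            p = sigmoid(X[i] @ w)        # P(y = +1 | x_i, w)
            p = np.clip(p, 1e-15, 1 - 1e-15)   # guard against log(0)

            # Per-example log likelihood: log p if y_i = +1, log(1 - p) if y_i = -1.
            # It reuses the same probability p needed for the gradient step below.
            ll_i = np.log(p) if y[i] == +1 else np.log(1.0 - p)
            recent_ll.append(ll_i)
            avg_ll_history.append(np.mean(recent_ll))  # moving average -> smooth curve

            # Contribution of data point i to the gradient of the log likelihood.
            indicator = 1.0 if y[i] == +1 else 0.0
            w += step_size * X[i] * (indicator - p)

    return w, avg_ll_history
```

Plotting `avg_ll_history` gives a smoothed convergence curve like the blue line in the lecture's plot, without ever making a full pass over the data just to evaluate progress.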