[MUSIC] So we've made a small change to the gradient algorithm, but it has a big impact. Instead of looking at all the data points at once, we look at them one at a time. Why should this even work? Why is this a good idea at all? Let's spend a little bit of time getting intuition for why it works. And this intuition is going to help us understand the behavior of the algorithm in practice.

In this picture, I'm showing you the landscape, the contour plot, for our sentiment analysis problem, which is the data that we'll be using throughout our module today. But just for a subset of the data, where we're only looking at two possible features: the coefficient for awful and the coefficient for awesome. And if we were to start here at this point, this would be the exact gradient that we would compute. And that exact gradient gives you the best possible improvement in going from wt to wt+1. That's the best thing you could possibly do. Now, there are many other directions that also improve the value, improve the likelihood, the quality. So for example, if we were to take this other direction over here and we reached some parameter w prime, that's still okay, because we're still going uphill; we're increasing the likelihood. In other words, the likelihood of w prime is going to be greater than the likelihood of wt. So in fact, any direction we take uphill will be good; the gradient is just the best direction.

The gradient direction is the sum of the contributions from each one of the data points. So if I look at that gradient direction, it is the sum over my data points of the contributions from each one of these xi's. So each one of these red lines here is the contribution from a data point, from each xi, yi, which is this part of the equation here. Interestingly, most contributions point upwards. So all of these over here are pointing in an upward direction. So if I were to pick any of those, I would make some progress. If I picked any of the other ones, like the ones back here, I would not make progress; those would be bad directions. But on average, most of them are good directions. And this is why stochastic gradient works. In stochastic gradient, we're just going to pick one of these directions, and most of them are okay. So most of the time, we're going to make progress. Sometimes, when we pick a bad direction, we won't make progress; we'll make negative progress. But on average, we're going to be making positive progress. And that's exactly why stochastic gradient works.

So if you think about the stochastic gradient algorithm, we're going one data point at a time, picking the direction associated with that data point, and taking a little step. So in the first iteration, we might pick this data point and make some progress; in the second one, I pick data point number two here and make negative progress. But in the third one, I pick another one that makes positive progress, and I pick a fourth one that makes positive progress. And I pick a fifth one that doesn't, and a sixth one that does, and so on. And so, most of the time we're making positive progress, and overall the likelihood is improving. [MUSIC]
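
To make the idea concrete, here is a minimal Python sketch of stochastic gradient ascent for logistic regression: each update uses the contribution of a single data point rather than the full-data gradient. This is an illustrative sketch, not the course's reference implementation; it assumes labels in {0, 1}, and the function names, step size, and number of passes are made up for the example.

```python
import numpy as np

def sigmoid(score):
    """Logistic function: P(y = 1 | x, w) for score = x . w."""
    return 1.0 / (1.0 + np.exp(-score))

def stochastic_gradient_ascent(X, y, step_size=0.1, n_passes=10, seed=0):
    """Fit logistic regression coefficients by stochastic gradient ascent.

    X : (N, D) feature matrix, y : (N,) labels in {0, 1}.
    Each step uses only one data point's contribution to the gradient,
    so a single step may go "downhill", but on average the steps move
    the coefficients toward higher likelihood.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_passes):
        for i in rng.permutation(len(y)):           # one data point at a time
            prediction = sigmoid(X[i] @ w)           # P(y = 1 | x_i, w)
            gradient_i = X[i] * (y[i] - prediction)  # contribution of point i
            w += step_size * gradient_i              # small step in that direction
    return w

# Hypothetical toy data with two features, e.g. counts of "awesome" and "awful"
X = np.array([[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]])
y = np.array([1, 0, 0, 0])  # 1 = positive review, 0 = negative review
w = stochastic_gradient_ascent(X, y)
```

With data like this toy example, the learned coefficient for the "awesome" count tends to come out larger than the one for the "awful" count, even though individual updates sometimes move in a bad direction, matching the intuition above that most single-point directions still go uphill.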