[MUSIC] The simplest form of stochastic gradient ascent uses an estimate of the gradient from one data point at a time. So let's see what that looks like. We're back to our gradient ascent setting, where we're going to start from some point in space, compute the gradient of the likelihood function, take a little step in that space, then compute the gradient again, take another little step, another little step, and another little step. This is the algorithm on the right, the one we've been using throughout this module and before it. The question here is: how expensive is this algorithm?

Let's look at the case of logistic regression, just to remind ourselves of how expensive it is. What we need to do here is sum a function over every data point. You can think of each term as the contribution that the data point (x_i, y_i) makes to the gradient, and we add those contributions up over all the data points. I'm going to denote that term, the derivative of l_i with respect to the parameter w_j, as the contribution of the i-th data point to the gradient. What we're doing is summing up each one of these contributions to compute the final gradient, the final update for the parameters.

So how computationally costly is this approach? Let's ask that question. Say, for example, that computing the contribution of one data point takes me one millisecond. That's really fast, just one millisecond. When I have 1,000 data points, that's not too bad: 1,000 milliseconds is one second. I don't feel too bad about that outcome.

What if you have a much more complex model, like a neural network, where one contribution takes 1 second instead of 1 millisecond, and you have 1,000 data points? Now computing one step of the gradient requires 1,000 seconds, which, if you can't do in your head (I can't), is 16.7 minutes; I cheated and used a calculator. Now things are getting a little more complicated. 16.7 minutes is a lot to wait, and a thousand data points is nothing: we said YouTube gets five billion data points per day, or more.

So let's go back to the one-millisecond case. It only takes 1 millisecond to compute each gradient contribution, but now you have 10 million data points. What happens? The total time becomes 2.8 hours. Things are starting to get nasty, but again, 10 million data points is not that much; YouTube sees five billion, ten billion, or more per day.

So what happens if you have ten billion data points and it takes you just one millisecond to compute each gradient contribution? Then it's going to take 10 billion milliseconds, which (I cheated here too) is 115.7 days. So it's going to take about four months, approximately, just to compute one gradient. And if you remember from our implementations, you have to do many gradient updates before you converge. So this is definitely not going to work; we need to do something about it. [MUSIC]
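To make the per-data-point sum concrete, here is a minimal sketch of full-batch gradient ascent for logistic regression. The names here (predict_probability, gradient_ascent, step_size, num_iterations) are illustrative choices, not from the lecture; the point is only that every single parameter update sums a contribution from every data point.

```python
# Minimal sketch of full-batch gradient ascent for logistic regression.
# Assumes labels are +1 / -1 and features has one row per data point.
import numpy as np

def predict_probability(features, coefficients):
    """P(y = +1 | x, w) under the logistic model."""
    scores = features @ coefficients
    return 1.0 / (1.0 + np.exp(-scores))

def gradient_ascent(features, labels, step_size=1e-5, num_iterations=100):
    coefficients = np.zeros(features.shape[1])
    for _ in range(num_iterations):
        # per-data-point error: indicator of y = +1 minus predicted probability
        errors = (labels == +1).astype(float) - predict_probability(features, coefficients)
        # gradient = sum over ALL N data points of each point's contribution
        gradient = features.T @ errors
        coefficients += step_size * gradient  # take one "little step"
    return coefficients
```

The inner product `features.T @ errors` is exactly the sum of per-point contributions described above, so the cost of one step grows linearly with the number of data points.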
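And here is the back-of-the-envelope arithmetic behind the numbers quoted above, assuming the lecture's figure of one millisecond per data-point contribution and the data set sizes mentioned in the lecture.

```python
# Cost of ONE full gradient at 1 ms per data-point contribution.
ms_per_point = 1.0
for n_points in [1_000, 10_000_000, 10_000_000_000]:
    seconds = n_points * ms_per_point / 1000
    print(f"{n_points:>14,} points -> {seconds:,.0f} s "
          f"= {seconds / 3600:,.1f} h = {seconds / 86400:,.1f} days")

# 1,000 points          -> 1 second
# 10,000,000 points     -> ~2.8 hours
# 10,000,000,000 points -> ~115.7 days, roughly four months per gradient step
```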