[MUSIC] The simplest form of stochastic gradient ascent uses an estimate of the gradient from one data point at a time. So let's see what that looks like. We're back to our gradient ascent setting, where we're going to start from some point in space, compute the gradient of the likelihood function, take a little step in that space, then compute the gradient again, take another little step, another little step, and another little step. This is the algorithm on the right, the one we've been using throughout this module and before it. The question here is: how expensive is this algorithm?

Let's look at the case of logistic regression, just to remind ourselves of how expensive it is. What we need to do here is sum a function over every data point. You can think of each term as the contribution that the data point (x_i, y_i) makes to the gradient, and we add those contributions up over all the data points. I'm going to denote that term, the derivative of l_i with respect to the parameter w_j, as the contribution of the i-th data point to the gradient. What we're doing is summing up each one of these contributions to compute the final gradient, the final update for the parameters.

So how computationally costly is this approach? Let's ask that question. Say, for example, that computing the contribution of one data point takes me one millisecond. That's really fast, just one millisecond. When I have 1,000 data points, that's not too bad: 1,000 milliseconds is one second. I don't feel too bad about that outcome.

What if you have a much more complex model, like a neural network, where one contribution takes 1 second instead of 1 millisecond, and you have 1,000 data points? Now computing one step of the gradient requires 1,000 seconds, which, if you can't do in your head (I can't), is 16.7 minutes; I cheated and used a calculator. Now things are getting a little more complicated. 16.7 minutes is a lot to wait, and a thousand data points is nothing: we said YouTube gets five billion data points per day, or more.

So let's go back to the one-millisecond case. It only takes 1 millisecond to compute each gradient contribution, but now you have 10 million data points. What happens? The total time becomes 2.8 hours. Things are starting to get nasty, but again, 10 million data points is not that much; YouTube sees five billion, ten billion, or more per day.

So what happens if you have ten billion data points and it takes you just one millisecond to compute each gradient contribution? Then it's going to take 10 billion milliseconds, which (I cheated here too) is 115.7 days. So it's going to take about four months, approximately, just to compute one gradient. And if you remember from our implementations, you have to do many gradient updates before you converge. So this is definitely not going to work; we need to do something about it. [MUSIC]
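To make the per-data-point sum concrete, here is a minimal sketch of full-batch gradient ascent for logistic regression. The names here (predict_probability, gradient_ascent, step_size, num_iterations) are illustrative choices, not from the lecture; the point is only that every single parameter update sums a contribution from every data point.

```python
# Minimal sketch of full-batch gradient ascent for logistic regression.
# Assumes labels are +1 / -1 and features has one row per data point.
import numpy as np

def predict_probability(features, coefficients):
    """P(y = +1 | x, w) under the logistic model."""
    scores = features @ coefficients
    return 1.0 / (1.0 + np.exp(-scores))

def gradient_ascent(features, labels, step_size=1e-5, num_iterations=100):
    coefficients = np.zeros(features.shape[1])
    for _ in range(num_iterations):
        # per-data-point error: indicator of y = +1 minus predicted probability
        errors = (labels == +1).astype(float) - predict_probability(features, coefficients)
        # gradient = sum over ALL N data points of each point's contribution
        gradient = features.T @ errors
        coefficients += step_size * gradient  # take one "little step"
    return coefficients
```

The inner product `features.T @ errors` is exactly the sum of per-point contributions described above, so the cost of one step grows linearly with the number of data points.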
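And here is the back-of-the-envelope arithmetic behind the numbers quoted above, assuming the lecture's figure of one millisecond per data-point contribution and the data set sizes mentioned in the lecture.

```python
# Cost of ONE full gradient at 1 ms per data-point contribution.
ms_per_point = 1.0
for n_points in [1_000, 10_000_000, 10_000_000_000]:
    seconds = n_points * ms_per_point / 1000
    print(f"{n_points:>14,} points -> {seconds:,.0f} s "
          f"= {seconds / 3600:,.1f} h = {seconds / 86400:,.1f} days")

# 1,000 points          -> 1 second
# 10,000,000 points     -> ~2.8 hours
# 10,000,000,000 points -> ~115.7 days, roughly four months per gradient step
```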