[MUSIC] The second question on stochastic gradient is how do you pick the step size eta. And this is a significant issue, just like it is with gradient. For both of them it's kind of annoying, it's a pain to figure out how to pick that coefficient eta. But it turns out that because of the oscillations of stochastic gradient, picking eta can be even more annoying, much more annoying.

So, if we go back to the data set we've been using, I've shown you this blue curve many times. This was the best eta that I could find, the best step size. Now, if I were to pick smaller step sizes, so smaller etas, it will behave kind of like regular gradient: it will be slower to converge and you'll see fewer oscillations. It will eventually get there, but much more slowly. So we worry about that a bit. On the other hand, if instead of using the best step size we try a larger step size, because we thought we could make more progress more quickly, you'll see these crazy oscillations, and the oscillations are much worse than what you observe with gradient, and I showed you gradient. So you have to be really careful: if you pick a step size that's too large, things can behave extremely erratically. And in fact, if you pick a step size that's very, very large, you can end up with behaviors like this. So this black line here was an eta that was way too large, that's a technical term here that we like to use. And in this case, the solution is not even close to anything we got, even with the oscillating etas that I showed you in the previous slide. It's a huge gap, and so etas that are too large lead to really bad behavior in stochastic gradient.

The rule of thumb that we described for gradient, for picking eta, is basically the same as the one for picking the step size for stochastic gradient. The same as for gradient, but unfortunately it requires much more trial and error. So it's even more annoying, and you might spend a lot of time on that trial and error. Even though stochastic gradient is a hundred times faster to converge, it's possible to spend a hundred times more effort trying to find the right step size, so just be prepared. But we just try several values, exponentially spaced from each other, and try to find somewhere between an eta that's too small and an eta that's too big, and then find one that's just right.

And I mentioned this in the gradient section, but for stochastic gradient it's even more important, for those who end up exploring this further: there's an advanced step where you make the step size decrease over iterations. So, for example, you might have an eta that depends on what iteration you're in, and often you set it to something like some constant eta zero divided by the iteration number t, that is, eta_t = eta_0 / t. This approach tends to reduce the noise and make things behave quite a bit better. [MUSIC]
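Here is a minimal sketch, not the lecture's own code, of the two ideas above: trying step sizes that are exponentially spaced from each other, and decaying the step size as eta_t = eta_0 / t. The toy least-squares objective, the data, and all function names here are assumptions made just for illustration.

```python
# Sketch only: toy data and names are assumed, not from the lecture.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = X @ w_true + noise
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def stochastic_gradient_descent(eta0, n_iters=2000, decay=False):
    """Plain SGD on squared error, using one random example per update."""
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)                       # pick one data point at random
        grad = 2 * (X[i] @ w - y[i]) * X[i]       # gradient of (x_i . w - y_i)^2
        eta = eta0 / t if decay else eta0         # eta_t = eta_0 / t when decaying
        w -= eta * grad
    # report average squared error over the full data set
    return np.mean((X @ w - y) ** 2)

# Rule of thumb: try step sizes exponentially spaced from each other,
# looking for one between "too small" (slow) and "too large" (erratic).
# The largest value here may diverge, mirroring the erratic behavior above.
for eta0 in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:
    print(f"eta0={eta0:g}  fixed-step error={stochastic_gradient_descent(eta0):.4f}")

# Decaying schedule: early steps are large, later steps shrink,
# which tends to damp the oscillations near convergence.
print("decaying eta_t = 0.1/t  error =",
      round(stochastic_gradient_descent(0.1, decay=True), 4))
```

Running this, the smallest fixed step sizes converge slowly, a mid-range value does best, and the largest one behaves erratically, which is the pattern described in the plots above; the decaying schedule is one common way to tame the oscillations.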