1 00:00:00,000 --> 00:00:05,040 [MUSIC] 2 00:00:05,040 --> 00:00:08,610 The second question on stochastic gradient is how do you pick the step size eta. 3 00:00:09,630 --> 00:00:12,290 And this is a significant issue, just like it is with gradient. 4 00:00:13,820 --> 00:00:16,070 For both of them, it's kind of annoying, 5 00:00:16,070 --> 00:00:19,180 it's a pain to figure out how to pick that coefficient eta. 6 00:00:19,180 --> 00:00:22,920 But it turns out that because of the oscillations of stochastic gradient, 7 00:00:22,920 --> 00:00:26,570 picking eta can be even more annoying, much more annoying. 8 00:00:28,120 --> 00:00:31,460 So, if we go back to the data set that we've been using, 9 00:00:31,460 --> 00:00:34,060 I've shown you this blue curve many times. 10 00:00:34,060 --> 00:00:39,300 This was the best eta that I could find, the best step size. 11 00:00:39,300 --> 00:00:45,680 Now, if I were to pick smaller step sizes, 12 00:00:45,680 --> 00:00:50,655 so, smaller etas, it will behave kind of like regular gradient. 13 00:00:50,655 --> 00:00:54,790 It will be slower to converge, and you see fewer oscillations. 14 00:00:54,790 --> 00:00:57,720 It will eventually get there, but much more slowly. 15 00:00:57,720 --> 00:00:59,390 So, we worry about that a bit. 16 00:01:00,620 --> 00:01:06,870 On the other hand, if instead of using the best step size, 17 00:01:06,870 --> 00:01:12,570 we try to use a larger step size because we thought 18 00:01:12,570 --> 00:01:15,480 it could make more progress more quickly, 19 00:01:15,480 --> 00:01:19,180 you'll see these crazy oscillations, and the oscillations are much worse than what you 20 00:01:19,180 --> 00:01:21,990 observe with gradient, as I showed you with gradient. 21 00:01:21,990 --> 00:01:26,220 So, you have to be really careful: if you pick an eta that's too 22 00:01:26,220 --> 00:01:28,930 large, things can behave extremely erratically. 
23 00:01:30,460 --> 00:01:33,170 And in fact, if you pick a step size that's very, 24 00:01:33,170 --> 00:01:37,530 very large, you can end up with behaviors like this. 25 00:01:37,530 --> 00:01:42,810 So, this black line here was an eta that was 26 00:01:42,810 --> 00:01:48,520 way too large, that's a technical term here that we like to use. 27 00:01:49,550 --> 00:01:54,070 And in this case, the solution is not even 28 00:01:54,070 --> 00:01:59,280 close to anything we got, even 29 00:01:59,280 --> 00:02:02,400 with the oscillating etas that we showed you in the previous slide. 30 00:02:02,400 --> 00:02:04,453 It's a huge gap, and so 31 00:02:04,453 --> 00:02:10,430 an eta that's too large leads to really bad behavior in stochastic gradient. 32 00:02:12,060 --> 00:02:15,780 The rule of thumb that we described for gradient for picking eta is basically 33 00:02:15,780 --> 00:02:21,390 the same as the one for picking the step size for stochastic gradient. 34 00:02:21,390 --> 00:02:22,290 The same as for 35 00:02:22,290 --> 00:02:27,350 gradient, but unfortunately it requires much more trial and error. 36 00:02:27,350 --> 00:02:31,470 So, it's even more annoying, and you might spend a lot of time in the trial and error. 37 00:02:31,470 --> 00:02:34,120 Even though it's a hundred times faster to converge, it's possible to spend 38 00:02:34,120 --> 00:02:39,450 a hundred times more effort trying to find the right step size, so just be prepared. 39 00:02:39,450 --> 00:02:43,350 But we just try several values exponentially spaced from each other, 40 00:02:43,350 --> 00:02:46,630 and try to find somewhere between an eta that's too small and 41 00:02:46,630 --> 00:02:50,795 an eta that is too big, and then find one that's just right. 
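The rule of thumb just described, trying exponentially spaced step sizes and keeping the one between "too small" and "too big", can be sketched roughly as follows. This is only an illustration under assumptions that are not in the lecture: a synthetic one-parameter least-squares problem (true weight 3.0), gradient descent on squared error, and hypothetical helper names (`make_data`, `sgd`, `step_size_sweep`).

```python
import random


def make_data(n=200, seed=0):
    # Synthetic data (an assumption for illustration): y = 3x + small noise.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def mean_squared_error(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sgd(xs, ys, eta, passes=5, seed=1):
    # Stochastic gradient descent with a constant step size eta:
    # one data point per update, shuffled on every pass.
    rng = random.Random(seed)
    w = 0.0
    order = list(range(len(xs)))
    for _ in range(passes):
        rng.shuffle(order)
        for i in order:
            grad = 2.0 * (w * xs[i] - ys[i]) * xs[i]
            w -= eta * grad
            if abs(w) > 1e6:
                return w  # stop early: this eta has clearly diverged
    return w


def step_size_sweep(xs, ys, etas):
    # Map each candidate eta to the final mean squared error it reaches.
    return {eta: mean_squared_error(sgd(xs, ys, eta), xs, ys) for eta in etas}


xs, ys = make_data()
etas = [10.0 ** k for k in range(-5, 2)]  # 1e-5, 1e-4, ..., 1e1
results = step_size_sweep(xs, ys, etas)
best_eta = min(results, key=results.get)

for eta in etas:
    print(f"eta={eta:g}  final MSE={results[eta]:.4g}")
print("best eta:", best_eta)
```

On a problem like this, the smallest step sizes barely move from the starting point, the largest one blows up, and the usable values sit in between, which is the pattern the sweep is meant to expose.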
42 00:02:50,795 --> 00:02:54,365 And I mentioned this in the gradient section, but for stochastic gradient it's 43 00:02:54,365 --> 00:02:58,935 even more important: for those who end up exploring this further, there's 44 00:02:58,935 --> 00:03:03,985 an advanced step where you would make the step size decrease over iterations. 45 00:03:03,985 --> 00:03:08,710 And so, for example, you might have an eta that depends on what iteration you're in, 46 00:03:08,710 --> 00:03:16,018 and often you set it to something like some constant 47 00:03:16,018 --> 00:03:22,230 here, eta zero, divided by the iteration number t. 48 00:03:25,420 --> 00:03:31,736 This approach tends to reduce noise and 49 00:03:31,736 --> 00:03:35,682 make things behave quite a bit better. 50 00:03:35,682 --> 00:03:40,099 [MUSIC]
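As a rough illustration of the decreasing step size mentioned above, eta at iteration t set to eta zero divided by t, the sketch below compares it with a constant step size. The setup is assumed rather than taken from the lecture: a synthetic one-parameter least-squares problem, gradient descent on squared error, and hypothetical names (`make_data`, `sgd_path`, `tail_spread`).

```python
import random
import statistics


def make_data(n=200, noise=0.5, seed=0):
    # Synthetic data (an assumption for illustration): y = 3x + noise.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def sgd_path(xs, ys, step_size, iterations=5000, seed=1):
    # step_size(t) returns eta for iteration t, with t starting at 1.
    # Returns the weight after every update so the noise can be inspected.
    rng = random.Random(seed)
    w = 0.0
    path = []
    for t in range(1, iterations + 1):
        i = rng.randrange(len(xs))  # pick one data point at random
        grad = 2.0 * (w * xs[i] - ys[i]) * xs[i]
        w -= step_size(t) * grad
        path.append(w)
    return path


def tail_spread(path, last=500):
    # Standard deviation of the weight over the last few hundred updates:
    # a simple measure of how much the iterates are still bouncing around.
    return statistics.pstdev(path[-last:])


eta0 = 1.0
xs, ys = make_data()
constant_path = sgd_path(xs, ys, lambda t: eta0)      # eta_t = eta0
decaying_path = sgd_path(xs, ys, lambda t: eta0 / t)  # eta_t = eta0 / t

print("tail spread, constant eta:", tail_spread(constant_path))
print("tail spread, eta0 / t:    ", tail_spread(decaying_path))
```

Both runs see the same sequence of data points, so the difference in tail spread comes only from the step size schedule. The choice of `eta0` still matters with the decaying schedule: if it is too small, the steps shrink before the weight gets close to the solution.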