The next few sections of this module are going to talk about more practical issues with stochastic gradient, which you need to address for implementing those algorithms. I made the next few sections optional because it can get really tedious to go through all these practical details. You've already gotten a sense of how finicky it can be, and there are many other practical details, so I'm going to make those sections optional.

The first issue is that learning from one data point is just way too noisy; usually you use a few data points, and this is called mini-batches. So far we've really illustrated two extremes. We illustrated gradient ascent, where you make a full pass through the data and use N data points for every update of your coefficients. And then we talked about stochastic gradient, where you look at just one data point when you make an update to the coefficients. The question is, is there something in between, where you look at B data points, say 100? That's called mini-batches: it reduces noise, increases stability, and it's the right thing to do. And 100 is a really good number to use, by the way.

Here I'm showing the convergence paths for the same problem we've been looking at, but comparing a batch size of 1 with a batch size of 25. You observe two things. First, the batch size of 25 makes the convergence path smoother, which is a good thing. The second thing to observe, which is even more interesting, is what happens around the optimum: with a batch size of 1, the path really oscillates around the optimum, while with a batch size of 25 it oscillates much less, so it has better behavior around the optimum. And by better behavior, I mean it's going to be much easier to use this approach in practice. So mini-batches are a great thing to do.

Now we've introduced one more parameter to be tuned in this stochastic gradient algorithm: the batch size B. If it's too large, it behaves just like gradient ascent; for example, if you use a batch size of N, it's exactly the gradient ascent algorithm. So in this case, the red line here is a batch size that's too large. If the batch size is too small, you get bad oscillation, or bad behavior: B too small, in this case, doesn't converge very well.
But if you pick the best batch size B, you get very nice behavior: you quickly get to a great solution, and you stay around it. So picking the right batch size makes a big difference.

Let's go back to the simple stochastic gradient algorithm and modify it to introduce the notion of batch sizes. Instead of looking at one data point at a time, we're going to look at one batch at a time. If the batch size is B, we have N over B batches for a data set of size N. So if we have 1 billion data points and a batch size of 100, there are 1 billion over 100 of those batches. Now we go batch by batch, but instead of considering one data point at a time in the computation of the gradient, we consider the B data points in that mini-batch. The equation here just shows you the math behind basically the obvious thing of taking 100 data points and using them to estimate the gradient, instead of 1 or instead of 1 billion.
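To make the batch-by-batch loop concrete, here is a minimal sketch of mini-batch stochastic gradient ascent for a logistic regression log-likelihood, in the spirit of the update described above. The function name and the default values for step_size, batch_size, and num_passes are illustrative assumptions, not the course's reference code.

```python
import numpy as np

def minibatch_gradient_ascent(X, y, batch_size=100, step_size=1e-4, num_passes=10):
    """Mini-batch stochastic gradient ascent sketch for logistic regression.

    X: (N, D) feature matrix; y: (N,) labels in {-1, +1}.
    Each coefficient update uses B = batch_size data points,
    instead of 1 (plain stochastic gradient) or all N (full gradient ascent).
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_passes):
        # Shuffle once per pass so the same batches are not reused in the same order.
        order = np.random.permutation(N)
        for start in range(0, N, batch_size):
            batch = order[start:start + batch_size]
            # P(y = +1 | x, w) for the points in this mini-batch.
            scores = X[batch].dot(w)
            prob_pos = 1.0 / (1.0 + np.exp(-scores))
            indicator = (y[batch] == +1).astype(float)
            # Sum of per-point likelihood gradients over the B points in the batch.
            gradient = X[batch].T.dot(indicator - prob_pos)
            w += step_size * gradient
    return w
```

Note that setting batch_size=1 recovers the one-point stochastic gradient update, and batch_size=N recovers full gradient ascent, which is exactly the interpolation between the two extremes discussed above.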