In the previous video, we talked about stochastic gradient descent and how it can be much faster than batch gradient descent. In this video, let's talk about another variation on these ideas, called mini-batch gradient descent, which can sometimes work even a bit faster than stochastic gradient descent.

To summarize the algorithms we have talked about so far: in batch gradient descent we use all m examples in each iteration, whereas in stochastic gradient descent we use a single example in each iteration. What mini-batch gradient descent does is somewhere in between. Specifically, with this algorithm we use b examples in each iteration, where b is a parameter called the mini-batch size. So the idea is that this sits somewhere in between batch gradient descent and stochastic gradient descent. It is just like batch gradient descent, except that we use a much smaller batch size. A typical choice for the value of b might be b = 10, and a typical range might be anywhere from b = 2 up to b = 100. So that would be a pretty typical range of values for the mini-batch size, and the idea is that rather than using one example at a time or m examples at a time, we use b examples at a time.

Let me write this out informally. For this example, let's say b = 10. We get the next 10 examples from the training set, that is, the examples (x^(i), y^(i)) up through (x^(i+9), y^(i+9)), 10 examples altogether, and then we perform essentially a gradient descent update using these 10 examples:

θ_j := θ_j - α * (1/10) * Σ_{k=i}^{i+9} (h_θ(x^(k)) - y^(k)) * x_j^(k)   (for every j)

In this expression we are summing the gradient terms over the 10 examples; the 10 and the i + 9 come from the choice of the parameter b = 10. After this we increase i by 10, move on to the next 10 examples, and keep going like that.

To write out the entire algorithm in full, and to simplify the indexing, I am going to assume a mini-batch size of 10 and a training set size of 1,000. We loop for i = 1, 11, 21, 31, and so on, stepping in increments of 10, because we look at 10 examples at a time.
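Before finishing that loop, here is a minimal Python/NumPy sketch of the single mini-batch update just described, for a linear regression hypothesis h_θ(x) = θᵀx. The function name and array shapes are illustrative assumptions, not something given in the lecture.

```python
import numpy as np

def minibatch_update(theta, X_batch, y_batch, alpha):
    """One mini-batch gradient descent step for linear regression.

    theta   : (n,) parameter vector
    X_batch : (b, n) matrix whose rows are the examples x^(i) ... x^(i+b-1)
    y_batch : (b,)  vector of the corresponding targets
    alpha   : learning rate
    """
    b = X_batch.shape[0]
    # h_theta(x^(k)) - y^(k) for every example k in the mini-batch
    errors = X_batch @ theta - y_batch        # shape (b,)
    # (1/b) * sum_k (h_theta(x^(k)) - y^(k)) * x_j^(k), for every j at once
    gradient = (X_batch.T @ errors) / b       # shape (n,)
    return theta - alpha * gradient
```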
We then perform that gradient descent update using 10 examples at a time; again, the 10 and the i + 9 are a consequence of having chosen the mini-batch size to be 10. This outer for-loop ends at i = 991, because with 1,000 training examples I need 100 steps of size 10 to get through the training set. So this is mini-batch gradient descent.

Compared to batch gradient descent, this also allows us to make progress much faster. Take again our running example of U.S. Census data with 300 million training examples: after looking at just the first 10 examples we can already start making progress in improving the parameters θ, so we don't need to scan through the entire training set. We look at the first 10 examples and update the parameters, then look at the second 10 examples and modify the parameters a little bit again, and so on. That is why mini-batch gradient descent can be faster than batch gradient descent: you can start modifying the parameters after looking at just 10 examples, rather than having to wait until you have scanned through every single one of the 300 million training examples.

How about mini-batch gradient descent versus stochastic gradient descent? Why would we want to look at b examples at a time rather than a single example at a time, as stochastic gradient descent does? The answer is vectorization. In particular, mini-batch gradient descent is likely to outperform stochastic gradient descent only if you have a good vectorized implementation. In that case, the sum over 10 examples can be performed in a more vectorized way, which allows you to partially parallelize the computation over the 10 examples. In other words, with appropriate vectorization you can use good numerical linear algebra libraries to partially parallelize the gradient computation over the b examples, whereas if you look at just a single example at a time, as in stochastic gradient descent, there is much less to parallelize over.

One disadvantage of mini-batch gradient descent is that there is now this extra parameter b, the mini-batch size, which you may have to fiddle with and which may therefore take time. But if you have a good vectorized implementation, this can sometimes run even faster than stochastic gradient descent.
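To make the vectorization point concrete, here is a sketch of the full algorithm in Python/NumPy under the lecture's setting of b = 10. The per-epoch shuffle, the default hyperparameter values, and the function name are my own assumptions, and the indices run 0, 10, 20, ... rather than the 1-indexed 1, 11, ..., 991 used above. The key point is that the per-batch gradient is a single matrix-vector product, which the underlying linear algebra library can parallelize over the b examples.

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, b=10, num_epochs=10):
    """Mini-batch gradient descent for linear regression.

    X : (m, n) training inputs, y : (m,) targets.
    Each inner step uses b consecutive examples, and the per-batch
    gradient is computed as one matrix-vector product.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_epochs):
        perm = np.random.permutation(m)   # shuffle before each pass (an added assumption)
        X_shuf, y_shuf = X[perm], y[perm]
        for i in range(0, m, b):          # i = 0, 10, 20, ... when b = 10
            X_batch = X_shuf[i:i + b]
            y_batch = y_shuf[i:i + b]
            errors = X_batch @ theta - y_batch            # h_theta(x) - y over the batch
            theta -= alpha * (X_batch.T @ errors) / len(X_batch)
    return theta
```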
So that was mini-batch gradient descent, an algorithm that in some sense does something in between what stochastic gradient descent does and what batch gradient descent does. If you choose a reasonable value of b (I usually use b = 10, but other values anywhere from, say, 2 to 100 would be reasonably common) and you use a good vectorized implementation, it can sometimes be faster than both stochastic gradient descent and batch gradient descent.