For many learning algorithms, among them linear regression, logistic regression and neural networks, the way we derived the algorithm was by coming up with a cost function, or coming up with an optimization objective, and then using an algorithm like gradient descent to minimize that cost function. When we have a very large training set, though, gradient descent becomes a computationally very expensive procedure. In this video, we'll talk about a modification to the basic gradient descent algorithm called Stochastic gradient descent, which will allow us to scale these algorithms to much bigger training sets.

Suppose you are training a linear regression model using gradient descent. As a quick recap, the hypothesis will look like this, and the cost function will look like this, which is one half of the average squared error of your hypothesis on your m training examples; and the cost function, as we've already seen, looks like this sort of bowl-shaped function. So, plotted as a function of the parameters theta 0 and theta 1, the cost function J is a sort of bowl-shaped function. And gradient descent looks like this, where in the inner loop of gradient descent you repeatedly update the parameters theta using that expression.

Now, in the rest of this video I'm going to keep using linear regression as the running example. But the ideas here, the ideas of Stochastic gradient descent, are fully general and also apply to other learning algorithms like logistic regression, neural networks and other algorithms that are based on running gradient descent on a specific training set.

So here's a picture of what gradient descent does: if the parameters are initialized to the point there, then as you run gradient descent, successive iterations will take the parameters to the global minimum. So it takes a trajectory that looks like that and heads pretty directly to the global minimum.

Now, the problem with gradient descent is that if m is large, then computing this derivative term can be very expensive, because it requires summing over all m examples. So suppose m is 300 million; in the United States there are about 300 million people, and so the United States census data may have on the order of that many records. If you want to fit a linear regression model to that data, then you need to sum over 300 million records, and that's very expensive. To give the algorithm a name, this particular version of gradient descent is also called Batch gradient descent. The term Batch refers to the fact that we're looking at all of the training examples at the same time; we call it sort of a batch of all of the training examples. It really isn't, maybe, the best name, but this is what machine learning people call this particular version of gradient descent.
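For readers following this transcript without the slides, the formulas the lecture refers to with "looks like this" are, in their standard linear-regression form (reconstructed here for reference, not shown verbatim in the transcript):

$$
h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j,
\qquad
J_{\text{train}}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)^2
$$

$$
\text{Repeat}\ \Bigl\{\ \theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)\, x_j^{(i)} \quad \text{(simultaneously for every } j\text{)}\ \Bigr\}
$$

The sum over all m examples inside the update is the "derivative term" discussed below.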
Now imagine that you really do have 300 million census records stored away on disk. The way this algorithm works is that you need to read through all 300 million records in order to compute this derivative term. You need to stream all of these records through the computer, because you can't store them all in memory, so you read through them and slowly accumulate the sum in order to compute the derivative. And having done all that work, you get to take just one step of gradient descent. Then you need to do the whole thing again: scan through all 300 million records and accumulate these sums, and having done all that work, you can take another little step of gradient descent. Then you do that again and take yet a third step, and so on. So it's going to take a long time to get the algorithm to converge.

In contrast to Batch gradient descent, what we are going to do is come up with a different algorithm that doesn't need to look at all the training examples in every single iteration, but needs to look at only a single training example per iteration.

Before moving on to the new algorithm, here's the Batch gradient descent algorithm written out again, with that being the cost function and that being the update; and of course this term here, the one used in the gradient descent rule, is the partial derivative with respect to the parameter theta j of our optimization objective, J train of theta.

Now let's look at the more efficient algorithm that scales better to large data sets. To work our way up to the algorithm called Stochastic gradient descent, let's write the cost function in a slightly different way: define the cost of the parameters theta with respect to a single training example (x(i), y(i)) to be equal to one half times the squared error that my hypothesis incurs on that example. So this cost function term really measures how well my hypothesis is doing on a single example (x(i), y(i)). Now you'll notice that the overall cost function J train can be written in this equivalent form: J train is just the average, over my m training examples, of the cost of my hypothesis on each example (x(i), y(i)).
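Written out, the per-example cost and the equivalent form of J train just described are (standard notation, reconstructed for reference):

$$
\operatorname{cost}\bigl(\theta, (x^{(i)}, y^{(i)})\bigr) = \tfrac{1}{2} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)^2,
\qquad
J_{\text{train}}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \operatorname{cost}\bigl(\theta, (x^{(i)}, y^{(i)})\bigr)
$$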
Armed with this view of the cost function for linear regression, let me now write out what Stochastic gradient descent does. The first step of Stochastic gradient descent is to randomly shuffle the data set. By that I just mean randomly shuffle, or randomly reorder, your m training examples. It's sort of a standard pre-processing step; we'll come back to it in a minute. But the main work of Stochastic gradient descent is then done in the following loop: we're going to repeat, for i equals 1 through m, so repeatedly scanning through my training examples, the following update. We're going to update the parameter theta j as theta j minus alpha times (h of x(i) minus y(i)) times x(i) sub j. And we're going to do this update, as usual, for all values of j.

Now, you'll notice that this term over here is exactly what we had inside the summation for Batch gradient descent. In fact, for those of you who know calculus, it's possible to show that that term here, that's this term here, is equal to the partial derivative, with respect to my parameter theta j, of the cost of the parameters theta on (x(i), y(i)), where cost is of course the quantity that was defined previously. And just to wrap up the algorithm, let me close my curly braces over there.

So what Stochastic gradient descent is doing is scanning through the training examples. First it's going to look at my first training example (x(1), y(1)), and then, looking at only this first example, it's going to take basically a little gradient descent step with respect to the cost of just this first training example. In other words, we're going to look at the first example and modify the parameters a little bit to fit just the first training example a little bit better. Having done this, inside this inner for-loop it's then going to go on to the second training example, and what it's going to do there is take another little step in parameter space, that is, modify the parameters just a little bit to try to fit just the second training example a little bit better. Having done that, it's then going to go on to my third training example and modify the parameters to try to fit just the third training example a little bit better, and so on, until you get through the entire training set. And then the outer repeat loop may cause it to take multiple passes over the entire training set.
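Putting the shuffle step and the per-example update together, here is a minimal sketch of the procedure in Python with NumPy (the course itself uses Octave). The function name, the learning rate, the number of passes, and the synthetic data at the bottom are illustrative choices rather than anything from the lecture, and X is assumed to already contain the intercept column x_0 = 1:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_passes=10, seed=0):
    """Sketch of the algorithm described above: shuffle the examples,
    then update theta one training example at a time."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_passes):            # outer "repeat" loop; 1 to 10 passes is typical
        order = rng.permutation(m)         # step 1: randomly shuffle the m examples
        for i in order:                    # step 2: for i = 1, ..., m
            error = X[i] @ theta - y[i]    # h_theta(x(i)) - y(i) for this single example
            theta -= alpha * error * X[i]  # theta_j := theta_j - alpha * error * x_j(i), for all j at once
    return theta

# Toy usage on synthetic data; the column of ones plays the role of x_0 = 1.
rng = np.random.default_rng(1)
X = np.c_[np.ones(1000), rng.normal(size=(1000, 2))]
y = X @ np.array([4.0, 3.0, -2.0]) + 0.1 * rng.normal(size=1000)
print(stochastic_gradient_descent(X, y))
```

Each inner-loop iteration touches only one row of X, so the parameters start moving without any full pass over the m examples.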
This view of Stochastic gradient descent also motivates why we wanted to start by randomly shuffling the data set: it ensures that when we scan through the training set, we visit the training examples in some sort of randomly sorted order. Depending on whether your data already came randomly sorted or whether it came originally sorted in some strange order, in practice this just speeds up the convergence of Stochastic gradient descent a little bit. But in the interest of safety, it's usually better to randomly shuffle the data set if you aren't sure whether it came to you in randomly sorted order.

More importantly, another view of Stochastic gradient descent is that it's a lot like Batch gradient descent, but rather than waiting to sum up these gradient terms over all m training examples, we take this gradient term using just one single training example and start making progress in improving the parameters already. So rather than waiting until we've scanned through all 300 million United States census records, say, rather than needing to scan through all of the training examples before we can modify the parameters a little bit and make progress towards the global minimum, with Stochastic gradient descent we just need to look at a single training example, and we're already starting to move the parameters towards the global minimum.

So, here's the algorithm written out again, where the first step is to randomly shuffle the data and the second step is where the real work is done: that's the update with respect to a single training example (x(i), y(i)).

So, let's see what this algorithm does to the parameters. Previously, we saw that when we are using Batch gradient descent, that is, the algorithm that looks at all the training examples at a time, Batch gradient descent will tend to take a reasonably straight-line trajectory to get to the global minimum, like that. In contrast, with Stochastic gradient descent every iteration is going to be much faster, because we don't need to sum over all the training examples; instead, every iteration is just trying to fit a single training example better. So, if we were to start Stochastic gradient descent at a point like that, the first iteration may take the parameters in that direction. Maybe the second iteration, looking at just the second example, just by chance is a bit more unlucky and actually heads in a bad direction with the parameters, like that. In the third iteration, where we try to modify the parameters to fit just the third training example better, maybe we'll end up heading in that direction. And then we'll look at the fourth training example and do the same thing.
Then the fifth example, the sixth example, the seventh, and so on. And as you run Stochastic gradient descent, what you find is that it will generally move the parameters in the direction of the global minimum, but not always; so it takes some more random-looking, circuitous path towards the global minimum. In fact, as you run Stochastic gradient descent, it doesn't actually converge in the same sense as Batch gradient descent does: what it ends up doing is wandering around continuously in some region close to the global minimum, rather than just getting to the global minimum and staying there. But in practice this isn't a problem, because so long as the parameters end up in some region pretty close to the global minimum, that will be a pretty good hypothesis. So usually, running Stochastic gradient descent, we get parameters near the global minimum, and that's good enough for essentially most practical purposes.

Just one final detail. In Stochastic gradient descent we had this outer repeat loop, which says to run the inner loop multiple times. So, how many times do we repeat this outer loop? Depending on the size of the training set, doing this loop just a single time may be enough, and up to maybe ten times may be typical, so we may end up repeating the inner loop anywhere from one to ten times. If we have a truly massive data set, like the US census example I've been talking about with 300 million examples, so that the inner loop runs for i equals 1 through 300 million, it's possible that by the time you've taken just a single pass through your data set you might already have a perfectly good hypothesis. In that case the inner loop might need to run only once, if m is very, very large. But in general, taking anywhere from one to ten passes through your data set is fairly common; it really depends on the size of your training set.

Contrast this with Batch gradient descent: with Batch gradient descent, after taking a pass through your entire training set, you would have taken just one single gradient descent step, one of those little baby steps where the parameters move just a small amount, and this is why Stochastic gradient descent can be much faster.
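To make that contrast concrete with the numbers used in this lecture (m of about 300 million, and one to ten passes), the rough bookkeeping per full pass over the data is:

$$
\text{Batch gradient descent: } 1 \text{ parameter update per pass},
\qquad
\text{Stochastic gradient descent: } m = 3 \times 10^{8} \text{ parameter updates per pass}.
$$

So even a single stochastic pass makes 300 million small updates to theta, whereas ten full batch passes would produce only ten updates, each one requiring a complete scan of the data.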
So, that was the Stochastic gradient descent algorithm. If you implement it, hopefully it will allow you to scale up many of your learning algorithms to much bigger data sets and get much better performance that way.