Now, let's actually summarize the entire gradient descent algorithm for multiple regression, stepping very carefully through every step of this algorithm. In particular, the first thing we're gonna do is initialize all of our different parameters to zero at the first iteration. Or you could initialize them randomly, or you could do something a bit smarter. But let's just assume that they're all initialized to zero, and we're gonna start our iteration counter at one.

And then what we're doing is we're saying: while we're not converged. And what was the condition we talked about before, in our simple regression module, for not being converged? We said that while the magnitude of the gradient of our residual sum of squares is sufficiently large, larger than some tolerance epsilon, then we were going to keep going.

So what is the magnitude of the gradient of the residual sum of squares? Just to be very explicit, it's the square root of a sum of squares. So what are the elements of the gradient of the residual sum of squares? Well, it's a vector where every element is the partial derivative with respect to some parameter. I'm gonna refer to that as partial j, okay? So when I take the magnitude of the vector, I multiply the vector by its transpose and take the square root. That's equivalent to saying I'm gonna sum up the partial derivative with respect to the first feature, squared, plus all the terms up to the partial derivative with respect to the capital-Dth feature, squared (sorry, the indexing really starts with zero), and then I take the square root.

So if the result of this is greater than epsilon, then I'm gonna continue my gradient descent iterates. If it's less than epsilon, then I'm gonna stop.

But let's talk about what the actual iterates are. Well, for every feature in my multiple regression model, the first thing I'm going to do is calculate the partial derivative with respect to the jth feature. And I'm going to store that, because it's going to be useful both in taking the gradient step and in monitoring convergence, as I wrote here.
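As a concrete illustration of the check just described, here is a minimal NumPy sketch of computing every partial of the residual sum of squares and their combined magnitude. The function names, the matrix H (whose row i stacks the feature values h_0(x_i) through h_D(x_i)), and the explicit form of the partials (minus two times the sum over data points of the jth feature value times the residual) are assumptions of this sketch rather than notation taken from the slides.

```python
import numpy as np

def rss_gradient(H, y, w):
    """Gradient of the residual sum of squares.

    H : (N, D+1) matrix; row i holds the feature values h_0(x_i), ..., h_D(x_i)
    y : (N,) vector of observed outputs
    w : (D+1,) vector of current coefficients
    Entry j of the result is the partial derivative of RSS with respect to w_j.
    """
    predictions = H @ w            # predicted outputs y_hat_i
    errors = y - predictions       # residuals y_i - y_hat_i
    return -2.0 * (H.T @ errors)   # assumed form: -2 * sum_i h_j(x_i) * (y_i - y_hat_i)

def gradient_magnitude(H, y, w):
    """Square root of the sum of squared partials."""
    g = rss_gradient(H, y, w)
    return np.sqrt(g @ g)          # same as multiplying the vector by its transpose, then square root
```

The magnitude computed this way is exactly the quantity compared against epsilon in the while condition.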
So this jth partial, we derived it on the previous slide, and it has the form we saw there. My gradient step then takes that jth coefficient at time t and subtracts my step size times that partial derivative. And then once I cycle through all the features in my model, I'm gonna increment this t counter, and I'm gonna check whether I've achieved convergence or not. If not, I'm gonna loop through again, and I'm gonna do this until this condition is met, until the magnitude of my gradient is less than epsilon.

Okay, so I wanna take a few moments to talk about this gradient descent algorithm, because we presented it specifically in the context of multiple regression, and also for the simple regression case. But this algorithm is really, really important. It's probably the most widely used machine learning algorithm out there, and we're gonna see it when we talk about classification, all the way to talking about deep learning. So even though we presented this in the context of multiple regression, this is a really, really useful algorithm, actually an extremely useful algorithm, as the title of this slide shows.
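Putting the whole loop together, the following is a rough end-to-end sketch of the algorithm summarized in this video, under the same assumptions as the sketch above: coefficients start at zero, the iteration counter starts at one, and updates continue while the gradient magnitude exceeds the tolerance epsilon. The function name, argument names, and the max_iter safeguard are illustrative choices, not notation from the lecture.

```python
import numpy as np

def multiple_regression_gradient_descent(H, y, step_size, tolerance, max_iter=100000):
    """Gradient descent for multiple regression, as outlined in the lecture.

    Loops until the magnitude of the RSS gradient drops below `tolerance`
    (the epsilon in the lecture) or until `max_iter` iterations have run.
    """
    w = np.zeros(H.shape[1])                   # initialize all coefficients w_0, ..., w_D to zero
    for t in range(1, max_iter + 1):           # t is the iteration counter, starting at one
        gradient = -2.0 * (H.T @ (y - H @ w))  # all partials of RSS at once
        if np.sqrt(gradient @ gradient) < tolerance:
            break                              # converged: gradient magnitude < epsilon
        w = w - step_size * gradient           # w_j gets w_j minus step size times partial j
    return w
```

The convergence test runs before each update so that, once the gradient magnitude falls below epsilon, the current coefficients are returned without taking another step.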