In the previous video, we talked about the form of the hypothesis for linear regression with multiple features, or with multiple variables. In this video, let's talk about how to fit the parameters of that hypothesis. In particular, let's talk about how to use gradient descent for linear regression with multiple features.

To quickly summarize our notation, this is the form of the hypothesis for multivariate linear regression, where we've adopted the convention that x0 = 1. The parameters of this model are theta0 through theta n, but instead of thinking of this as n+1 separate parameters, which is valid, I'm instead going to think of the parameters as theta, where theta here is an (n+1)-dimensional vector. So I'm just going to think of the parameters of this model as themselves being a vector. Our cost function is J of theta0 through theta n, which is given by the usual sum of squared errors term. But again, instead of thinking of J as a function of these n+1 numbers, I'm going to more commonly write J as just a function of the parameter vector theta, so that theta here is a vector.

Here's what gradient descent looks like. We're going to repeatedly update each parameter theta j according to theta j minus alpha times this derivative term. And once again we just write this as J of theta, so theta j is updated as theta j minus the learning rate alpha times the partial derivative of the cost function with respect to the parameter theta j. Let's see what this looks like when we implement gradient descent and, in particular, let's go see what that partial derivative term looks like.

Here's what we have for gradient descent for the case when we had n = 1 feature. We had two separate update rules for the parameters theta0 and theta1, and hopefully these look familiar to you. This term here was of course the partial derivative of the cost function with respect to the parameter theta0, and similarly we had a different update rule for the parameter theta1. There's one little difference, which is that when we previously had only one feature, we would call that feature x(i), but now in our new notation we would of course call this x(i)1 to denote our one feature. So that was for when we had only one feature. Let's look at the new algorithm for when we have more than one feature, where the number of features n may be much larger than one.
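For reference, the quantities being described can be written out as follows. This is a sketch using standard conventions rather than a reproduction of the slide: m denotes the number of training examples (m is not defined in this excerpt), x0 = 1 is the convention adopted above, and alpha is the learning rate.

```latex
% Hypothesis, cost function, and gradient descent update for multivariate
% linear regression, assuming m training examples and the convention x_0^{(i)} = 1.
\[
h_\theta(x) = \theta^{T} x = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n
\]
\[
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^{2}
\]
\[
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
          = \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)},
\qquad j = 0, 1, \ldots, n
\]
```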
We get this update rule for gradient descent and, maybe for those of you who know calculus, if you take the definition of the cost function and take the partial derivative of the cost function J with respect to the parameter theta j, you'll find that that partial derivative is exactly the term I've drawn the blue box around. And if you implement this, you will get a working implementation of gradient descent for multivariate linear regression.

The last thing I want to do on this slide is give you a sense of why these new and old algorithms are essentially the same thing, or why they're both gradient descent algorithms. Let's consider a case where we have two features, or maybe more than two features, so we have three update rules, for the parameters theta0, theta1, theta2, and maybe other values of theta as well. If you look at the update rule for theta0, what you find is that this update rule here is the same as the update rule that we had previously for the case of n = 1. And the reason they are equivalent is, of course, that in our notation we had the x(i)0 = 1 convention, which is why the two terms I've drawn the magenta boxes around are equivalent. Similarly, if you look at the update rule for theta1, you find that this term here is equivalent to the update rule we previously had for theta1, where of course we're just using the new notation x(i)1 to denote our first feature. And now that we have more than one feature, we can have similar update rules for the other parameters like theta2 and so on.

There's a lot going on on this slide, so I definitely encourage you, if you need to, to pause the video and look at all the math on this slide slowly to make sure you understand everything that's going on here. But if you implement the algorithm written up here, then you will have a working implementation of linear regression with multiple features.
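As a concrete illustration of the update rule discussed above, here is a minimal Python/NumPy sketch of batch gradient descent for multivariate linear regression. It is not the course's own implementation; the names gradient_descent, X, y, theta, alpha, and num_iters are illustrative, the example data is made up, and X is assumed to already include the column of ones for the x0 = 1 convention.

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent for multivariate linear regression.

    X         -- (m, n+1) matrix of training examples; first column is all ones (x0 = 1)
    y         -- (m,) vector of target values
    theta     -- (n+1,) initial parameter vector
    alpha     -- learning rate
    num_iters -- number of gradient descent iterations
    """
    m = len(y)
    for _ in range(num_iters):
        predictions = X @ theta           # h_theta(x^(i)) for every training example
        errors = predictions - y          # h_theta(x^(i)) - y^(i)
        gradient = (X.T @ errors) / m     # partial derivative of J with respect to each theta_j
        theta = theta - alpha * gradient  # simultaneous update of all parameters
    return theta

# Illustrative (made-up) data: two features plus the leading column of ones for x0.
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1600.0, 3.0],
              [1.0, 2400.0, 4.0]])
y = np.array([400.0, 330.0, 369.0])
# alpha is tiny here only because these example features are not scaled.
theta = gradient_descent(X, y, theta=np.zeros(3), alpha=1e-8, num_iters=1000)
```

The vectorized expression (X.T @ errors) / m computes the partial derivative term for every theta j at once, which is what makes updating all of the parameters simultaneously a single assignment.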