1 00:00:00,454 --> 00:00:02,208 In previous videos, we talked 2 00:00:02,238 --> 00:00:04,581 about the gradient descent algorithm 3 00:00:04,581 --> 00:00:06,005 and talked about the linear 4 00:00:06,005 --> 00:00:09,071 regression model and the squared error cost function. 5 00:00:09,071 --> 00:00:10,822 In this video, we're going to 6 00:00:10,822 --> 00:00:12,695 put together gradient descent with 7 00:00:12,695 --> 00:00:14,672 our cost function, and that 8 00:00:14,672 --> 00:00:16,652 will give us an algorithm for 9 00:00:16,652 --> 00:00:19,431 linear regression for fitting a straight line to our data. 10 00:00:21,001 --> 00:00:22,743 So, this is 11 00:00:22,743 --> 00:00:25,077 what we worked out in the previous videos. 12 00:00:25,077 --> 00:00:27,095 That's our gradient descent algorithm, which 13 00:00:27,095 --> 00:00:29,197 should be familiar, and you 14 00:00:29,197 --> 00:00:31,199 see the linear regression model 15 00:00:31,199 --> 00:00:36,461 with our linear hypothesis and our squared error cost function. 16 00:00:36,461 --> 00:00:38,612 What we're going to do is apply 17 00:00:38,612 --> 00:00:42,288 gradient descent to minimize 18 00:00:44,519 --> 00:00:47,537 our squared error cost function. 19 00:00:47,891 --> 00:00:49,381 Now, in order to apply 20 00:00:49,381 --> 00:00:50,896 gradient descent, in order 21 00:00:50,896 --> 00:00:52,695 to write this piece of 22 00:00:52,695 --> 00:00:54,191 code, the key term 23 00:00:54,191 --> 00:00:58,384 we need is this derivative term over here. 24 00:00:59,692 --> 00:01:00,798 So, we need to figure out 25 00:01:00,798 --> 00:01:02,830 what is this partial derivative term, 26 00:01:02,830 --> 00:01:04,478 and plugging in the 27 00:01:04,478 --> 00:01:06,249 definition of the cost 28 00:01:06,249 --> 00:01:08,418 function J, this turns 29 00:01:08,418 --> 00:01:11,074 out to be the partial derivative with respect to theta j 30 00:01:12,613 --> 00:01:15,156 of 1 over 2m times the sum from i equals 1 through m of 31 00:01:15,156 --> 00:01:18,856 this squared error 32 00:01:20,456 --> 00:01:22,023 cost function term, h of x(i) minus y(i), squared, and all 33 00:01:22,023 --> 00:01:23,794 I did here was I just, 34 00:01:23,794 --> 00:01:25,538 you know, plugged in the definition of 35 00:01:25,538 --> 00:01:28,275 the cost function there, and simplifying 36 00:01:28,275 --> 00:01:30,563 a little bit more, this turns 37 00:01:30,563 --> 00:01:34,136 out to be equal to the 38 00:01:34,136 --> 00:01:36,983 partial derivative with respect to theta j of 1 over 2m times the sum from i equals 1 through m 39 00:01:36,983 --> 00:01:40,609 of theta zero plus theta one x(i) 40 00:01:41,163 --> 00:01:43,427 minus y(i), squared. 41 00:01:43,427 --> 00:01:44,777 And all I did there was take 42 00:01:44,777 --> 00:01:46,983 the definition of my hypothesis 43 00:01:46,983 --> 00:01:49,376 and plug that in there. 44 00:01:49,622 --> 00:01:51,669 And it turns out we need 45 00:01:51,669 --> 00:01:52,523 to figure out what is 46 00:01:52,523 --> 00:01:54,011 the partial derivative for two 47 00:01:54,011 --> 00:01:55,284 cases, for j equals 48 00:01:55,284 --> 00:01:57,006 0 and for j equals 1; we want 49 00:01:57,006 --> 00:01:58,547 to figure out what is this 50 00:01:58,547 --> 00:02:00,767 partial derivative for both the 51 00:02:00,767 --> 00:02:04,115 theta(0) case and the theta(1) case. 52 00:02:04,115 --> 00:02:07,012 And I'm just going to write out the answers. 53 00:02:07,012 --> 00:02:10,996 It turns out this first term simplifies 54 00:02:10,996 --> 00:02:14,218 to 1/m times the sum over 55 00:02:14,218 --> 00:02:16,446 my training set of just 56 00:02:16,446 --> 00:02:21,146 that, h of x(i) minus y(i). 
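Written out in the standard notation from the earlier videos, with h_theta(x) = theta_0 + theta_1 x and m training examples, the derivation described so far is:

```latex
\frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1)
  = \frac{\partial}{\partial \theta_j}\,\frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)^2
  = \frac{\partial}{\partial \theta_j}\,\frac{1}{2m}\sum_{i=1}^{m}\bigl(\theta_0+\theta_1 x^{(i)}-y^{(i)}\bigr)^2

\text{for } j=0:\qquad
\frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1)
  = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)
```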
57 00:02:21,146 --> 00:02:23,951 And for this term, the partial derivative 58 00:02:23,951 --> 00:02:26,186 with respect to theta(1), it turns 59 00:02:26,186 --> 00:02:34,954 out I get this term: h of x(i) minus y(i), times x(i). 60 00:02:35,031 --> 00:02:36,187 Okay. 61 00:02:36,402 --> 00:02:38,690 And computing these partial 62 00:02:38,690 --> 00:02:40,761 derivatives, so going from 63 00:02:40,761 --> 00:02:43,406 this equation to either 64 00:02:43,406 --> 00:02:46,414 of these equations down there, computing 65 00:02:46,414 --> 00:02:51,090 those partial derivative terms requires some multivariate calculus. 66 00:02:51,090 --> 00:02:53,118 If you know calculus, feel free 67 00:02:53,118 --> 00:02:54,825 to work through the derivations yourself 68 00:02:54,825 --> 00:02:57,060 and check that if you take the derivatives, 69 00:02:57,060 --> 00:02:59,853 you actually get the answers that I got. 70 00:02:59,853 --> 00:03:00,855 But if you are less 71 00:03:00,855 --> 00:03:02,611 familiar with calculus, don't 72 00:03:02,611 --> 00:03:04,235 worry about it, and it 73 00:03:04,235 --> 00:03:06,260 is fine to just take these equations 74 00:03:06,260 --> 00:03:08,025 as worked out, and you 75 00:03:08,025 --> 00:03:09,462 won't need to know calculus or 76 00:03:09,462 --> 00:03:10,340 anything like that in order to 77 00:03:10,340 --> 00:03:14,551 do the homework or to implement gradient descent and get that to work. 78 00:03:14,551 --> 00:03:16,520 So, with these definitions, 79 00:03:16,520 --> 00:03:18,156 or with what we've worked 80 00:03:18,156 --> 00:03:19,993 out to be the derivatives, which 81 00:03:19,993 --> 00:03:21,316 are really just the slope of 82 00:03:21,316 --> 00:03:22,889 the cost function J, we 83 00:03:22,889 --> 00:03:24,724 can now plug them back into 84 00:03:24,724 --> 00:03:27,157 our gradient descent algorithm. 85 00:03:27,157 --> 00:03:28,794 So here's gradient descent for 86 00:03:28,794 --> 00:03:30,309 linear regression, which is going 87 00:03:30,309 --> 00:03:32,971 to repeat until convergence: theta zero 88 00:03:32,971 --> 00:03:35,552 and theta one get updated as, 89 00:03:35,552 --> 00:03:37,163 you know, themselves minus alpha 90 00:03:37,163 --> 00:03:39,591 times the derivative term. 91 00:03:39,591 --> 00:03:43,202 So, this term here. 92 00:03:43,202 --> 00:03:46,854 So, here's our linear regression algorithm. 93 00:03:46,854 --> 00:03:52,696 This first term here, that 94 00:03:52,696 --> 00:03:54,495 term is, of course, just 95 00:03:54,495 --> 00:03:56,071 the partial derivative with respect to 96 00:03:56,071 --> 00:03:59,995 theta zero, that we worked out on the previous slide. 97 00:03:59,995 --> 00:04:03,454 And this second term here, 98 00:04:03,454 --> 00:04:04,199 that term is just 99 00:04:04,199 --> 00:04:05,947 the partial derivative with respect to 100 00:04:05,947 --> 00:04:11,188 theta one that we worked out on the previous slide. 101 00:04:11,188 --> 00:04:13,582 And just as a quick reminder, 102 00:04:13,582 --> 00:04:15,485 when implementing gradient descent, 103 00:04:15,485 --> 00:04:17,067 there's actually this detail: you 104 00:04:17,067 --> 00:04:19,248 know, you should be implementing 105 00:04:19,248 --> 00:04:24,452 it so that you update theta zero and theta one simultaneously. 106 00:04:24,452 --> 00:04:28,270 So, let's see how gradient descent works. 107 00:04:28,270 --> 00:04:29,447 One of the issues we saw with 108 00:04:29,447 --> 00:04:32,839 gradient descent is that it can be susceptible to local optima. 
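Putting the two derivatives and the simultaneous update together, here is a minimal Python sketch of the algorithm described above; the function name gradient_descent and the default values of alpha and num_iters are illustrative choices, not anything specified in the video.

```python
# A minimal sketch of gradient descent for univariate linear regression,
# following the update rules worked out above. Names and defaults are illustrative.

def gradient_descent(x, y, alpha=0.01, num_iters=1000):
    """Fit h(x) = theta0 + theta1 * x by repeated gradient descent updates."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0          # a common choice: start both parameters at zero
    for _ in range(num_iters):
        # Each derivative sums over all m training examples.
        errors = [theta0 + theta1 * x[i] - y[i] for i in range(m)]
        d_theta0 = sum(errors) / m
        d_theta1 = sum(errors[i] * x[i] for i in range(m)) / m
        # Simultaneous update: compute both new values before overwriting either one.
        temp0 = theta0 - alpha * d_theta0
        temp1 = theta1 - alpha * d_theta1
        theta0, theta1 = temp0, temp1
    return theta0, theta1
```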
109 00:04:32,839 --> 00:04:34,433 So, when I first explained gradient 110 00:04:34,449 --> 00:04:36,136 descent, I showed you this picture 111 00:04:36,136 --> 00:04:37,724 of it, you know, going downhill 112 00:04:37,724 --> 00:04:38,788 on the surface, and we 113 00:04:38,788 --> 00:04:40,120 saw how, depending on where 114 00:04:40,120 --> 00:04:42,872 you initialize, you can end up at different local optima. 115 00:04:42,872 --> 00:04:44,846 You know, you can end up here or here. 116 00:04:44,846 --> 00:04:46,958 But, it turns out that 117 00:04:46,958 --> 00:04:49,025 the cost function 118 00:04:49,025 --> 00:04:50,791 for linear regression 119 00:04:50,791 --> 00:04:52,754 is always going to be 120 00:04:52,754 --> 00:04:55,305 a bowl-shaped function like this. 121 00:04:55,305 --> 00:04:57,573 The technical term for this 122 00:04:57,573 --> 00:05:01,163 is that this is called a convex function. 123 00:05:02,825 --> 00:05:04,920 And I'm not going 124 00:05:04,920 --> 00:05:06,561 to give the formal definition for what 125 00:05:06,561 --> 00:05:09,557 is a convex function, c-o-n-v-e-x, but 126 00:05:09,557 --> 00:05:11,310 informally a convex function 127 00:05:11,310 --> 00:05:14,793 means a bowl-shaped function, you know, kind of like a bowl. 128 00:05:14,793 --> 00:05:18,006 And so, this function doesn't 129 00:05:18,006 --> 00:05:19,738 have any local optima, except 130 00:05:19,738 --> 00:05:22,433 for the one global optimum. 131 00:05:22,433 --> 00:05:24,265 And so gradient descent on 132 00:05:24,265 --> 00:05:25,590 this type of cost function, which 133 00:05:25,590 --> 00:05:27,695 you get whenever you're using linear 134 00:05:27,695 --> 00:05:29,201 regression, will always converge 135 00:05:29,201 --> 00:05:33,623 to the global optimum, because there are no other local optima besides the global optimum. 136 00:05:33,623 --> 00:05:37,272 So now, let's see this algorithm in action. 137 00:05:38,026 --> 00:05:40,085 As usual, here are plots of 138 00:05:40,085 --> 00:05:42,182 the hypothesis function and of 139 00:05:42,182 --> 00:05:45,024 my cost function J. 140 00:05:45,763 --> 00:05:47,011 And so, let's say I 141 00:05:47,011 --> 00:05:49,687 initialize my parameters at this value. 142 00:05:49,687 --> 00:05:51,652 You know, usually you 143 00:05:51,652 --> 00:05:53,590 initialize your parameters at zero: 144 00:05:53,590 --> 00:05:56,287 theta zero equals zero and theta one equals zero. 145 00:05:56,287 --> 00:05:58,331 For illustration in this 146 00:05:58,331 --> 00:05:59,948 specific presentation, I have 147 00:05:59,948 --> 00:06:02,615 initialized theta zero at 148 00:06:02,615 --> 00:06:06,831 about 900, and theta one at about minus 0.1, okay? 149 00:06:06,831 --> 00:06:09,791 And so, this corresponds to h 150 00:06:09,791 --> 00:06:12,022 of x equals, you know, 151 00:06:12,022 --> 00:06:15,859 900 minus 0.1x, which 152 00:06:15,859 --> 00:06:19,341 is this line, so out here on the cost function. 153 00:06:19,341 --> 00:06:20,491 Now if we take one 154 00:06:20,491 --> 00:06:22,163 step of gradient descent, we end 155 00:06:22,163 --> 00:06:24,298 up going from this point 156 00:06:24,298 --> 00:06:27,065 out here a little 157 00:06:27,065 --> 00:06:29,564 bit down and to the left, 158 00:06:29,564 --> 00:06:31,732 to that second point over there. 159 00:06:31,732 --> 00:06:35,279 And, you notice that my line changed a little bit. 
160 00:06:35,279 --> 00:06:36,547 And, as I take another step 161 00:06:36,547 --> 00:06:41,425 of gradient descent, my line on the left will change. 162 00:06:41,425 --> 00:06:42,376 Right. 163 00:06:42,376 --> 00:06:43,804 And I have also 164 00:06:43,804 --> 00:06:47,544 moved to a new point on my cost function. 165 00:06:47,544 --> 00:06:48,745 And as I take further steps 166 00:06:48,745 --> 00:06:50,697 of gradient descent, I'm going 167 00:06:50,697 --> 00:06:53,058 down in cost, right, so 168 00:06:53,058 --> 00:06:55,079 my parameters are following 169 00:06:55,079 --> 00:06:58,062 this trajectory, and if 170 00:06:58,062 --> 00:07:00,368 you look on the left, this corresponds 171 00:07:00,368 --> 00:07:04,025 to hypotheses that seem 172 00:07:04,025 --> 00:07:04,912 to be getting 173 00:07:04,912 --> 00:07:06,429 better and better fits to the 174 00:07:06,429 --> 00:07:09,987 data, until eventually 175 00:07:09,987 --> 00:07:12,663 I have now wound up at the global minimum. 176 00:07:12,663 --> 00:07:16,189 And this global minimum corresponds to 177 00:07:16,189 --> 00:07:20,506 this hypothesis, which gives me a good fit to the data. 178 00:07:20,922 --> 00:07:23,605 And so that's gradient 179 00:07:23,605 --> 00:07:24,969 descent, and we've just run 180 00:07:24,969 --> 00:07:27,302 it and gotten a good 181 00:07:27,302 --> 00:07:31,359 fit to my data set of housing prices. 182 00:07:31,359 --> 00:07:34,108 And you can now use it to predict. 183 00:07:34,108 --> 00:07:35,284 You know, if your friend has a 184 00:07:35,284 --> 00:07:36,452 house with a 185 00:07:36,452 --> 00:07:39,116 size of 1250 square feet, you 186 00:07:39,116 --> 00:07:40,684 can now read off the value 187 00:07:40,684 --> 00:07:42,090 and tell them that, I don't 188 00:07:42,090 --> 00:07:43,188 know, maybe they can get 189 00:07:43,188 --> 00:07:47,159 $350,000 for their house. 190 00:07:48,606 --> 00:07:49,982 Finally, just to give 191 00:07:49,982 --> 00:07:51,930 this another name, it turns out 192 00:07:51,930 --> 00:07:52,991 that the algorithm that we 193 00:07:52,991 --> 00:07:55,030 just went over is sometimes 194 00:07:55,030 --> 00:07:57,074 called batch gradient descent. 195 00:07:57,074 --> 00:07:58,781 And it turns out in machine 196 00:07:58,781 --> 00:08:00,203 learning, I feel like us machine 197 00:08:00,203 --> 00:08:02,050 learning people, we're not always 198 00:08:02,050 --> 00:08:04,381 that great at giving names to algorithms. 199 00:08:04,381 --> 00:08:06,634 But the term batch gradient descent 200 00:08:06,634 --> 00:08:08,212 refers to the 201 00:08:08,212 --> 00:08:09,551 fact that, in every step 202 00:08:09,551 --> 00:08:11,164 of gradient descent, we're looking 203 00:08:11,164 --> 00:08:13,649 at all of the training examples. 204 00:08:13,649 --> 00:08:15,783 So, in gradient descent, you 205 00:08:15,783 --> 00:08:18,875 know, when computing the derivatives, we're computing 206 00:08:18,875 --> 00:08:21,307 these sums. 207 00:08:21,307 --> 00:08:22,210 So, in every separate step of 208 00:08:22,210 --> 00:08:23,510 gradient descent, we end up 209 00:08:23,510 --> 00:08:25,278 computing something like this, which 210 00:08:25,278 --> 00:08:28,434 sums over our m training examples. 
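To make that point concrete, here is a rough vectorized sketch using NumPy in which every single update sums over the entire batch of m examples at once; the tiny dataset, the learning rate, and the iteration count below are made up for illustration and are not the course's actual housing data.

```python
# A vectorized sketch of the same batch update with NumPy: each iteration
# touches all m training examples at once. The dataset below is hypothetical.
import numpy as np

x = np.array([1.0, 1.5, 2.0, 3.0])          # sizes in thousands of square feet (made up)
y = np.array([300.0, 320.0, 360.0, 420.0])  # prices in thousands of dollars (made up)
m = len(x)

theta0, theta1, alpha = 0.0, 0.0, 0.1
for _ in range(5000):
    errors = theta0 + theta1 * x - y        # h(x(i)) - y(i) for all m examples at once
    theta0, theta1 = (theta0 - alpha * errors.mean(),          # simultaneous update
                      theta1 - alpha * (errors * x).mean())

print(theta0 + theta1 * 1.25)               # predicted price for a 1,250 square foot house
```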
211 00:08:28,434 --> 00:08:29,835 And so the term batch gradient 212 00:08:29,835 --> 00:08:31,195 descent refers to the fact 213 00:08:31,195 --> 00:08:33,193 that we're looking at the entire batch 214 00:08:33,193 --> 00:08:34,559 of training examples, and again, 215 00:08:34,559 --> 00:08:35,742 this is really, really not 216 00:08:35,742 --> 00:08:36,915 a great name, but this is 217 00:08:36,915 --> 00:08:39,444 what machine learning people call it. 218 00:08:39,444 --> 00:08:40,958 And it turns out there are 219 00:08:40,958 --> 00:08:42,665 sometimes other versions of 220 00:08:42,665 --> 00:08:43,918 gradient descent that are not 221 00:08:43,918 --> 00:08:45,969 batch versions, but instead do 222 00:08:45,969 --> 00:08:47,752 not look at the entire training set 223 00:08:47,752 --> 00:08:49,722 but look at small subsets 224 00:08:49,722 --> 00:08:51,529 of the training set at a time, 225 00:08:51,529 --> 00:08:54,874 and we'll talk about those versions later in this course as well. 226 00:08:54,874 --> 00:08:56,195 But for now, using the algorithm 227 00:08:56,195 --> 00:08:57,410 you just learned, which is 228 00:08:57,410 --> 00:08:58,759 batch gradient descent, you 229 00:08:58,759 --> 00:09:01,159 now know how to implement 230 00:09:01,159 --> 00:09:04,149 gradient descent for linear regression. 231 00:09:05,856 --> 00:09:08,672 So that's linear regression with gradient descent. 232 00:09:09,349 --> 00:09:11,747 If you've seen advanced linear algebra 233 00:09:11,747 --> 00:09:12,672 before, so some of you may 234 00:09:12,672 --> 00:09:14,206 have taken a class in advanced 235 00:09:14,206 --> 00:09:16,279 linear algebra, you might 236 00:09:16,295 --> 00:09:17,678 know that there exists a solution 237 00:09:17,678 --> 00:09:19,754 for numerically solving for the 238 00:09:19,754 --> 00:09:20,914 minimum of the cost function 239 00:09:20,914 --> 00:09:22,561 J without needing to 240 00:09:22,561 --> 00:09:25,604 use an iterative algorithm like gradient descent. 241 00:09:25,604 --> 00:09:27,154 Later in this course we will 242 00:09:27,154 --> 00:09:28,098 talk about that method as 243 00:09:28,098 --> 00:09:29,753 well, which just solves for the 244 00:09:29,753 --> 00:09:31,076 minimum of the cost function J without 245 00:09:31,076 --> 00:09:33,791 needing these multiple steps of gradient descent. 246 00:09:33,791 --> 00:09:37,656 That other method is called the normal equations method. 247 00:09:37,656 --> 00:09:39,167 But in case you 248 00:09:39,167 --> 00:09:40,141 have heard of that method, it turns 249 00:09:40,141 --> 00:09:41,622 out gradient descent will 250 00:09:41,622 --> 00:09:43,774 scale better to larger data 251 00:09:43,774 --> 00:09:45,008 sets than the normal equations 252 00:09:45,008 --> 00:09:47,315 method and, now that 253 00:09:47,315 --> 00:09:48,756 we know about gradient descent, we'll 254 00:09:48,756 --> 00:09:49,922 be able to use it in 255 00:09:49,922 --> 00:09:51,387 lots of different contexts, and we'll 256 00:09:51,387 --> 00:09:54,849 use it in lots of different machine learning problems as well. 257 00:09:54,849 --> 00:09:57,138 So, congrats on learning 258 00:09:57,138 --> 00:10:00,317 about your first machine learning algorithm. 259 00:10:00,317 --> 00:10:02,508 We'll later have exercises in 260 00:10:02,508 --> 00:10:03,659 which we'll ask you to 261 00:10:03,659 --> 00:10:05,068 implement gradient descent and 262 00:10:05,068 --> 00:10:07,071 hopefully see these algorithms work for yourselves. 
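For the curious, a standard way to sketch that closed-form idea in Python is the least-squares solve below; the function name normal_equation is an illustrative label, and this is not code taken from the course.

```python
# A sketch of the normal equations idea mentioned above: solve for the
# cost-minimizing parameters in closed form instead of iterating.
import numpy as np

def normal_equation(x, y):
    """Return (theta0, theta1) that exactly minimize the squared error cost."""
    X = np.column_stack([np.ones_like(x), x])   # design matrix with a column of ones
    # Solve (X^T X) theta = X^T y; np.linalg.solve avoids forming an explicit inverse.
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    return theta[0], theta[1]
```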
263 00:10:07,071 --> 00:10:08,955 But before that, I first 264 00:10:08,955 --> 00:10:10,587 want to tell you, in 265 00:10:10,587 --> 00:10:11,919 the next set of videos, 266 00:10:11,919 --> 00:10:13,223 about 267 00:10:13,223 --> 00:10:15,724 a generalization of the gradient descent 268 00:10:15,724 --> 00:10:16,935 algorithm that will make 269 00:10:16,935 --> 00:10:18,403 it much more powerful, and I 270 00:10:18,403 --> 99:59:59,000 guess I will tell you about that in the next video.