In this video, we'll talk about the normal equation, which for some linear regression problems gives us a much better way to solve for the optimal value of the parameters theta. Concretely, so far the algorithm we've been using for linear regression is gradient descent, where in order to minimize the cost function J of theta we run an iterative algorithm that takes many steps, multiple iterations of gradient descent, to converge to the global minimum. In contrast, the normal equation gives us a method to solve for theta analytically, so that rather than needing to run this iterative algorithm, we can instead just solve for the optimal value of theta all in one go: in basically one step you get to the optimal value right there.

It turns out the normal equation has some advantages and some disadvantages, but before we get to that and talk about when you should use it, let's get some intuition about what this method does. As a motivating example, let's take a very simplified cost function J of theta that's just a function of a real number theta. So for now, imagine that theta is just a scalar, a real value; it's just a number, rather than a vector. Imagine that we have a cost function J that's a quadratic function of this real-valued parameter theta, so J of theta looks like a parabola. Well, how do you minimize a quadratic function? For those of you who know a little bit of calculus, you may know that the way to minimize a function is to take its derivative and set the derivative equal to zero. So you take the derivative of J with respect to the parameter theta, you get some formula which I am not going to derive, you set that derivative equal to zero, and this allows you to solve for the value of theta that minimizes J of theta.
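As a small worked sketch of that one-dimensional case (my own illustration, not from the lecture slide), if the quadratic were

$$ J(\theta) = a\theta^2 + b\theta + c, \quad a > 0, \qquad \frac{dJ}{d\theta} = 2a\theta + b = 0 \;\;\Rightarrow\;\; \theta = -\frac{b}{2a}, $$

then setting the derivative to zero hands you the minimizing theta directly, with no iteration.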
That was the simpler case, where theta was just a real number. In the problem that we are actually interested in, theta is no longer just a real number; instead, it is this (n+1)-dimensional parameter vector, and the cost function J is a function of this vector, that is, of theta 0 through theta n, and the cost function is the squared-error cost function shown on the right. How do we minimize this cost function J? Calculus tells us that one way to do so is to take the partial derivative of J with respect to every parameter theta j in turn, and then set all of these to zero. If you do that and you solve for the values of theta 0, theta 1, up to theta n, then this gives you the values of theta that minimize the cost function J. If you actually work through the calculus and work through the solution for the parameters theta 0 through theta n, the derivation ends up being somewhat involved. What I am going to do in this video is not go through that derivation, which is kind of long and kind of involved; what I want to do is just tell you what you need to know in order to implement this process, so you can solve for the values of theta that correspond to where the partial derivatives are equal to zero, or equivalently, the values of theta that minimize the cost function J of theta. I realize that some of the comments I just made may make sense only to those of you who are already familiar with calculus.
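To make that condition explicit, using the squared-error cost function from the earlier videos (with the usual hypothesis h of theta of x equal to theta transpose x), the requirement is

$$ J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(\theta^T x^{(i)} - y^{(i)}\bigr)^2, \qquad \frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\bigl(\theta^T x^{(i)} - y^{(i)}\bigr)\, x_j^{(i)} = 0 \quad \text{for } j = 0, 1, \ldots, n, $$

that is, a system of n+1 equations in the n+1 unknowns theta 0 through theta n.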
But if you're less familiar with calculus, don't worry about it. I'm just going to tell you what you need to know in order to implement this algorithm and get it to work. For the running example, let's say that I have m = 4 training examples. In order to implement this normal equation, what I'm going to do is the following. I'm going to take my data set, so here are my four training examples; in this case, let's assume that these four examples are all the data I have. What I am going to do is take my data set and add an extra column that corresponds to my extra feature, x0, which always takes on the value of 1. Then I'm going to construct a matrix called X that basically contains all of the features from my training data. So, concretely, here are all my features, and we're going to take all those numbers and put them into this matrix X, just copying the data over one column at a time. Then I am going to do something similar for the y's: I am going to take the values that I'm trying to predict and stack them into a vector, which I'll call y. So X is going to be an m by (n+1)-dimensional matrix, and y is going to be an m-dimensional vector, where m is the number of training examples and n is the number of features; the extra dimension, n+1, comes from this extra feature x0 that I added.
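As a minimal Octave sketch of that construction, assuming some illustrative feature values and prices (placeholders standing in for the numbers on the slide, not taken from this transcript):

    % Sketch: building X and y from m = 4 training examples.
    % The feature values and prices below are illustrative placeholders.
    sizes    = [2104; 1416; 1534; 852];   % feature x1, e.g. size of the house
    bedrooms = [5; 3; 3; 2];              % feature x2
    prices   = [460; 232; 315; 178];      % values y we want to predict

    m = length(prices);
    X = [ones(m, 1) sizes bedrooms];      % prepend the extra column x0 = 1
    y = prices;                           % m-dimensional vector of labels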
Finally, if you take your matrix X and your vector y, and you just compute theta as X transpose X inverse times X transpose y, this gives you the value of theta that minimizes your cost function. There was a lot that happened on that slide, and I worked through it using one specific example of one data set. Let me write this out in a slightly more general form, and later on in this video let me explain this equation a little bit more, since it is not yet entirely clear how to do this in general. In the general case, let's say we have m training examples, x(1), y(1) up to x(m), y(m), and n features. So each training example x(i) looks like an (n+1)-dimensional feature vector. The way I'm going to construct the matrix X, which is also called the design matrix, is as follows. Each training example gives me an (n+1)-dimensional feature vector. To construct my design matrix X, I take the first training example, which is a vector, take its transpose so it ends up being a long, flat row, and make x(1) transpose the first row of my design matrix. Then I take my second training example, x(2), take the transpose of that, and put that as the second row of X, and so on, down until my last training example; take the transpose of that, and that's the last row of my matrix X. And so that makes my matrix X an m by (n+1)-dimensional matrix.
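In symbols, the construction and the formula just described are

$$ X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}, \qquad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m}, \qquad \theta = (X^T X)^{-1} X^T y. $$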
As a concrete example, let's say I have only one feature, really, only one feature other than x0, which is always equal to 1. So if my feature vectors x(i) consist of this 1, which is x0, and then one real feature, like maybe the size of the house, then my design matrix X would be as follows. For the first row, I basically take the first example and take its transpose, so I end up with 1 and then x1(1). For the second row, I end up with 1 and then x1(2), and so on, down to 1 and then x1(m). And thus, this will be an m by 2-dimensional matrix. So, that's how to construct the matrix X. And the vector y, and sometimes I might write an arrow on top to denote that it is a vector, but very often I'll just write this as y, either way, is obtained by taking all the labels, all the correct prices of houses in my training set, and just stacking them up into an m-dimensional vector, and that's y. Finally, having constructed the matrix X and the vector y, we then just compute theta as (X transpose X) inverse times X transpose y. I just want to make sure that this equation makes sense to you and that you know how to implement it. So, concretely, what is (X transpose X) inverse? Well, it is the inverse of the matrix X transpose X. Concretely, if you were to set A equal to X transpose times X, so X transpose is a matrix, and X transpose times X gives you another matrix, and we call that matrix A, then (X transpose X) inverse is just the inverse of that matrix A, let's say A inverse. And so that's how you compute this thing: you compute X transpose X and then you compute its inverse.
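Written out, the one-feature case looks like this:

$$ x^{(i)} = \begin{bmatrix} 1 \\ x_1^{(i)} \end{bmatrix} \quad\Rightarrow\quad X = \begin{bmatrix} 1 & x_1^{(1)} \\ 1 & x_1^{(2)} \\ \vdots & \vdots \\ 1 & x_1^{(m)} \end{bmatrix} \;\in \mathbb{R}^{m \times 2}, \qquad A = X^T X, \quad (X^T X)^{-1} = A^{-1}. $$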
We haven't yet talked about Octave; we'll do so in a later set of videos, but in the Octave programming language, or the very similar MATLAB programming language, the command to compute this quantity, X transpose X inverse times X transpose y, is as follows. In Octave, X prime is the notation that you use to denote X transpose. And so the expression that's boxed in red computes X transpose times X; pinv is a function for computing the inverse of a matrix, so this computes (X transpose X) inverse; and then you multiply that by X transpose, and you multiply that by y. So you end up computing that formula, which I didn't prove, but it is possible to show mathematically, even though I'm not going to do so here, that this formula gives you the optimal value of theta, in the sense that if you set theta equal to this, that's the value of theta that minimizes the cost function J of theta for linear regression. One last detail: in an earlier video, I talked about feature scaling and the idea of getting features to be on similar ranges of values of each other.
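Putting the pieces together, a minimal self-contained Octave sketch of that command, again assuming illustrative placeholder data rather than the lecture's own numbers, might look like this:

    % Minimal Octave sketch of the normal equation on toy data
    % (the values are illustrative placeholders, not from the lecture slide).
    sizes  = [2104; 1416; 1534; 852];     % single feature: size of the house
    prices = [460; 232; 315; 178];        % labels y

    m = length(prices);
    X = [ones(m, 1) sizes];               % m x 2 design matrix (x0 = 1 column)
    y = prices;

    theta = pinv(X' * X) * X' * y;        % (X^T X)^(-1) X^T y

    % Use the fitted parameters to predict a hypothetical 1650 sq-ft house:
    predicted = [1 1650] * theta;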
If you are using this normal equation method, then feature scaling isn't actually necessary, and it is actually okay if, say, some feature x1 is between zero and one, some feature x2 ranges from zero to one thousand, and some feature x3 ranges from zero to ten to the minus five. If you are using the normal equation method, this is okay and there is no need to do feature scaling, although of course if you are using gradient descent, then feature scaling is still important. Finally, when should you use gradient descent and when should you use the normal equation method? Here are some of their advantages and disadvantages. Let's say you have m training examples and n features. One disadvantage of gradient descent is that you need to choose the learning rate alpha, and often this means running it a few times with different learning rates alpha and then seeing what works best, so that is sort of extra work and extra hassle. Another disadvantage of gradient descent is that it needs many more iterations. So, depending on the details, that could make it slower, although there's more to the story, as we'll see in a second. As for the normal equation, you don't need to choose any learning rate alpha. So that makes it really convenient, makes it simple to implement: you just run it and it usually just works. And you don't need to iterate, so you don't need to plot J of theta or check for convergence or take all those extra steps. So far, the balance seems to favor the normal equation. Here are some disadvantages of the normal equation, and some advantages of gradient descent. Gradient descent works pretty well even when you have a very large number of features. So, even if you have millions of features, you can run gradient descent and it will be reasonably efficient.
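To make the "choose alpha and iterate many times" contrast concrete, here is a rough, self-contained Octave sketch of batch gradient descent for linear regression; the data, learning rate, and iteration count are illustrative choices that you would normally have to tune, not values from the lecture:

    % Rough Octave sketch of batch gradient descent for linear regression,
    % to contrast with the one-line normal equation above.
    sizes  = [2104; 1416; 1534; 852];     % placeholder data
    prices = [460; 232; 315; 178];
    m = length(prices);

    % Unlike the normal equation, gradient descent benefits from feature scaling:
    x1 = (sizes - mean(sizes)) / std(sizes);
    X = [ones(m, 1) x1];
    y = prices;

    alpha = 0.1;                          % learning rate you would have to tune
    theta = zeros(2, 1);
    for iter = 1:400                      % many iterations, unlike one step
        theta = theta - (alpha / m) * (X' * (X * theta - y));
    end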
It will do something reasonable. In contrast, with the normal equation, in order to solve for the parameters theta, we need to compute this term, (X transpose X) inverse. This matrix X transpose X is an n by n matrix, if you have n features, because if you look at the dimension of X transpose and the dimension of X, and you figure out what the dimension of the product is, the matrix X transpose X is an n by n matrix, where n is the number of features. And for most implementations, the cost of inverting a matrix grows roughly as the cube of the dimension of the matrix. So, computing this inverse costs roughly on the order of n cubed time. Sometimes it's slightly faster than n cubed, but it's close enough for our purposes. So if n, the number of features, is very large, then computing this quantity can be slow, and the normal equation method can actually be much slower. So if n is large, then I might usually use gradient descent, because we don't want to pay this order n cubed cost. But if n is relatively small, then the normal equation might give you a better way to solve for the parameters. What do small and large mean? Well, if n is on the order of a hundred, then inverting a hundred-by-hundred matrix is no problem by modern computing standards. If n is a thousand, I would still use the normal equation method; inverting a thousand-by-thousand matrix is actually really fast on a modern computer. If n is ten thousand, then I might start to wonder. Inverting a ten-thousand-by-ten-thousand matrix starts to get kind of slow, and I might then start to lean in the direction of gradient descent, but maybe not quite: at n equals ten thousand, you can still sort of invert a ten-thousand-by-ten-thousand matrix. But if it gets much bigger than that, then I would probably use gradient descent.
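If you want to see that roughly cubic growth for yourself, a quick and unscientific timing experiment in Octave might look like the sketch below; the matrix sizes are arbitrary, and the actual timings will depend entirely on your machine and linear algebra libraries:

    % Quick-and-dirty timing sketch: the time to invert an n x n matrix
    % grows roughly like n^3. Sizes and timings are illustrative only.
    for n = [500 1000 2000]
        A = rand(n, n) + n * eye(n);      % random matrix, kept well-conditioned
        tic;
        Ainv = pinv(A);
        fprintf('n = %4d: %.2f seconds\n', n, toc);
    end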
So if n equals ten to the sixth, with a million features, then inverting a million-by-million matrix is going to be very expensive, and I would definitely favor gradient descent if you have that many features. So exactly how large the set of features has to be before you switch to gradient descent is hard to pin down with a strict number, but for me it is usually around ten thousand that I might start to consider switching over to gradient descent, or maybe some other algorithms that we'll talk about later in this class. To summarize, so long as the number of features is not too large, the normal equation gives us a great alternative method to solve for the parameter theta. Concretely, so long as the number of features is less than about a thousand, I would usually use the normal equation method rather than gradient descent. To preview some ideas that we'll talk about later in this course, as we get to more complex learning algorithms, for example when we talk about classification algorithms like the logistic regression algorithm, we'll see that for those algorithms the normal equation method actually does not work, and we will have to resort to gradient descent. So gradient descent is a very useful algorithm to know, both for linear regression with a large number of features and for some of the other algorithms that we'll see in this course, because for them the normal equation method just doesn't apply and doesn't work. But for this specific model of linear regression, the normal equation can give you an alternative that can be much faster than gradient descent.
So, depending on the details of your algorithm, depending on the details of the problem and how many features you have, both of these algorithms are well worth knowing about.