In this video, we'll figure out a slightly simpler way to write the cost function than we have been using so far, and we'll also figure out how to apply gradient descent to fit the parameters of logistic regression. So by the end of this video, you'll know how to implement a fully working version of logistic regression.

Here's our cost function for logistic regression. Our overall cost function is 1 over m times the sum, over the training set, of the cost of making different predictions on the different examples with labels y(i). And this is the cost for a single example that we worked out earlier. I just want to remind you that for classification problems in our training set, and in fact even for examples not in our training set, y is always equal to 0 or 1, right? That's part of the mathematical definition of y.

Because y is either 0 or 1, we'll be able to come up with a simpler way to write this cost function. In particular, rather than writing out this cost function on two separate lines, with two separate cases for y equals 1 and y equals 0, I'm going to show you a way to take these two lines and compress them into one equation. This will make it more convenient to write out the cost function and derive gradient descent.

Concretely, we can write out the cost function as follows. We'll say the cost of h(x), y is minus y times log of h(x), minus (1 minus y) times log of (1 minus h(x)). And I'll show you in a second that this expression, or this equation, is an equivalent and more compact way of writing out the definition of the cost function that we had up here. Let's see why that's the case.
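For reference, a clean way to write both forms of this per-example cost (the two-case definition worked out in the previous video, and the compact one-line version just described) is:

$$
\mathrm{Cost}(h_\theta(x), y) =
\begin{cases}
-\log\big(h_\theta(x)\big) & \text{if } y = 1 \\
-\log\big(1 - h_\theta(x)\big) & \text{if } y = 0
\end{cases}
$$

$$
\mathrm{Cost}(h_\theta(x), y) = -\,y \log\big(h_\theta(x)\big) - (1 - y)\log\big(1 - h_\theta(x)\big)
$$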
We know that there are only two possible cases: y must be 0 or 1. So let's suppose y equals 1. If y is equal to 1, then this equation is saying that the cost is equal to... well, if y is equal to 1, then this thing here is equal to 1, and 1 minus y is going to be equal to 0, right? If y is equal to 1, then 1 minus y is 1 minus 1, which is therefore 0. So the second term gets multiplied by 0 and goes away, and we're left with only this first term, which is minus y times log of h(x). Y is 1, so that's equal to minus log of h(x). And this is exactly what we have up here for the case y equals 1.

The other case is if y is equal to 0. If that's the case, then this way of writing the cost function says that if y is equal to 0, this term here will be equal to 0, whereas 1 minus y, if y equals 0, will be equal to 1, because 1 minus y becomes 1 minus 0, which is just 1. And so the cost function simplifies to just this last term here, right? Because the first term over here gets multiplied by 0, and so it disappears. We're just left with this last term, which is minus log of (1 minus h(x)). And you can verify that this term is exactly what we had for the case y equals 0.

So this shows that this definition of the cost is just a more compact way of taking both of these expressions, the case y equals 1 and the case y equals 0, and writing them in a more convenient form with just one line.
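As a quick sanity check, here is a minimal sketch of that compact per-example cost (the function name is just illustrative, and NumPy is assumed), confirming that it reduces to minus log of h when y equals 1 and to minus log of (1 minus h) when y equals 0:

```python
import numpy as np

def example_cost(h, y):
    # Compact one-line cost: -y*log(h) - (1 - y)*log(1 - h)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

h = 0.7  # some predicted probability h_theta(x), strictly between 0 and 1
print(np.isclose(example_cost(h, 1), -np.log(h)))      # True: the y = 1 case
print(np.isclose(example_cost(h, 0), -np.log(1 - h)))  # True: the y = 0 case
```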
We can therefore write our overall cost function for logistic regression as follows. It is 1 over m times the sum of these per-example costs, and plugging in the definition of the cost that we worked out earlier, we end up with this; we have just brought the minus sign outside.

And why do we choose this particular cost function, when it looks like there could be other cost functions we could have chosen? Although I won't have time to go into great detail on this in this course, this cost function can be derived from statistics using the principle of maximum likelihood estimation, which is an idea in statistics for how to efficiently find parameters for different models. It also has the nice property that it is convex. So this is the cost function that essentially everyone uses when fitting logistic regression models. If you don't understand the terms I just said, and you don't know what the principle of maximum likelihood estimation is, don't worry about it; there is just a deeper rationale and justification behind this particular cost function than I have time to go into in this class.

Given this cost function, in order to fit the parameters, what we're going to do is try to find the parameters theta that minimize J of theta. So if we try to minimize this, that would give us some set of parameters theta. Finally, if we're given a new example with some set of features x, we can then take the theta that we fit to our training set and output our prediction as this. And just to remind you, I'm going to interpret the output of my hypothesis as the probability that y is equal to 1, given the input x and parameterized by theta. But think of it as just: my hypothesis is estimating the probability that y is equal to 1.
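A minimal sketch of this hypothesis and cost function, assuming NumPy, a design matrix X whose first column is all ones for the intercept term, and a label vector y of 0s and 1s (the function and variable names here are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    # h_theta(x) = 1 / (1 + e^(-theta^T x)), evaluated for every row of X at once
    return sigmoid(X @ theta)

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum_i [ y_i * log(h_i) + (1 - y_i) * log(1 - h_i) ]
    m = len(y)
    h = hypothesis(theta, X)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```

The value of the hypothesis for a new example is then read as the estimated probability that y equals 1 for that example.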
So all that remains to be done is to figure out how to actually minimize J of theta as a function of theta, so that we can actually fit the parameters to our training set. The way we're going to minimize the cost function is using gradient descent. Here's our cost function, and we want to minimize it as a function of theta. Here's our usual template for gradient descent, where we repeatedly update each parameter as itself minus a learning rate alpha times this derivative term.

If you know some calculus, feel free to take this term and try to compute the derivative yourself, and see if you can simplify it to the same answer that I get. But even if you don't know calculus, don't worry about it. If you actually compute this, what you get is this equation, and I'll just write it out here: the sum from i equals 1 through m of, essentially, the error (h of x(i) minus y(i)) times x(i) subscript j. So if you take this partial derivative term and plug it back in here, we can then write out our gradient descent algorithm as follows; all I've done is take the derivative term from the previous line and plug it in there.

So if you have n features, you would have a parameter vector theta, with parameters theta 0, theta 1, theta 2, down to theta n, and you would use this update to simultaneously update all of your values of theta.
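Here is a minimal, deliberately loop-based sketch of one such update (illustrative names, reusing the hypothesis function sketched above; depending on convention, the 1/m factor can instead be absorbed into the learning rate alpha):

```python
def gradient_descent_step(theta, X, y, alpha):
    # One simultaneous update of every parameter, with explicit loops to mirror the
    # formula: theta_j := theta_j - alpha * (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
    m, n_plus_1 = X.shape
    h = hypothesis(theta, X)     # predictions for all m training examples
    new_theta = theta.copy()     # write into a copy so every theta_j uses the old theta
    for j in range(n_plus_1):
        grad_j = sum((h[i] - y[i]) * X[i, j] for i in range(m)) / m
        new_theta[j] = theta[j] - alpha * grad_j
    return new_theta
```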
Now, if you take this update rule and compare it to what we were doing for linear regression, you might be surprised to realize that, well, this equation is exactly what we had for linear regression. In fact, if you look at the earlier videos and look at the update rule, the gradient descent rule for linear regression, it looks exactly like what I drew here inside the blue box. So are linear regression and logistic regression different algorithms or not? Well, this is resolved by observing that for logistic regression, what has changed is the definition of the hypothesis. Whereas for linear regression we had h(x) equals theta transpose x, the definition of h(x) has now changed and is instead 1 over (1 plus e to the negative theta transpose x). So even though the update rule looks cosmetically identical, because the definition of the hypothesis has changed, this is actually not the same thing as gradient descent for linear regression.

In an earlier video, when we were talking about gradient descent for linear regression, we talked about how to monitor gradient descent to make sure that it is converging. I usually apply that same method to logistic regression too, to monitor gradient descent and make sure it's converging correctly, and hopefully you can figure out how to apply that technique to logistic regression yourself.

When implementing logistic regression with gradient descent, we have all of these different parameter values, theta 0 down to theta n, that we need to update using this expression. One thing we could do is have a for loop, for i equals 0 to n, or i equals 1 to n plus 1, and update each of these parameter values in turn. But of course, rather than using a for loop, ideally we would use a vectorized implementation, so that the vectorized implementation updates all of these n plus 1 parameters in one fell swoop.
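Tying the monitoring point and the update rule together, here is a minimal sketch (again with illustrative names, reusing the functions sketched earlier) of running the update repeatedly while recording J of theta on every iteration, so that convergence can be checked:

```python
def run_gradient_descent(X, y, alpha=0.1, iterations=400):
    theta = np.zeros(X.shape[1])   # theta_0 through theta_n, initialized to zero
    cost_history = []
    for _ in range(iterations):
        theta = gradient_descent_step(theta, X, y, alpha)
        cost_history.append(cost(theta, X, y))  # J(theta) should decrease every iteration
    return theta, cost_history
```

Plotting cost_history against the iteration number is the same convergence check described for linear regression; if J of theta ever starts increasing, the learning rate alpha is probably too large.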
And to check your own understanding, you might see if you can figure out how to do the vectorized implementation of this algorithm as well. So now you know how to implement gradient descent for logistic regression.

There was one last idea that we talked about earlier, which was feature scaling. We saw how feature scaling can help gradient descent converge faster for linear regression. The idea of feature scaling also applies to gradient descent for logistic regression: if you have features that are on very different scales, then applying feature scaling can also make gradient descent run faster for logistic regression.

So, that's it. You now know how to implement logistic regression, and this is a very powerful, and probably even the most widely used, classification algorithm in the world. And you now know how to get it to work for yourself.
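Returning briefly to that feature-scaling point, here is a minimal sketch (one common choice: subtract each feature's mean and divide by its standard deviation; the names are illustrative) of scaling the non-intercept columns of X before running gradient descent:

```python
def scale_features(X):
    # x_j := (x_j - mean_j) / std_j for every column except the intercept column of ones
    X_scaled = X.astype(float)
    mu = X_scaled[:, 1:].mean(axis=0)
    sigma = X_scaled[:, 1:].std(axis=0)
    X_scaled[:, 1:] = (X_scaled[:, 1:] - mu) / sigma
    return X_scaled, mu, sigma   # keep mu and sigma to scale new examples the same way
```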