In this video we'll talk about how to fit the parameters theta for logistic regression. In particular, I'd like to define the optimization objective, or the cost function, that we'll use to fit the parameters.

Here's the supervised learning problem of fitting a logistic regression model. We have a training set of m training examples, and as usual each of our examples is represented by a feature vector that's n+1 dimensional. And as usual we have x0 equals 1: our first feature, or our zeroth feature, is always equal to 1. Because this is a classification problem, our training set has the property that every label y is either 0 or 1. This is the hypothesis, and the parameters of the hypothesis are this theta over here. The question I want to talk about is: given this training set, how do we choose, or how do we fit, the parameters theta?

Back when we were developing the linear regression model, we used the following cost function. I've written this slightly differently: instead of 1/(2m), I've taken the 1/2 and put it inside the summation instead. Now I want to use an alternative way of writing out this cost function. Instead of writing out this squared error term here, let's write cost of h(x) comma y, and I'm going to define that term, cost(h(x), y), to be equal to this: it's just equal to one half of the squared error. So now we can see more clearly that the cost function is a sum over my training set; it is 1/m times the sum over my training set of this cost term. And to simplify this equation a little bit more, it's going to be convenient to get rid of those superscripts.
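For reference, here is that rewritten cost function in symbols, matching the description above; the superscript (i) indexes training examples, which is what we're about to drop:

```latex
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\!\left(h_\theta(x^{(i)}),\, y^{(i)}\right),
\qquad
\mathrm{Cost}\!\left(h_\theta(x),\, y\right) = \tfrac{1}{2}\left(h_\theta(x) - y\right)^2
```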
So we just define cost(h(x), y) to be equal to one half of this squared error. The interpretation of this cost function is that it's the price I want my learning algorithm to have to pay if it outputs the prediction h(x) and the actual label was y. So let's just cross off those superscripts. All right. And no surprise: for linear regression, the cost we've defined is one half times the squared difference between what we predicted and the actual value we observed for y. Now, this cost function worked fine for linear regression, but here we're interested in logistic regression. If we could minimize this cost function plugged into J here, that would work okay. But it turns out that if we use this particular cost function, J would be a non-convex function of the parameters theta. Here's what I mean by non-convex. We have some cost function J(theta), and for logistic regression this function h here has a nonlinearity: it's 1 over 1 plus e to the negative theta transpose x, so it's a pretty complicated nonlinear function. If you take the sigmoid function and plug it in here, then take this cost function and plug it in there, and then plot what J(theta) looks like, you find that J(theta) can look like a function with many local optima. The formal term for this is a non-convex function. And you can kind of tell that if you were to run gradient descent on this sort of function, it is not guaranteed to converge to the global minimum.
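To make this concrete, here is a small numerical sketch (mine, not from the lecture: the two-example dataset is made up purely for illustration, and NumPy is assumed) that scans the squared-error cost of a one-parameter sigmoid model over a grid of theta values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical one-feature training set, chosen only to expose the problem.
x = np.array([1.0, 10.0])
y = np.array([0.0, 1.0])

def J_squared_error(theta):
    # Squared-error cost with a sigmoid hypothesis: (1/m) * sum of (1/2)(h - y)^2.
    h = sigmoid(theta * x)
    return np.mean(0.5 * (h - y) ** 2)

# On this toy data the printed values rise to an interior bump (a local
# maximum near theta = -0.5) before falling to the minimum: the curve is not
# convex, and gradient descent started to the left of the bump drifts away
# from the global minimum instead of finding it.
for t in np.linspace(-3.0, 3.0, 13):
    print(f"theta = {t:5.2f}   J = {J_squared_error(t):.4f}")
```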
Whereas, in contrast, what we would like is a cost function J(theta) that is convex: a single bowl-shaped function, so that if we run gradient descent, we are guaranteed it converges to the global minimum. The problem with the squared cost function is that, because of the very nonlinear sigmoid function appearing in the middle, J(theta) ends up being non-convex if you define it that way. So what we'd like to do instead is come up with a different cost function that is convex, so that we can apply an algorithm like gradient descent and be guaranteed to find the global minimum.

Here's the cost function that we're going to use for logistic regression. We're going to say that the cost, or the penalty, that the algorithm pays if it outputs a value h(x) (some number like 0.7) when the actual label turns out to be y is: minus log h(x) if y is equal to 1, and minus log(1 minus h(x)) if y is equal to 0. This looks like a pretty complicated function, but let's plot it to gain some intuition about what it's doing. Let's start with the case of y equals 1. If y is equal to 1, then the cost is -log h(x). If we plot that, with h(x) on the horizontal axis, and remembering that the hypothesis outputs a value between 0 and 1, so h(x) varies between 0 and 1, you find that it looks like this.
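Written out, the definition just given is:

```latex
\mathrm{Cost}\!\left(h_\theta(x),\, y\right) =
\begin{cases}
  -\log\!\left(h_\theta(x)\right)     & \text{if } y = 1 \\
  -\log\!\left(1 - h_\theta(x)\right) & \text{if } y = 0
\end{cases}
```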
One way to see why the plot looks like this is to plot log z, with z on the horizontal axis: the log function approaches minus infinity as z goes to 0 and passes through 0 at z equals 1. Here z is, of course, playing the role of h(x). So minus log z looks like that same curve with the sign flipped. And we're interested only in the range where h(x) goes between 0 and 1, so we get rid of the rest and are left with just this part of the curve. That's what the curve on the left looks like.

Now this cost function has a few interesting and desirable properties. First, notice that if h(x) equals 1, that is, if the hypothesis predicts y equals 1, and indeed y is equal to 1, then the cost is equal to 0. That corresponds to this point down here, and that is what we'd like it to be, because if we correctly predict the output y, then the cost is 0. But notice also that as h(x) approaches 0, that is, as the output of the hypothesis approaches 0, the cost blows up; the curve doesn't flatten out, it keeps going and goes to infinity.
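To see the blow-up numerically (a quick sketch of mine, with NumPy assumed), evaluate the y = 1 branch at a few predictions:

```python
import numpy as np

# The y = 1 branch of the cost, -log(h): it vanishes as the prediction h
# approaches 1 and grows without bound as h approaches 0.
for h in [0.999, 0.9, 0.5, 0.1, 0.01, 1e-6]:
    print(f"h(x) = {h:8.6f}   cost = {-np.log(h):9.4f}")
```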
What this does is capture the intuition that if the hypothesis outputs 0, it's saying the chance of y equals 1 is zero. It's kind of like going to our medical patient and saying, "The probability that you have a malignant tumor, the probability that y equals 1, is zero." That is, it's absolutely impossible that your tumor is malignant. But if it turns out that the patient's tumor actually is malignant, so y is equal to 1 even after we said the probability of that happening is 0, then because we made that claim with total certainty and turned out to be wrong, we penalize the learning algorithm with a very, very large cost. That's captured by having the cost go to infinity when y equals 1 and h(x) approaches 0.

That was the case of y equals 1. Now let's look at what the cost function looks like for y equals 0. If y is equal to 0, then the cost is this expression over here, minus log(1 minus h(x)). If you plot the function minus log(1 minus z), with z going from 0 to 1 on the horizontal axis, you find that the cost function looks like this: it blows up and goes to plus infinity as h(x) goes to 1. It's saying that if y turns out to be equal to 0, but we predicted that y equals 1 with almost complete certainty, with probability 1, then we end up paying a very large cost.
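If you want to reproduce both plots just described, here is a short sketch (assuming NumPy and Matplotlib are available):

```python
import numpy as np
import matplotlib.pyplot as plt

# Plot both branches of the logistic regression cost over the range of h(x).
h = np.linspace(0.001, 0.999, 500)
plt.plot(h, -np.log(h), label="y = 1:  -log(h)")
plt.plot(h, -np.log(1 - h), label="y = 0:  -log(1 - h)")
plt.xlabel("h(x)")
plt.ylabel("cost")
plt.legend()
plt.show()
```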
And conversely, if h(x) is equal to 0 and y equals 0, then the hypothesis nailed it: it predicted y equals 0, and y turned out to be equal to 0, so at that point the cost is 0.

In this video, we have defined the cost function for a single training example. The topic of convexity analysis is beyond the scope of this course, but it is possible to show that with this particular choice of cost function, we get a convex optimization problem: the overall cost function J(theta) will be convex and free of local optima.
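As a companion to the earlier squared-error scan (again my own sketch on the same made-up two-example dataset, NumPy assumed), scanning J(theta) built from the new cost shows a single bowl, consistent with the convexity claim:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same hypothetical toy data as the squared-error scan above.
x = np.array([1.0, 10.0])
y = np.array([0.0, 1.0])

def J_logistic(theta):
    # Average per-example logistic cost: -log(h) if y = 1, -log(1 - h) if y = 0.
    h = sigmoid(theta * x)
    return np.mean(np.where(y == 1, -np.log(h), -np.log(1 - h)))

# The printed values descend to a single minimum and rise again: one bowl,
# no interior bump, so gradient descent (with a suitable step size) heads
# toward that minimum from any starting point.
for t in np.linspace(-3.0, 3.0, 13):
    print(f"theta = {t:5.2f}   J = {J_logistic(t):8.4f}")
```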
In the next video, we're going to take these ideas of the cost function for a single training example and develop them further, defining the cost function for the entire training set. We'll also figure out a simpler way to write it than we have been using so far. Based on that, we'll work out gradient descent, and that will give us our logistic regression algorithm.