In the last video, we talked about gradient descent for minimizing the cost function J of theta for logistic regression. In this video, I'd like to tell you about some advanced optimization algorithms and some advanced optimization concepts. Using some of these ideas, we'll be able to get logistic regression to run much more quickly than is possible with gradient descent, and this will also let the algorithms scale much better to very large machine learning problems, such as problems with a very large number of features.

Here's an alternative view of what gradient descent is doing. We have some cost function J, and we want to minimize it. So what we need to do is write code that takes as input the parameters theta and computes two things: J of theta and the partial derivative terms for j equals 0, 1, up to n. Given code that can do these two things, what gradient descent does is repeatedly perform the following update: given the code we wrote to compute these partial derivatives, gradient descent plugs them in here and uses them to update our parameters theta.

So another way of thinking about gradient descent is that we need to supply code to compute J of theta and these derivatives, and then these get plugged into gradient descent, which can then try to minimize the function for us. For gradient descent, I guess technically you don't actually need code to compute the cost function J of theta; you only need code to compute the derivative terms. But if you think of your code as also monitoring convergence or some such, we'll just think of ourselves as providing code to compute both the cost function and the derivative terms.
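To make this view concrete, here is a minimal sketch of gradient descent in Octave under that interface, assuming we have already written a costFunction(theta) that returns the pair [jVal, gradient]; the names gradientDescent, alpha, and numIters are placeholders for this illustration, not code from the lecture.

    function theta = gradientDescent(costFunction, theta, alpha, numIters)
      % Repeatedly apply the gradient descent update using the two pieces of
      % code we supply: costFunction returns [jVal, gradient].
      for iter = 1:numIters
        [jVal, gradient] = costFunction(theta);
        theta = theta - alpha * gradient;   % simultaneous update of every theta_j
        % jVal can be printed or stored here to monitor convergence.
      end
    end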
So, having written code to compute these two things, one algorithm we can use is gradient descent. But gradient descent isn't the only algorithm we can use. There are other algorithms, more advanced and more sophisticated ones, that, if we only provide them a way to compute these two things, give us different approaches to optimizing the cost function for us. Conjugate gradient, BFGS, and L-BFGS are examples of more sophisticated optimization algorithms that need a way to compute J of theta and a way to compute the derivatives, and can then use more sophisticated strategies than gradient descent to minimize the cost function.

The details of exactly what these three algorithms do are well beyond the scope of this course; in fact, you can easily end up spending many days, or a small number of weeks, studying these algorithms if you take a class in advanced numerical computing. But let me just tell you about some of their properties.

These three algorithms have a number of advantages. One is that, with any of these algorithms, you usually do not need to manually pick the learning rate alpha. One way to think of these algorithms is that, given a way to compute the derivatives and the cost function, they have a clever inner loop called a line search algorithm that automatically tries out different values for the learning rate alpha and automatically picks a good one, so that it can even pick a different learning rate for every iteration. And so you don't need to choose it yourself.
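To give a flavor of what such an inner loop might look like, here is a minimal sketch of one simple line search strategy, backtracking (Armijo) line search, written in Octave. The inner loops of conjugate gradient and (L-)BFGS are more sophisticated than this, and the function and variable names here are illustrative placeholders; for plain gradient descent, the search direction would just be the negative gradient.

    function alpha = backtrackingLineSearch(costFunction, theta, direction)
      % Start from a large candidate step size and shrink it until the step
      % gives a sufficient decrease in the cost (the Armijo condition).
      alpha = 1;      % initial candidate learning rate / step size
      beta  = 0.5;    % shrink factor
      c     = 1e-4;   % sufficient-decrease constant
      [j0, grad] = costFunction(theta);
      while costFunction(theta + alpha * direction) > j0 + c * alpha * (grad' * direction)
        alpha = beta * alpha;   % step was too large: shrink and try again
      end
    end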
These algorithms actually do more sophisticated things than just pick a good learning rate, and so they often end up converging much faster than gradient descent, but a detailed discussion of exactly what they do is beyond the scope of this course. In fact, I've used these algorithms for a long time, maybe over a decade, quite frequently, and it was only a few years ago that I actually figured out for myself the details of what conjugate gradient, BFGS, and L-BFGS do. So it is actually entirely possible to use these algorithms successfully and apply them to lots of different learning problems without actually understanding the inner loop of what these algorithms do.

If these algorithms have a disadvantage, I'd say that the main disadvantage is that they're quite a lot more complex than gradient descent. And in particular, you probably should not implement these algorithms - conjugate gradient, L-BFGS, BFGS - yourself unless you're an expert in numerical computing. Instead, just as I wouldn't recommend that you write your own code to compute square roots of numbers or to compute inverses of matrices, for these algorithms what I would recommend is that you just use a software library. So, to take a square root, what all of us do is use some function that someone else has written to compute the square roots of our numbers. And fortunately, Octave and the closely related language MATLAB - we'll be using Octave - have a pretty reasonable library implementing some of these advanced optimization algorithms. And so if you just use the built-in library, you get pretty good results. I should say that there is a difference between good and bad implementations of these algorithms.
And so, if you're using a different language for your machine learning application - if you're using C, C++, Java, and so on - you might want to try out a couple of different libraries to make sure that you find a good library for implementing these algorithms, because there is a difference in performance between a good implementation of conjugate gradient or L-BFGS and a less good implementation of conjugate gradient or L-BFGS.

So now let me explain how to use these algorithms; I'm going to do so with an example. Let's say that you have a problem with two parameters, theta equals theta 1 and theta 2, and let's say your cost function is J of theta equals theta 1 minus five, squared, plus theta 2 minus five, squared. So with this cost function, if you want to minimize J of theta as a function of theta, the value that minimizes it is going to be theta 1 equals 5, theta 2 equals 5. Now, again, I know some of you know more calculus than others, but the derivatives of the cost function J turn out to be these two expressions: the partial derivative with respect to theta 1 is 2 times (theta 1 minus 5), and the partial derivative with respect to theta 2 is 2 times (theta 2 minus 5). I've done the calculus.

So if you want to apply one of the advanced optimization algorithms to minimize the cost function J - say we didn't know the minimum was at 5, 5, and we wanted to find that minimum numerically, using something like gradient descent but preferably something more advanced than gradient descent - what you would do is implement an Octave function like this. We implement a cost function, costFunction(theta), and what this does is return two arguments: the first, jVal, is how we would compute the cost function J.
And so this says jVal equals theta 1 minus five, squared, plus theta 2 minus five, squared - it's just computing the cost function over here. And the second argument that this function returns is the gradient. So the gradient is going to be a two by one vector, and the two elements of the gradient vector correspond to the two partial derivative terms over here.

Having implemented this cost function, you can then call the advanced optimization function called fminunc - it stands for "function minimization unconstrained" in Octave - and the way you call this is as follows. You set a few options; this is an options data structure that stores the options you want. So 'GradObj', 'on' sets the gradient objective parameter to on; it just means you are indeed going to provide a gradient to this algorithm. I'm going to set the maximum number of iterations to, let's say, one hundred. We then give it an initial guess for theta, which is a 2 by 1 vector. And then this command calls fminunc. The at symbol here represents a pointer to the cost function that we just defined up there. And if you call this, it will use one of the more advanced optimization algorithms - you can think of it as just like gradient descent, but automatically choosing the learning rate alpha for you, so you don't have to do so yourself. It will then use one of these advanced optimization algorithms, like gradient descent on steroids, to try to find the optimal value of theta for you.

Let me actually show you what this looks like in Octave. So I've written this costFunction(theta) function exactly as we had it on the previous slide. It computes jVal, which is the cost function.
And it computes the gradient, with the two elements being the partial derivatives of the cost function with respect to the two parameters, theta 1 and theta 2.

Now let's switch to my Octave window. I'm going to type in those commands I had just now. So, options equals optimset - this is the notation for setting the options for my optimization algorithm - 'GradObj', 'on', 'MaxIter', 100, so that says 100 iterations, and I am going to provide the gradient to my algorithm. Let's say initialTheta equals zeros(2, 1), so that's my initial guess for theta. And now I can type optTheta, functionVal, exitFlag equals fminunc, with a pointer to the cost function, my initial guess, and the options, like so. And if I hit enter, this will run the optimization algorithm, and it returns pretty quickly.

This funny formatting is just because my command line had wrapped around. But what this says is that, numerically, this algorithm - think of it as gradient descent on steroids - found that the optimal value of theta is theta 1 equals 5, theta 2 equals 5, exactly as we were hoping for. The function value at the optimum is essentially 10 to the minus 30, so that's essentially zero, which is also what we were hoping for. And the exit flag is 1; this shows the convergence status of the algorithm, and if you want, you can do help fminunc to read the documentation for how to interpret the exit flag. But the exit flag lets you verify whether or not this algorithm has converged. So that's how you run these algorithms in Octave.
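Putting the pieces of this demo together, here is a sketch of the cost function file and the commands typed at the Octave prompt, with the output values as reported above; the variable names (optTheta, functionVal, exitFlag) follow the ones used in the demo.

    function [jVal, gradient] = costFunction(theta)
      % J(theta) = (theta_1 - 5)^2 + (theta_2 - 5)^2
      jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
      gradient = zeros(2, 1);
      gradient(1) = 2 * (theta(1) - 5);   % partial derivative with respect to theta_1
      gradient(2) = 2 * (theta(2) - 5);   % partial derivative with respect to theta_2
    end

Then, at the Octave prompt:

    options = optimset('GradObj', 'on', 'MaxIter', 100);   % we will supply the gradient; at most 100 iterations
    initialTheta = zeros(2, 1);                            % initial guess for theta
    [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
    % optTheta comes back as approximately [5; 5], functionVal is essentially zero
    % (about 10^-30), and exitFlag equal to 1 indicates convergence.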
I should mention, by the way, that for the Octave implementation, this value of theta, your parameter vector theta, must be in R^d for d greater than or equal to 2. So if theta is just a real number - if it is not at least a two-dimensional vector, or some higher-dimensional vector - this fminunc may not work; and in case you have a one-dimensional function that you want to optimize, you can look in the Octave documentation for fminunc for additional details.

So, that's how we optimized our trial example with this simple quadratic cost function. How do we apply this to logistic regression? In logistic regression we have a parameter vector theta, and I'm going to use a mix of Octave notation and sort of math notation, but I hope this explanation will be clear. Our parameter vector theta comprises the parameters theta 0 through theta n, and because Octave indexes vectors starting from 1 rather than from 0, theta 0 is actually written theta(1) in Octave, theta 1 is going to be written theta(2), and so on, up to theta n, which is going to be written theta(n+1).

So what we need to do then is write a cost function that captures the cost function for logistic regression. Concretely, the cost function needs to return jVal - so you need some code to compute J of theta - and we also need to give it the gradient. So gradient(1) is going to be some code to compute the partial derivative with respect to theta 0, the next entry is the partial derivative with respect to theta 1, and so on, as in the sketch below.
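As a rough illustration, a logistic regression cost function along these lines might look like the following sketch. It assumes a design matrix X with a leading column of ones and a label vector y of zeros and ones are passed in alongside theta; those extra arguments and names are assumptions for this sketch, not something the lecture spells out.

    function [jVal, gradient] = costFunction(theta, X, y)
      % X is an m-by-(n+1) matrix whose first column is all ones; y is an m-by-1 vector of 0/1 labels.
      m = length(y);
      h = 1 ./ (1 + exp(-X * theta));                                % sigmoid hypothesis h_theta(x) for each example
      jVal = (1 / m) * sum(-y .* log(h) - (1 - y) .* log(1 - h));    % logistic regression cost J(theta)
      gradient = (1 / m) * (X' * (h - y));                           % gradient(1) is the derivative w.r.t. theta_0, gradient(2) w.r.t. theta_1, ...
    end

With the extra arguments, the call would become something like fminunc(@(t)(costFunction(t, X, y)), initialTheta, options).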
Once again, the entries are gradient(1), gradient(2), and so on, rather than gradient(0), gradient(1), because Octave indexes vectors starting from one rather than from zero. But the main concept I hope you take away from this slide is that what you need to do is write a function that returns the cost function value and returns the gradient. And so, in order to apply this to logistic regression, or even to linear regression if you want to use these optimization algorithms for linear regression, what you need to do is plug in the appropriate code to compute these things over here.

So, now you know how to use these advanced optimization algorithms. Because these algorithms use a sophisticated optimization library, it makes the code just a little bit more opaque and so maybe a little bit harder to debug. But because these algorithms often run much faster than gradient descent, quite typically, whenever I have a large machine learning problem, I will use these algorithms instead of gradient descent. And with these ideas, hopefully you'll be able to get logistic regression and also linear regression to work on much larger problems.

So, that's it for advanced optimization concepts. In the next and final video on logistic regression, I want to tell you how to take the logistic regression algorithm that you already know about and make it work also on multi-class classification problems.