[MUSIC] Very good. We've now seen the simple gradient ascent algorithm for logistic regression: how we update the parameters, how we implement it, and we talked a little bit about how to set that step size parameter, the [INAUDIBLE] parameter, and what impact it has on the progress of our algorithm. Now I'm going to take a little bit of time to show you how to derive the derivative of the likelihood function for logistic regression, how that gradient is computed. The material we're going to talk about here is quite mathematical. This is really PhD-level material, and it can be a little annoying, but for some folks it might be exciting to go through every detail. For others, this could be something that you skip, and it doesn't change anything. Up to you, you've been warned, but it's there for when you want to learn more about it. We're going to jump into deriving the gradient for the likelihood function for logistic regression. Again, PhD-level material. Here we go.

As we said, our goal is to pick the coefficients w to maximize the likelihood function, and that is the product over our data points of the probability of yi given xi and w. Now, it turns out that all the math we need to do becomes quite a lot simpler if you take the log of that likelihood function. I'm going to call that ll(w), and this is the natural log, ln, of that product. It turns out that for most of machine learning, especially for derivations like this, you often take the log, and it usually makes your math a lot simpler. Logs are your friends.

Let me do a quick review of the natural log function. Here's the function I'm showing: on the x-axis you have some value z, and on the y-axis this is what the log of z looks like. At zero it would be minus infinity; it grows quickly at first and then much, much more slowly later, but this is what the log does. The reason that it's useful is that the big product we had actually becomes a sum. In general, the log of a times b is equal to the log of a plus the log of b. Similarly, the log of a over b is the log of a minus the log of b.
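To make that product-to-sum step concrete, here is the identity being described, written out as a minimal sketch; the number of data points N is notation I'm assuming here rather than something stated in the transcript:

\ell\ell(w) \;=\; \ln \prod_{i=1}^{N} P(y_i \mid x_i, w) \;=\; \sum_{i=1}^{N} \ln P(y_i \mid x_i, w)

A sum like this is much easier to differentiate than a product, since the gradient of a sum is just the sum of the per-data-point gradients, which is what makes the gradient derivation that follows tractable.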
Here's a side note. I remember, in high school I think, I learned about the log function, and that the log of a times b is log a plus log b, and I thought, my god, this is the most useless thing I've ever seen. Why are they spending time teaching it? It really took about six years before I actually saw that it was useful for something, and it's actually extremely useful for machine learning. So, a funny side note. But anyway, the log has some very interesting properties. The other property it has is that if you take a function f and compute its maximum, taking the log doesn't change anything.

If we denote w hat here to be, we're going to call it, the arg max over w of f(w), this notation just means the w that makes f(w) largest. And if you were to take the log of that function, let's call w hat underscore ln the thing that maximizes the log of f(w), so w hat ln = arg max over w of log(f(w)). Because log is what's called a positive monotonic function, this transformation doesn't change the optimum. It turns out that w hat is going to be equal to w hat ln. So what we did in the previous slide, just taking the log of the likelihood function, still keeps the optimum exactly in the same place, and it's going to make your math quite a bit easier. [MUSIC]
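As a quick numerical illustration of that last point, here is a minimal sketch, assuming a hypothetical one-dimensional grid of w values and a hypothetical strictly positive function f standing in for a likelihood; it just checks that taking the log leaves the arg max unchanged:

import numpy as np

# Hypothetical 1-D grid of coefficient values w (a stand-in for the real parameter space).
w = np.linspace(-3.0, 3.0, 1001)

# Hypothetical strictly positive "likelihood-like" function f(w), peaked near w = 1.
f = np.exp(-(w - 1.0) ** 2) + 0.5

# Because ln is monotonically increasing, the index of the maximum is the same either way.
assert np.argmax(f) == np.argmax(np.log(f))

print("w_hat    =", w[np.argmax(f)])          # maximizer of f
print("w_hat_ln =", w[np.argmax(np.log(f))])  # maximizer of ln f(w) (identical)

The same monotonicity argument is exactly why maximizing ll(w) gives the same coefficients w hat as maximizing the likelihood itself.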