[MUSIC]

We'll start with a very quick review of gradient ascent. In the regression course, Emily went into quite a lot of detail explaining the gradient ascent algorithm, where it comes from, and the details. I really recommend you go back to that material if you want to remind yourself of where it all comes from. I'm just going to do a very quick review here so that we remember the fundamentals.

You can think about the gradient ascent algorithm as a kind of hill climbing algorithm. If you look at the picture on the left here, where you only have one parameter w, you can imagine starting at some point, let's say w t at iteration t, and then moving a little bit uphill to the next parameter, w t+1. The amount you move from one to the other has to do with this term over here, which is the derivative of our likelihood function with respect to the parameter w, and it's computed at the current parameter w t. Now, remember we have this little extra parameter eta, which we call the step size. A little later in the module, we're going to discuss how that step size actually gets picked and what effect it has. But imagine it's given, so you're going to move by eta times the derivative. And so you move here, then you move a little bit more, and you keep moving uphill, and eventually you get to the top of the hill, to the optimum, which we're going to call w star for right now. And this is our goal, to get to that very top of the hill.

Now that we've seen hill climbing in the abstract, stepping up until we hit the top here, which is w star, we can talk about when we know we're done in this iterative algorithm: when have we converged? You might remember that for a concave function, which is what we're dealing with today, and we talked in quite a bit of detail about what that means, the derivative of the likelihood with respect to w is equal to zero at the optimum, because the curve flattens out there. Now, if we could get to the point where the derivative is exactly equal to zero, we would be absolutely done. However, we're never going to get there exactly, so we're going to stop when the derivative is small enough.
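To make that one-dimensional update concrete, here is a minimal sketch of the hill-climbing loop in Python. The toy likelihood, starting point, step size, and tolerance below are made up purely for illustration; they are not the course's actual settings.

    # Minimal 1-D gradient ascent sketch (toy example for illustration only).
    def gradient_ascent_1d(dl_dw, w0=0.0, eta=0.1, epsilon=1e-3, max_iter=10000):
        """Repeat w(t+1) = w(t) + eta * dl/dw, with dl/dw evaluated at w(t)."""
        w = w0
        for _ in range(max_iter):
            slope = dl_dw(w)           # derivative of the likelihood at the current w
            if abs(slope) < epsilon:   # close enough to the top of the hill
                break
            w = w + eta * slope        # move a little bit uphill
        return w

    # Toy concave "likelihood" l(w) = -(w - 3)^2, so dl/dw = -2*(w - 3) and w* = 3.
    w_hat = gradient_ascent_1d(lambda w: -2.0 * (w - 3.0))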
So we look at the derivative with respect to the parameter w, computed at the current parameter w t, and we stop when that derivative is smaller in magnitude than some tolerance parameter epsilon. And, you know, epsilon is set to some small number, 0.001 or something like that. It depends on your problem, and you'll explore this in your homework. You keep going; you won't get exactly to the top, but you're going to get pretty close to it, and that will be your w hat. So we'll continue going up and up and up until our derivative is sufficiently small.

Now, that was for a one-dimensional space, but we're really in a high-dimensional space, because we can have thousands of coefficients we're trying to fit. So, instead of computing a one-dimensional derivative, we need what's called a gradient. And the gradient is just a stacking of the partial derivative of l with respect to the first parameter, w0, the partial derivative of l with respect to the second parameter, w1, all the way to the partial derivative of l with respect to the last parameter, wD. So this is a D+1 dimensional vector when you have D features. And in our case here, if we start from this point and the derivative is taken this way, the gradient is going to be something like this, maybe. That's what the gradient corresponds to. And then we're just going to follow that gradient all the way to the top of the hill.

In a little bit more detail, more pictorially, we're going to start from some point. There are mostly just heuristics for where to start, but let's say we start over here, and we're just going to follow the gradient. On the right side I'm showing you what's called a contour plot, which corresponds to the level sets, the contours, of this surface. So if I start on the left side over here, it's kind of like starting over here. I take a gradient step, and I keep following the gradient until I get to the optimum here, which is going to be our w star, and that corresponds to climbing this hill until I get to the top, which again is our w star. And again, Emily went into quite a bit of detail on contour plots and how they relate to that 3D plot. But that's our goal.
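As a small illustration of that stacking, here is a sketch of a gradient as a (D+1)-dimensional vector of partial derivatives, using a made-up concave quadratic as the likelihood; the numbers are arbitrary and only show the shape of the object.

    import numpy as np

    # Toy concave "likelihood" over D+1 coefficients: l(w) = -||w - w_opt||^2.
    w_opt = np.array([1.0, -2.0, 0.5])   # made-up optimum; D = 2, so D+1 = 3 coefficients

    def gradient(w):
        # One partial derivative per coefficient w_0, w_1, ..., w_D, stacked into a vector:
        # dl/dw_j = -2 * (w_j - w_opt_j)
        return -2.0 * (w - w_opt)

    w = np.zeros_like(w_opt)
    print(gradient(w))        # a (D+1)-dimensional vector pointing uphill from w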
So finally, just as a reminder, here's what the gradient ascent algorithm looks like. You start from some point, w 0, and so you might say w = 0, or random parameters, or something else. And you just keep following the gradient until the magnitude of the gradient is sufficiently small. So that's our algorithm; that's what's going to take us to the optimum. It's summarized over here, and the eta here is our famous step size. Great, so now we're done with our review of gradient ascent.

[MUSIC]
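Putting those pieces together, here is a minimal sketch of the full gradient ascent loop just described: start from some w 0, keep stepping by eta times the gradient, and stop once the gradient's magnitude drops below a tolerance epsilon. The step size, tolerance, and toy gradient (reused from the earlier made-up quadratic) are assumptions for illustration only.

    import numpy as np

    def gradient_ascent(grad, w0, eta=0.1, epsilon=1e-3, max_iter=100000):
        """Repeat w <- w + eta * grad(w) until ||grad(w)|| < epsilon."""
        w = np.asarray(w0, dtype=float)
        for _ in range(max_iter):
            g = grad(w)
            if np.linalg.norm(g) < epsilon:   # magnitude of the gradient is small enough
                break
            w = w + eta * g                   # step uphill along the gradient
        return w

    # Using the toy concave quadratic from before, whose optimum is w_opt:
    w_opt = np.array([1.0, -2.0, 0.5])
    w_hat = gradient_ascent(lambda w: -2.0 * (w - w_opt), w0=np.zeros(3))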