[MUSIC] So in these algorithms, I said that there's a stepsize. And you might be wondering, well, how do you choose that stepsize? So, remember, the stepsize we denoted as eta. And there are lots of different choices you can make here, and clearly they're very important. This determines how much you're changing your w at every iteration of this algorithm.

One choice you can look at is something called a fixed stepsize or constant stepsize, where, as the name implies, you just set eta equal to some constant, so for example, maybe 0.1. But what can happen is that you can jump around quite a lot. So let's say I start far away. There's a really big gradient. I take a very big step, and I keep taking these big steps. And then here, I end up jumping over the optimum to a point here, and then I jump back, and then I'm going back and forth, and back and forth, and I converge very slowly to the optimum itself. And so an analogy here is you can think of a snowboarder in a half-pipe, where they just keep ramping up, and back, and up, and back, and slowly it's decaying, and slowly they're gonna settle into the groove of that half-pipe. But you can imagine going back and forth, and back and forth, quite a number of times.

But I should mention that this fixed stepsize setting will work for a lot of the cases we're going to look at in this module, because in most of the cases these functions are going to be something that's called strongly convex. That's just a really well-behaved convex function. It's very clearly convex; it just never flattens out very much. And in those settings, you will converge using this fixed stepsize, it just might take a little while.

On the other hand, a common choice is to decrease the stepsize as the number of iterations increases. Okay? So this means that there's some stepsize schedule, as it's often called, that you need to set. And common choices for this are that the stepsize you're using at iteration t, eta_t, is equal to some alpha / t, or another common choice is some alpha / sqrt(t). And let's just plot what these functions look like.
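As a rough illustration of these choices, here is a minimal Python sketch of a constant stepsize next to the two decreasing schedules, eta_t = alpha / t and eta_t = alpha / sqrt(t). The function names and the constants (0.1 for the fixed stepsize, alpha = 1.0) are illustrative assumptions, not values fixed by the lecture.

```python
import numpy as np

def eta_fixed(t, eta0=0.1):
    """Fixed (constant) stepsize: the same eta at every iteration t."""
    return eta0

def eta_over_t(t, alpha=1.0):
    """Decreasing schedule eta_t = alpha / t."""
    return alpha / t

def eta_over_sqrt_t(t, alpha=1.0):
    """Decreasing schedule eta_t = alpha / sqrt(t)."""
    return alpha / np.sqrt(t)

# Both decreasing schedules shrink toward 0 as the iteration count grows,
# e.g. eta_over_t(1) = 1.0 while eta_over_t(100) = 0.01.
```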
Both of these functions look something like this: this would be the plot of eta_t over iterations t, and the stepsize is decreasing with the number of iterations. So why does this make sense? Well, when I start out, I'm typically assuming that I'm gonna be far, or decently far, from the optimum, so I want to take large steps. But as I'm running my algorithm, hoping I'm getting closer, I want to hone in more rapidly on the optimum. So that's why I start decreasing the stepsize. So that when I'm out here, initially maybe I still take a really large jump, but then as I'm going in, maybe I'll jump across once, but I'm gonna hone in much more rapidly on this minimum here.

But one thing you have to be careful about is not decreasing the stepsize too rapidly, because if you do that, you're, again, gonna take a while to converge, because you're just gonna be taking very, very, very small steps. Okay, so in summary, choosing your stepsize is a bit of an art, but here we've gone through a couple of examples of things that are commonly used out there.

Okay, so the other part of the algorithm that we left unspecified is the part where we say: while not converged. How are we going to assess our convergence? Well, we know that the optimum occurs when the derivative of the function is equal to 0. But what we're gonna see in practice is that the derivative is gonna get smaller and smaller and smaller; it's gonna get very, very, very small, but it won't be exactly 0. At some point, we're going to want to say, okay, that's good enough, we're close enough to the optimum, I'm gonna terminate this algorithm, and I'm gonna say that this is my solution. So what we're gonna need to specify is a condition where we check the absolute value of the derivative, because I don't care if I'm a little bit to the right or a little bit to the left of the optimum, just what the absolute value is. If that is less than some epsilon, which is a threshold I'm setting, then I'm gonna terminate the algorithm and return whatever solution I have, w_t.
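To make the stopping rule concrete, here is a small sketch of a gradient descent loop that terminates once the magnitude of the derivative drops below epsilon. The function signature, the max_iter safeguard, the default epsilon, and the example problem are assumptions for illustration, not part of the lecture's algorithm.

```python
import numpy as np

def gradient_descent(gradient, w0, eta_schedule, epsilon=1e-6, max_iter=100000):
    """Minimize a function given its gradient, stopping when |gradient| < epsilon.

    gradient(w)     -- returns the derivative (scalar) or gradient (vector) at w
    eta_schedule(t) -- returns the stepsize eta_t to use at iteration t
    """
    w = np.asarray(w0, dtype=float)
    for t in range(1, max_iter + 1):
        g = gradient(w)
        # Convergence test: the derivative is "close enough" to 0.
        if np.linalg.norm(g) < epsilon:
            break
        # Gradient descent update: step against the gradient, scaled by eta_t.
        w = w - eta_schedule(t) * g
    return w

# Example: minimize f(w) = (w - 3)^2, whose derivative is 2 * (w - 3),
# using a fixed stepsize of 0.1.
w_hat = gradient_descent(lambda w: 2 * (w - 3), w0=10.0, eta_schedule=lambda t: 0.1)
```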
So in practice, we're just gonna choose epsilon to be very small. And I wanna emphasize that what "very small" means depends on the data that you're looking at, what the form of this function is, and what range of gradients we might expect. Another thing you can look at is a plot of the value of the function over iterations. You'll tend to see that the value decreases and then basically stops changing very much, and at that point, you know that you've converged. [MUSIC]
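One practical way to do that last check, sketched here under assumed names (a cost function handle and a tol threshold on the change in the objective), is to record the objective value at every iteration and stop once the trace is essentially flat; plotting the recorded values against the iteration number gives the kind of curve described above.

```python
def descend_with_cost_trace(cost, gradient, w0, eta_schedule, tol=1e-8, max_iter=100000):
    """Gradient descent that records the objective value each iteration and
    stops once successive values barely change (an alternative practical test)."""
    w = w0
    costs = [cost(w)]
    for t in range(1, max_iter + 1):
        w = w - eta_schedule(t) * gradient(w)
        costs.append(cost(w))
        # Stop when the objective is essentially no longer decreasing.
        if abs(costs[-2] - costs[-1]) < tol:
            break
    return w, costs  # plot `costs` vs. iteration number to eyeball convergence
```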