In this video, I want to give you more practical tips for getting gradient descent to work. The ideas in this video will center around the learning rate alpha. Concretely, here's the gradient descent update rule, and what I want to do in this video is, first, tell you about what I think of as debugging, and some tips for making sure that gradient descent is working correctly; and second, tell you how to choose the learning rate alpha, or at least how I go about choosing it.

Here's something that I often do to make sure gradient descent is working correctly. The job of gradient descent is to find a value of theta that hopefully minimizes the cost function J of theta. What I often do is therefore plot the cost function J of theta as gradient descent runs. So the x-axis here is the number of iterations of gradient descent, and as gradient descent runs, you'll hopefully get a plot that maybe looks like this. Notice that the x-axis is the number of iterations. Previously we were looking at plots of J of theta where the x-axis, the horizontal axis, was the parameter vector theta, but that's not what this is. Concretely, what this point is, is that I'm going to run gradient descent for a hundred iterations.
And whatever value I get for theta after a hundred iterations, I'm going to evaluate the cost function J of theta at that value, and this vertical height is the value of J of theta for the theta I got after a hundred iterations of gradient descent. And this point here corresponds to the value of J of theta for the theta that I get after I've run gradient descent for two hundred iterations. So what this plot is showing is the value of your cost function after each iteration of gradient descent. And if gradient descent is working properly, then J of theta should decrease after every iteration.

One useful thing that this sort of plot can also tell you is the following: if you look at the specific figure that I've drawn, it looks like by the time you've gotten out to three hundred iterations, between three and four hundred iterations, J of theta hasn't gone down much more. So by the time you get to four hundred iterations, it looks like this curve has flattened out. And so, way out here at four hundred iterations, it looks like gradient descent has more or less converged, because your cost function isn't going down much more. So looking at this figure can also help you judge whether or not gradient descent has converged.

By the way, the number of iterations that gradient descent takes to converge for a specific application can vary a lot. Maybe for one application gradient descent converges after just thirty iterations; for a different application it may take 3,000 iterations.
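For concreteness, here is a minimal sketch of this kind of debugging plot in Python, using NumPy and Matplotlib on made-up linear regression data. The helpers `compute_cost` and `gradient_step` and the synthetic `X`, `y` below are my own illustrative names, not anything defined in the lecture; the point is just recording J of theta after every iteration and plotting it against the iteration number.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic linear-regression data (illustrative only).
rng = np.random.default_rng(0)
m = 100
X = np.c_[np.ones(m), rng.uniform(0, 10, m)]      # add an intercept column
y = 4.0 + 2.5 * X[:, 1] + rng.normal(0, 1, m)

def compute_cost(theta, X, y):
    """Squared-error cost J(theta) = 1/(2m) * sum((X theta - y)^2)."""
    residual = X @ theta - y
    return residual @ residual / (2 * len(y))

def gradient_step(theta, X, y, alpha):
    """One gradient descent update: theta := theta - alpha * dJ/dtheta."""
    grad = X.T @ (X @ theta - y) / len(y)
    return theta - alpha * grad

theta = np.zeros(X.shape[1])
alpha, num_iters = 0.01, 400
J_history = []
for _ in range(num_iters):
    theta = gradient_step(theta, X, y, alpha)
    J_history.append(compute_cost(theta, X, y))

# If gradient descent is working, this curve should decrease and flatten out.
plt.plot(range(1, num_iters + 1), J_history)
plt.xlabel("Number of iterations")
plt.ylabel("J(theta)")
plt.show()
```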
For another learning algorithm, it may take three million iterations. It turns out to be very difficult to tell in advance how many iterations gradient descent needs to converge, and it's usually by plotting this sort of plot, plotting the cost function as we increase the number of iterations, that I try to tell whether gradient descent has converged.

It is also possible to come up with an automatic convergence test; namely, to have an algorithm try to tell you if gradient descent has converged. Here's a pretty typical example of an automatic convergence test: you declare convergence if your cost function J of theta decreases by less than some small value epsilon, say ten to the minus three, in one iteration. But I find that choosing this threshold is usually pretty difficult. So, in order to check whether gradient descent has converged, I actually tend to look at plots like the figure on the left rather than rely on an automatic convergence test.

Looking at this sort of figure can also give you advance warning if gradient descent is not working correctly. Concretely, if you plot J of theta as a function of the number of iterations and you see a figure like this, where J of theta is actually increasing, then that gives you a clear sign that gradient descent is not working. A figure like this usually means that you should be using a smaller learning rate alpha. If J of theta is actually increasing, the most common cause is that you're trying to minimize a function that maybe looks like this.
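If you did want to automate that check, a sketch of the kind of automatic convergence test described here might look like the following. It reuses the hypothetical `compute_cost` and `gradient_step` helpers from the sketch above; the epsilon value and the iteration cap are just illustrative choices.

```python
def run_until_converged(theta, X, y, alpha, epsilon=1e-3, max_iters=10_000):
    """Declare convergence when J(theta) decreases by less than epsilon
    in one iteration; as noted above, choosing epsilon well is the hard part."""
    prev_cost = compute_cost(theta, X, y)
    for i in range(max_iters):
        theta = gradient_step(theta, X, y, alpha)
        cost = compute_cost(theta, X, y)
        if prev_cost - cost < epsilon:
            return theta, i + 1       # converged, by this test
        prev_cost = cost
    return theta, max_iters           # hit the iteration cap instead
```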
That is, if your learning rate is too big, then if you start off there, gradient descent may overshoot the minimum and send you over there; then, if the learning rate is still too big, you may overshoot again, and it will send you over there, and so on. What you really wanted was to start here and slowly go downhill. But if the learning rate is too big, then gradient descent can instead keep on overshooting the minimum, so that you actually end up getting worse and worse, with higher and higher values of the cost function J of theta, and you end up with a plot like this. If you see a plot like this, the fix usually is to just use a smaller value of alpha. Oh, and also, of course, make sure that your code does not have a bug in it. But usually a learning rate that is too large is the more common problem.

Similarly, sometimes you may see J of theta do something like this: go down for a while, then go up, then go down for a while, then go up, and so on. The fix for something like this is also to use a smaller value of alpha.

I'm not going to prove it here, but under reasonable assumptions about the cost function, which do hold for linear regression, mathematicians have shown that if your learning rate alpha is small enough, then J of theta should decrease on every single iteration. So if this doesn't happen, it probably means alpha is too big, and you should set it smaller. But of course, you also don't want your learning rate to be too small, because if you do that, then gradient descent can be slow to converge.
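To make the overshooting behavior concrete, here is a tiny numerical example of my own, not from the lecture, minimizing the one-dimensional cost J(theta) = theta squared, whose gradient is 2 theta. Each update multiplies theta by (1 - 2 alpha), so with alpha = 1.1 the iterates flip sign and grow, and J(theta) increases every iteration; with alpha = 0.1 they shrink toward the minimum at zero.

```python
def descend_on_quadratic(alpha, theta0=1.0, steps=5):
    """Gradient descent on J(theta) = theta^2, whose gradient is 2*theta."""
    theta = theta0
    costs = []
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # the update rule
        costs.append(theta ** 2)            # J(theta) after this step
    return costs

print(descend_on_quadratic(alpha=0.1))  # decreasing: converging
print(descend_on_quadratic(alpha=1.1))  # increasing: overshooting / diverging
```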
And if alpha were too small, you might end up starting out here, say, and end up taking minuscule baby steps, and taking a lot of iterations before you finally get to the minimum. So if alpha is too small, gradient descent can make very slow progress and be slow to converge.

To summarize: if the learning rate is too small, you can have a slow convergence problem; and if the learning rate is too large, J of theta may not decrease on every iteration and may not even converge. In some cases, if the learning rate is too large, slow convergence is also possible, but the more common symptom is just that J of theta doesn't decrease on every iteration. And in order to debug all of these things, plotting J of theta as a function of the number of iterations often helps you figure out what's going on.

Concretely, what I actually do when I run gradient descent is try a range of values. So just try running gradient descent with a range of values for alpha, like 0.001 and 0.01, which are a factor of ten apart; for each of these values of alpha, plot J of theta as a function of the number of iterations, and then pick the value of alpha that seems to be causing J of theta to decrease rapidly. In fact, what I actually do isn't quite these steps of ten, where each value is a factor of ten bigger than the previous one.
What I'll actually do is try this range of values, and so on, where this is 0.001; then I increase the learning rate roughly threefold to get 0.003; and then this step up, another roughly threefold increase, takes 0.003 to 0.01. So roughly, I'm trying out gradient descent with each value being about three times bigger than the previous value. What I'll do is try a range of values until I've made sure that I've found one value that is too small and one value that is too large, and then I try to pick the largest possible value, or just something slightly smaller than the largest reasonable value that I found. When I do that, it usually gives me a good learning rate for my problem. And if you do this too, hopefully you will be able to choose a good learning rate for your implementation of gradient descent.
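A minimal sketch of this alpha-scanning procedure, again reusing the hypothetical helpers and data from the first sketch, might look like the following; the candidate list just follows the roughly threefold spacing described above. On this data the largest values will visibly blow up, which is exactly what tells you they are too large.

```python
alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]   # roughly 3x apart
num_iters = 100

for alpha in alphas:
    theta = np.zeros(X.shape[1])
    J_history = []
    for _ in range(num_iters):
        theta = gradient_step(theta, X, y, alpha)
        J_history.append(compute_cost(theta, X, y))
    # One curve per alpha: too small decreases slowly,
    # too large increases or oscillates.
    plt.plot(range(1, num_iters + 1), J_history, label=f"alpha = {alpha}")

plt.xlabel("Number of iterations")
plt.ylabel("J(theta)")
plt.legend()
plt.show()
```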