In this video and in the video after this one, I want to tell you about some of the practical tricks for making gradient descent work well. In this video, I want to tell you about an idea called feature scaling.

Here's the idea. If you have a problem where you have multiple features, and you make sure that the features are on a similar scale, by which I mean that the different features take on similar ranges of values, then gradient descent can converge more quickly.

Concretely, let's say you have a problem with two features, where x1 is the size of the house and takes on values between, say, zero and two thousand, and x2 is the number of bedrooms, which maybe takes on values between one and five. If you plot the contours of the cost function J of theta, then the contours may look like this, where J of theta is a function of the parameters theta 0, theta 1, and theta 2. I'm going to ignore theta 0, so let's forget about theta 0 and pretend J is a function of only theta 1 and theta 2. But if x1 can take on a much larger range of values than x2, it turns out that the contours of the cost function J of theta can take on a very, very skewed, elliptical shape; in fact, with the 2000-to-5 ratio, the skew can be even more extreme than drawn. So these very, very tall, skinny ellipses, or very tall, skinny ovals, can form the contours of the cost function J of theta. And if you run gradient descent on this cost function, it may end up taking a long time: it can oscillate back and forth, and it can take a long time before it finally finds its way to the global minimum.
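To make that skew concrete, here is a minimal sketch in Python (my own illustration with made-up synthetic data, not part of the lecture): with x1 ranging up to 2000 and x2 only up to 5, the cost J is vastly more sensitive to a small change in theta 1 than to the same change in theta 2, which is exactly what stretches the contours into those tall, skinny ellipses.

```python
# A minimal sketch (my own illustration, not from the lecture) of why the contours
# get so skewed. With x1 up to 2000 and x2 only up to 5, the cost J is far more
# sensitive to a small change in theta 1 than to the same change in theta 2.
# All data here is synthetic and made up purely for illustration.
import numpy as np

def cost(theta1, theta2, x1, x2, y):
    """J(theta) = 1/(2m) * sum((theta1*x1 + theta2*x2 - y)^2), with theta 0 ignored."""
    residual = theta1 * x1 + theta2 * x2 - y
    return (residual @ residual) / (2 * len(y))

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 2000, 100)                  # size of the house, 0..2000
x2 = rng.integers(1, 6, 100).astype(float)      # number of bedrooms, 1..5
y = 0.1 * x1 + 10.0 * x2                        # synthetic prices with a known minimum

base = cost(0.1, 10.0, x1, x2, y)               # J at the minimum is essentially zero
print(cost(0.11, 10.0, x1, x2, y) - base)       # nudging theta 1 by 0.01 raises J a lot
print(cost(0.1, 10.01, x1, x2, y) - base)       # the same nudge to theta 2 barely moves J
```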
In fact, you can imagine that if these contours are exaggerated even more, so that you draw incredibly skinny, tall contours, and it can be even more extreme than this, then gradient descent can have an even harder time: it can meander around and take a long time to find its way to the global minimum.

In these settings, a useful thing to do is to scale the features. Concretely, if you instead define the feature x1 to be the size of the house divided by two thousand, and define x2 to be maybe the number of bedrooms divided by five, then the contours of the cost function J can become much less skewed, so the contours may look more like circles. And if you run gradient descent on a cost function like this, then, as you can show mathematically, gradient descent can find a much more direct path to the global minimum, rather than taking a much more convoluted path, where it tries to follow a much more complicated trajectory to get to the global minimum. So, by scaling the features so that they take on similar ranges of values (in this example, we end up with both features, x1 and x2, between zero and one), you can wind up with an implementation of gradient descent that can converge much faster.

More generally, when we're performing feature scaling, what we often want to do is get every feature into approximately a -1 to +1 range. Concretely, your feature x0 is always equal to 1, so that's already in that range, but you may end up dividing other features by different numbers to get them into this range. The numbers -1 and +1 aren't too important. So, if you have a feature x1 that winds up being between zero and three, that's not a problem.
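Before finishing the point about acceptable ranges, here is a minimal sketch of that contrast in Python (synthetic data and illustrative learning rates of my own choosing, not the lecture's code): the same batch gradient descent is run on the raw features and then on the features divided by their maximum values, and the scaled version converges in far fewer iterations.

```python
# A minimal sketch (synthetic data, illustrative learning rates; not the lecture's code)
# contrasting batch gradient descent on the raw housing features with the same data
# after dividing each feature by its maximum value.
import numpy as np

def gradient_descent(X, y, alpha, max_iters=100_000, tol=1e-9):
    """Run batch gradient descent for linear regression; return theta and iterations used."""
    m, n = X.shape
    theta = np.zeros(n)
    prev_cost = np.inf
    for i in range(1, max_iters + 1):
        error = X @ theta - y
        theta -= alpha * (X.T @ error) / m          # theta_j := theta_j - alpha * dJ/dtheta_j
        cost = (error @ error) / (2 * m)            # J(theta) = 1/(2m) * sum of squared errors
        if abs(prev_cost - cost) < tol:             # stop once the cost barely changes
            return theta, i
        prev_cost = cost
    return theta, max_iters

rng = np.random.default_rng(0)
size = rng.uniform(0, 2000, 100)                    # x1: size, 0..2000
bedrooms = rng.integers(1, 6, 100).astype(float)    # x2: bedrooms, 1..5
y = 0.1 * size + 10.0 * bedrooms + rng.normal(0, 5, 100)   # synthetic prices

X_raw = np.column_stack([np.ones(100), size, bedrooms])            # x0 = 1
X_scaled = np.column_stack([np.ones(100), size / 2000, bedrooms / 5])

# With the raw features, any learning rate much larger than about 1e-6 diverges, and
# even a stable one crawls (it may simply hit the iteration cap). After scaling, a far
# larger learning rate is stable and convergence takes only on the order of a thousand steps.
_, iters_raw = gradient_descent(X_raw, y, alpha=1e-7)
_, iters_scaled = gradient_descent(X_scaled, y, alpha=0.5)
print(f"unscaled: {iters_raw} iterations, scaled: {iters_scaled} iterations")
```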
Coming back to the ranges: if you end up having a different feature that winds up being between -2 and +0.5, again, this is close enough to -1 and +1 that, you know, that's fine too. It's only if you have a different feature, say x3, that ranges from -100 to +100, that you have very different values than -1 and +1, so this might be a less well-scaled feature. And similarly, if your features take on a very, very small range of values, so if x4 takes on values between -0.0001 and +0.0001, then again this takes on a much smaller range of values than the -1 to +1 range, and again I would consider this feature poorly scaled. So you want the range of values to be, you know, maybe bigger than +1 or maybe smaller than +1, but just not much bigger, like +100 here, or too much smaller, like 0.0001 over there. Different people have different rules of thumb, but the one that I use is that if a feature takes on a range of values from, say, -3 to +3, I think that should be just fine; but if it takes on much larger values than +3 or -3, I might start to worry. And if it takes on values from, say, -1/3 to +1/3, you know, I think that's fine too, or 0 to 1/3, or -1/3 to 0; I'd say those are typical ranges of values, and those are okay. But if it takes on a much tinier range of values, like x4 here, then again I might start to worry. So the take-home message is: don't worry if your features are not exactly on the same scale or exactly in the same range of values. As long as they're all close enough to this, gradient descent should work okay.
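As one way of applying that rule of thumb, here is a small helper in Python (my own sketch, not something from the lecture): it flags a feature whose values go well beyond roughly plus or minus 3, or never get much past roughly plus or minus 1/3.

```python
# A small helper (my own sketch, not from the lecture) applying the rule of thumb above:
# warn if a feature's values go well beyond roughly +/-3, or never get much past
# roughly +/-1/3. The thresholds are the lecture's loose guideline, not a hard rule.
import numpy as np

def check_feature_scale(x, name, upper=3.0, lower=1.0 / 3.0):
    """Print a rough verdict on whether a feature looks reasonably scaled."""
    span = np.max(np.abs(x))
    if span > upper:
        print(f"{name}: values reach {span:g}, consider scaling this feature down")
    elif span < lower:
        print(f"{name}: values only reach {span:g}, consider scaling this feature up")
    else:
        print(f"{name}: range looks fine for gradient descent")

check_feature_scale(np.array([0.0, 1.5, 3.0]), "x1")           # 0 to 3: fine
check_feature_scale(np.array([-2.0, 0.0, 0.5]), "x2")          # -2 to +0.5: fine
check_feature_scale(np.array([-100.0, 0.0, 100.0]), "x3")      # -100 to +100: poorly scaled
check_feature_scale(np.array([-1e-4, 0.0, 1e-4]), "x4")        # tiny range: poorly scaled
```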
In addition to dividing by the maximum value when performing feature scaling, sometimes people will also do what's called mean normalization. What I mean by that is that you take a feature xi and replace it with xi minus mu i, to make your features have approximately zero mean. Obviously, we don't want to apply this to the feature x0, because the feature x0 is always equal to one, so it cannot have an average value of zero. But concretely, for the other features: if the size of the house takes on values between 0 and 2000, and if, you know, the average size of a house is equal to 1000, then you might use this formula and set the feature x1 to the size minus the average value, divided by 2000. And similarly, if your houses have one to five bedrooms, and if on average a house has two bedrooms, then you might use this formula to mean normalize your second feature x2.

In both of these cases, you therefore wind up with features x1 and x2 that can take on values roughly between -0.5 and +0.5. That's not exactly true; x2 can actually be slightly larger than 0.5, but close enough. And the more general rule is that you might take a feature x1 and replace it with (x1 - mu1) / s1, where, to define these terms, mu1 is the average value of x1 in the training set, and s1 is the range of values of that feature; by range, I mean, let's say, the maximum value minus the minimum value. Or, for those of you who know what the standard deviation of a variable is, setting s1 to be the standard deviation of the variable would be fine, too.
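Here is a minimal sketch of that rule in Python (assuming NumPy; the housing numbers are made up for illustration): each feature is replaced with (x - mu) / s, where s is the max-minus-min range by default, or optionally the standard deviation.

```python
# A minimal sketch of mean normalization (assuming NumPy; the housing numbers are
# made up for illustration): replace each feature with (x - mu) / s, where mu is
# the feature's mean over the training set and s is its max-minus-min range, or
# alternatively its standard deviation.
import numpy as np

def mean_normalize(x, use_std=False):
    """Return (x - mean) / s, where s is the range (max - min) or the standard deviation."""
    mu = x.mean()
    s = x.std() if use_std else x.max() - x.min()
    return (x - mu) / s

size = np.array([1000.0, 1500.0, 800.0, 2000.0, 600.0])   # x1: size, roughly 0..2000
bedrooms = np.array([2.0, 3.0, 1.0, 5.0, 2.0])            # x2: bedrooms, 1..5 (range = 5 - 1 = 4)

x1 = mean_normalize(size)        # values end up roughly between -0.5 and +0.5
x2 = mean_normalize(bedrooms)    # likewise, give or take
print(x1)
print(x2)
```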
Taking, you know, this max minus min would be fine. And similarly, for the second feature x2, you replace x2 in the same sort of way: subtract the mean of the feature and divide by the range of values, meaning the max minus the min. This sort of formula will get your features, you know, maybe not exactly, but roughly, into these sorts of ranges. And by the way, for those of you who are being super careful: technically, if we're taking the range as max minus min, this five here would actually become a four. So if the max is 5 and the min is 1, then the range of values is actually equal to 4. But all of these are approximate, and any value that gets the features into anything close to these sorts of ranges will do fine. The feature scaling doesn't have to be too exact in order to get gradient descent to run quite a lot faster.

So, now you know about feature scaling, and if you apply this simple trick, it can make gradient descent run much faster and converge in a lot fewer iterations. That was feature scaling. In the next video, I'll tell you about another trick to make gradient descent work well in practice.