We've talked about how to evaluate learning algorithms, talked about model selection, and talked a lot about bias and variance. So how does this help us figure out what are potentially fruitful, and potentially not fruitful, things to try in order to improve the performance of a learning algorithm?

Let's go back to our original motivating example and revisit it in light of these results. Here is our earlier example of having fit regularized linear regression and finding that it doesn't work as well as we were hoping. We said that we had this menu of options, so is there some way to figure out which of these might be fruitful options?

The first of these options was getting more training examples. What this is good for is that it helps to fix high variance. Concretely, if you instead have a high bias problem and don't have any variance problem, then we saw in the previous video that getting more training examples just isn't going to help much at all. So the first option is useful only if, say, you plot the learning curves and figure out that you have at least a bit of a variance problem, meaning that the cross-validation error is quite a bit bigger than your training set error.
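As a concrete illustration of that learning-curve check, here is a minimal sketch in Python, assuming scikit-learn's Ridge regression as a stand-in for the lecture's regularized linear regression; the dataset and the particular lambda value (called alpha in scikit-learn) are made up for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

# Hypothetical dataset standing in for the regression problem in the lecture.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=200)

# Training and cross-validation error as a function of training-set size.
sizes, train_scores, cv_scores = learning_curve(
    Ridge(alpha=1.0),  # alpha plays the role of lambda
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5,
    scoring="neg_mean_squared_error",
)
train_err = -train_scores.mean(axis=1)
cv_err = -cv_scores.mean(axis=1)

for m, j_train, j_cv in zip(sizes, train_err, cv_err):
    print(f"m={m:3d}  J_train={j_train:.3f}  J_cv={j_cv:.3f}")

# Rule of thumb from the lecture: if J_cv stays well above J_train as m
# grows, you likely have a variance problem and more data may help; if
# both are high and close together, you likely have a bias problem.
```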
How about trying a smaller set of features? Trying a smaller set of features is, again, something that fixes high variance. In other words, if you figure out, by looking at learning curves or otherwise, that you have a high bias problem, then for goodness' sake don't waste your time trying to carefully select a smaller set of features to use, because if you have a high bias problem, using fewer features is not going to help. Whereas, in contrast, if you look at the learning curves or otherwise figure out that you have a high variance problem, then trying to select a smaller set of features might indeed be a very good use of your time.

How about trying to get additional features? Adding features is usually, not always but usually, something we think of as a fix for high bias problems. If you are adding extra features, it's usually because your current hypothesis is too simple, and so we want additional features to make the hypothesis better able to fit the training set. Similarly, adding polynomial features is another way of adding features, and so it's another way to try to fix a high bias problem. Concretely, if your learning curves show that you instead have a high variance problem, then this is probably a less good use of your time.

And finally, there's decreasing and increasing lambda. These are quick and easy to try, and I guess they're less likely to waste many months of your life. Decreasing lambda, as you already know, fixes high bias, whereas increasing lambda fixes high variance. If this isn't clear to you, I do encourage you to pause the video and convince yourself why that's the case, or take a look at the curves we were plotting at the end of the previous video and make sure you understand why.
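To make the lambda intuition concrete, here is a minimal sketch, assuming a synthetic one-dimensional dataset and degree-8 polynomial features (both made up for illustration), that sweeps lambda and prints training and cross-validation error; a very small lambda should overfit (low J_train, much higher J_cv) and a very large lambda should underfit (both errors high).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical 1-D dataset; polynomial features give the model enough
# capacity that lambda visibly controls over- and underfitting.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(x).ravel() + 0.3 * rng.normal(size=60)

for lam in [1e-4, 1e-2, 1.0, 100.0]:
    model = make_pipeline(PolynomialFeatures(degree=8), Ridge(alpha=lam))
    model.fit(x, y)
    j_train = np.mean((model.predict(x) - y) ** 2)
    j_cv = -cross_val_score(model, x, y, cv=5,
                            scoring="neg_mean_squared_error").mean()
    print(f"lambda={lam:>8}  J_train={j_train:.3f}  J_cv={j_cv:.3f}")

# Decreasing lambda lowers J_train (helps fix high bias); increasing
# lambda constrains the fit (helps fix high variance) until it underfits.
```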
Finally, let's take everything we've learned and relate it back to neural networks. Here is some practical advice for how I usually choose the architecture, or the connectivity pattern, of the neural networks I use.

If you're fitting a neural network, one option would be to fit a relatively small neural network with, say, only one hidden layer and a relatively small number of hidden units. A network like this would have relatively few parameters and be more prone to underfitting. The main advantage of these small neural networks is that the computation is cheaper.

An alternative would be to fit a relatively large neural network, with either more hidden units per layer or with more hidden layers. These networks tend to have more parameters and are therefore more prone to overfitting. One disadvantage, often not a major one but something to think about, is that a large number of neurons in your network can be more computationally expensive, although within reason this is often not a huge problem. The main potential problem of these much larger networks is overfitting. It turns out that when applying neural networks, larger is very often better; and if the network is overfitting, you can then use regularization to address the overfitting. Using a larger neural network with regularization is often more effective than using a smaller neural network, with the main possible disadvantage being that it can be more computationally expensive.
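As a rough sketch of that advice, here is a comparison, on a toy dataset made up for illustration, between a small one-hidden-layer network and a larger regularized one; in scikit-learn's MLPClassifier, the alpha parameter is the L2 penalty, playing the role of lambda.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hypothetical toy dataset for illustration.
X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Small network: few parameters, cheap, but more prone to underfitting.
small = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)

# Large network plus regularization: more parameters, more prone to
# overfitting, but alpha (the L2 penalty, i.e. lambda) keeps it in check.
large = MLPClassifier(hidden_layer_sizes=(100, 100), alpha=1.0,
                      max_iter=2000, random_state=0)

for name, net in [("small", small), ("large + reg", large)]:
    net.fit(X_train, y_train)
    print(f"{name}: train acc = {net.score(X_train, y_train):.3f}, "
          f"cv acc = {net.score(X_cv, y_cv):.3f}")
```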
And finally, one of the other decisions is the number of hidden layers you want to have. Do you want one hidden layer, or three hidden layers, as we've shown here, or two hidden layers? Usually, as I think I said in the previous video, using a single hidden layer is a reasonable default. But if you want to choose the number of hidden layers, one other thing you can try is this: take a training, cross-validation, and test set split, and try training neural networks with one hidden layer, or two hidden layers, or three hidden layers, and see which of those neural networks performs best on the cross-validation set. That is, you take your three neural networks with one, two, and three hidden layers, compute the cross-validation error J_cv for each of them, and use that to select which of these you think is the best neural network. A short sketch of this selection procedure closes out this video.

So, that's it for bias and variance, and for ways, like learning curves, to try to diagnose these problems, as well as what they imply about which things might be fruitful, or not fruitful, to try in order to improve the performance of a learning algorithm. If you understood the content of the last few videos, and if you apply it, you will actually already be much more effective at getting learning algorithms to work on problems than a large fraction, maybe the majority, of practitioners of machine learning here in Silicon Valley today doing these things as their full-time jobs. So I hope that these pieces of advice on bias and variance and on diagnostics will help you apply learning algorithms much more effectively and powerfully, and get them to work very well.
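Here is that layer-count selection procedure as a minimal sketch; the dataset is made up, and 5-fold cross-validation stands in for the single train/cross-validation/test split described above, with J_cv approximated as one minus the mean cross-validated accuracy.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Hypothetical dataset; in practice, use your own problem's data.
X, y = make_moons(n_samples=400, noise=0.25, random_state=0)

# Candidate architectures: one, two, and three hidden layers of 25 units.
candidates = [(25,), (25, 25), (25, 25, 25)]

best = None
for layers in candidates:
    net = MLPClassifier(hidden_layer_sizes=layers, max_iter=2000,
                        random_state=0)
    # Approximate J_cv as one minus mean cross-validated accuracy.
    j_cv = 1 - cross_val_score(net, X, y, cv=5).mean()
    print(f"{len(layers)} hidden layer(s): J_cv ~ {j_cv:.3f}")
    if best is None or j_cv < best[1]:
        best = (layers, j_cv)

print(f"selected architecture: {best[0]}")
```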