In this video, I'm going to talk about some exciting recent work which I think will go a long way towards answering the question of how you set the hyper-parameters in a neural network. This recent work uses a different kind of machine learning to help us decide what values to use for the hyper-parameters. In other words, it's using machine learning to replace the graduate student who fiddles around with all these different settings of the hyper-parameters to find out what works. It relies on a way of modeling smooth functions called Gaussian processes, which I had always thought of as inadequate for doing things like speech and vision, and I still think they are inadequate for that. But when you're in a domain where you don't have much prior knowledge, and the only thing you can really appeal to is that you expect similar inputs to have similar outputs, then Gaussian processes are ideal. And that's the domain we're in when we're fiddling around with vectors of hyper-parameters hoping to find a vector that works well. So, for example, the number of hidden units, the number of layers, the weight penalty, and whether or not to use dropout: those are all hyper-parameters, and different combinations of them work well together. So this is a very hard space to explore by hand. It's very easy, when we're exploring by hand, to fail to notice things. Gaussian processes are very good at noticing trends in the data, and they provide a very good way of finding good sets of hyper-parameters if you have enough computers.

One of the commonest reasons that people give for not using neural networks is that it requires a lot of skill to set the hyper-parameters. This is actually a pretty good reason. If you don't have much experience, it's easy to get stuck using a completely wrong value for one of the hyper-parameters, and then nothing works. You have to set things like the number of layers, the number of units per layer, what types of units to use, the weight penalty, the learning rate, the momentum, and so on and so on. If you use a learning rate that's 100 times too big or 100 times too small, your network simply won't work.

One way to approach this is to do a naive grid search. That is, for each of these hyper-parameters, you make a list of alternative values and then you try all possible combinations of values. You can see that this is going to blow up.
If you have more than a few hyper-parameters, you're going to end up with many more combinations than you can possibly try.

It turns out that there's something that's considerably better than doing a naive grid search: we can just sample random combinations. That is, for each hyper-parameter we make a list of alternatives, and then we pick one value randomly from each list. The reason that's better is that some of the hyper-parameters won't have much effect and others will have a lot of effect. What we don't want to do is exactly repeat the settings of the hyper-parameters that have a lot of effect for different settings of the hyper-parameters that don't have much effect; we don't learn much that way. In a grid search, you'll have several points along each axis that are identical in all the other parameters. And so, if moving along that axis of the grid makes no difference, you've replicated the same experiment many times and haven't learned anything about the other parameters.

There's something you can do that's much better than random combinations, and basically it amounts to saying: let's use machine learning to simulate the graduate student who is trying to decide what the hyper-parameters should be. So, instead of using random combinations, we look at the results we've got so far and try to predict what combinations are likely to work well. That is, we have to predict regions of the hyper-parameter space in which we expect to get good results. It's not sufficient just to say how well we expect to do; we also have to have an idea of the uncertainty. We might, for example, have a region where we expect to do about the same as we're currently doing, but where maybe we would do much better. In that case, it would be worth going and exploring that region. It's even worth exploring regions where we expect to do worse, but where we might just do a lot better.

Now, we're going to assume that the amount of computation involved in evaluating one setting of the hyper-parameters is huge. It involves training a big neural network on a huge data set, and it might take several days on a big computer. Relative to that amount of work, building a model to predict how well a setting of the hyper-parameters will do, given all the settings we've experimented with so far, is much less work.
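To make the earlier contrast between naive grid search and random combinations concrete, here is a minimal sketch in Python; the hyper-parameter names and value lists are made up for illustration, not taken from any particular experiment.

```python
import itertools
import random

# Made-up lists of alternative values for each hyper-parameter.
alternatives = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
    "n_hidden":      [100, 200, 400, 800],
    "n_layers":      [1, 2, 3],
    "weight_cost":   [0.0, 0.0001, 0.001],
    "dropout":       [True, False],
}

# Naive grid search: every combination of every value.
# That's 4 * 4 * 3 * 3 * 2 = 288 full training runs, and it blows up
# as soon as more hyper-parameters or more alternatives are added.
grid = list(itertools.product(*alternatives.values()))
print(len(grid))

# Random combinations: pick one value per hyper-parameter, as many
# times as we can afford. Runs are unlikely to differ only in a
# hyper-parameter that turns out not to matter.
def random_setting():
    return {name: random.choice(values) for name, values in alternatives.items()}

trials = [random_setting() for _ in range(25)]
```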
So fitting the predictive model to the results of the experiments we've seen so far requires much less computation than running a single experiment.

So what kind of model are we going to use for predicting the results of future experiments? It turns out there's a kind of model I haven't talked about in this course, called a Gaussian process model. Basically, all these models do is assume that similar inputs give similar outputs. They don't have any more sophisticated prior than that, but they're very good at using that prior in an effective way. So if you don't know much about what effects to expect from the hyper-parameters, a weak prior like that is probably the best you can do.

Gaussian processes are able to learn, for each input dimension, what the appropriate scale is for measuring similarity. So, for example, if the number of hidden units could be 200 or it could be 300, the question is: are those similar numbers or are those very different numbers? Should we expect the results we get with 200 to be very similar to the results we get with 300, or should we expect them to be very different? If we don't know anything about neural nets, initially we have no idea, but we can look at the results of the experiments so far. And if experiments with 200 and experiments with 300 tend to give very similar answers, once you take into account the other differences between the experiments, then 200 is probably similar to 300. And so we set a scale for that dimension such that you need differences much bigger than that before you expect to get very different results.

Now, it's important that Gaussian process models do more than just predict the expected outcome of a particular experiment, that is, how well the neural net we train will do on a validation set. In addition to predicting a mean value for how well they expect the neural network to do, they predict a distribution: they predict the variance. They're called Gaussian processes because their predictions are Gaussian.
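As a rough sketch of those two ideas, a per-dimension similarity scale and a prediction that comes with both a mean and a variance, here is what fitting such a model might look like with scikit-learn's Gaussian process regressor. The experiment results and hyper-parameter values below are invented for illustration, and this is not the exact setup used in the work being described.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Invented results of experiments run so far: each row is a setting
# (log10 learning rate, number of hidden units), y is the validation
# accuracy that setting achieved.
X = np.array([[-3.0, 200.0],
              [-3.0, 300.0],
              [-2.0, 200.0],
              [-1.0, 400.0]])
y = np.array([0.86, 0.87, 0.90, 0.72])

# An anisotropic RBF kernel has a separate length-scale per input
# dimension, so the model can learn how big a change in each
# hyper-parameter has to be before we expect very different results.
kernel = ConstantKernel(1.0) * RBF(length_scale=[1.0, 100.0])
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# For a proposed new setting, the GP predicts a full Gaussian:
# a mean value and a variance (returned here as a standard deviation).
mean, std = gp.predict(np.array([[-2.0, 300.0]]), return_std=True)
print(mean[0], std[0])
```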
When they're making a prediction for a new setting of the hyper-parameters that is close to several consistent settings we've already run, so that we pretty much know the answer, the predictions will tend to be fairly sharp, that is, they'll have low variance. But when they're making predictions for hyper-parameter settings that are very different from any setting we've experimented with so far, the predictions made by Gaussian process models will have very high variance.

So here's quite a good strategy for using Gaussian processes to decide what to try next. Remember, we have one kind of learning model, which is a big neural network that takes a long time to train, and we're trying to figure out a good setting of the hyper-parameters to try next. We have a different kind of machine learning algorithm, called a Gaussian process, that's looking at the results of the experiments we've done so far and trying to predict, for some proposed new setting of the hyper-parameters, how well the neural network would do and also how unsure that prediction is.

What we're going to do is keep track of the hyper-parameters that have worked best so far, that is, the single setting of all the hyper-parameters that gave us the neural net with the highest performance so far. Now, when we run the next experiment, our best setting so far might be replaced by the new experiment, because it gives a neural net with better performance, or it might stay the same. Since we're only going to substitute in the result of the new experiment if it is better than anything we've seen so far, our best setting so far can only improve.

So here's a good strategy for deciding what setting of the hyper-parameters to try next: we pick a setting of the hyper-parameters such that the expected improvement in our best setting is big. We don't worry about the fact that we might do an experiment that leads to a really bad result, because if it gets a really bad result, we won't replace our best so far with this new experiment; also, we learn something. This is a phenomenon that managers of hedge funds know about. They often tell the client: if the fund goes up, I'll take 3% of your profits; if the fund goes down, you lose. Now, that's a crazy thing for a client to agree to, because it gives the hedge fund manager a huge incentive to take huge risks, since he has no significant downside. But for finding hyper-parameters that work well, it's a sensible strategy.

So, consider these three predictions, A, B, and C.
We're going to suppose that A, B, and C are different settings of the hyper-parameters that have not been tried, and those green Gaussians are the predictions of our Gaussian process model for how well each of those settings would do. For setting A, the mean is well below our current best so far, and there's only moderate variance. For setting B, the mean is closer to our best so far, but since there isn't much variance, there really isn't that much upside. For setting C, the mean is actually lower than for setting B, but because it has high variance, there's a big upside.

We're going to take the area under Gaussian C that's above the red line, and then take the moment of that area about the red line; that moment is the expected improvement we're after, and you can see that C has a much bigger moment than B or A. It may only have the same area as B above the line, but some of that area is much further above the line, so we might get a very big win if we try setting C. So that's the one our policy would tell us to pick here: A is the worst bet, B is intermediate, and C is the best bet.

So how well does this work? Well, if you've got the resources to run a lot of experiments, it's much better than a person at finding good combinations of the hyper-parameters. The policy I gave you so far is a strictly sequential policy that assumes it can see all of the experiments run so far, but there's no reason why you shouldn't make it a bit more complicated and run a whole bunch of experiments in parallel.

Using a Gaussian process model to predict how well a particular setting of the hyper-parameters will do is sensible, because it's not the kind of task we're good at. It's not like vision or speech, and it's not clear that there's a lot of complicated structure to be found in the data. It may be that the only real structure is that things are smooth and each dimension has some scale. Also, a person can't keep in mind the results of 50 different experiments to see what they predict. If you're doing all this by hand, you might just fail to notice that all of your good results had very small learning rates and all of your really bad results had very big learning rates, because you're attending to lots of other things that you're varying. A Gaussian process model would not miss a trend like that.
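Returning to the A, B, C comparison above: that "moment of the area above the red line" is the standard expected-improvement calculation for a Gaussian prediction. Here is a minimal sketch; the means, standard deviations, and best-so-far value are made-up numbers chosen to mimic the picture, not values from the actual figure.

```python
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far):
    """Moment, above the current best, of a Gaussian prediction N(mu, sigma^2)."""
    z = (mu - best_so_far) / sigma
    return (mu - best_so_far) * norm.cdf(z) + sigma * norm.pdf(z)

best = 0.90  # best validation accuracy so far (the red line)

# Made-up Gaussian predictions for the three proposed settings.
predictions = {
    "A": (0.85, 0.03),  # mean well below the best, moderate variance
    "B": (0.89, 0.01),  # mean close to the best, but little variance -> little upside
    "C": (0.87, 0.06),  # lower mean than B, but high variance -> big upside
}

for name, (mu, sigma) in predictions.items():
    print(name, round(expected_improvement(mu, sigma, best), 4))
# C has the largest expected improvement, so it's the one to try next.
```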
One final reason why Gaussian process models are a very good way of setting hyper-parameters is that they're much less likely than a person to cheat. Typically, when we're doing research, we want to compare a new method that we thought of with some old or standard method, and there's a very strong tendency to work harder at finding good hyper-parameters for our new method than for the stupid old method. That's why, when you compare methods, you should really compare the results obtained by different groups, where for each method the results are produced by the group that believes in that method. If we use Gaussian process models to search for good sets of hyper-parameters, they're going to search just as hard for the type of model we don't believe in as for the type of model we do believe in.