In this video, I'm going to talk about some exciting recent work which I think will go a long way towards answering the question of how you set all those hyper-parameters in a neural network. This recent work uses a different kind of machine learning to help us decide what values to use for the hyper-parameters. In other words, it's using machine learning to replace the graduate student who fiddles around with all these different settings of the hyper-parameters to find out what works. It relies on a way of modeling smooth functions called Gaussian processes, which I had always thought of as inadequate for doing things like speech and vision, and I still think they are inadequate for that. But when you're in a domain where you don't have much prior knowledge, and the only thing you can really appeal to is that you expect similar inputs to have similar outputs, then Gaussian processes are ideal. And that's the domain we're in when we're fiddling around with vectors of hyper-parameters hoping to find a vector that works well. For example, the number of hidden units, the number of layers, the weight penalty, and whether or not to use dropout are all hyper-parameters, and different combinations of them work well together. So this is a very hard space to explore by hand, and when we explore it by hand it's very easy to fail to notice things. Gaussian processes are very good at noticing trends in the data, and they provide a very good way of finding good sets of hyper-parameters if you have enough computers.

One of the commonest reasons people give for not using neural networks is that it requires a lot of skill to set the hyper-parameters. This is actually a pretty good reason. If you don't have much experience, it's easy to get stuck using a completely wrong value for one of the hyper-parameters, and then nothing works. You have to set things like the number of layers, the number of units per layer, what types of units to use, the weight penalty, the learning rate, the momentum, and so on and so on. If you use a learning rate that's 100 times too big or 100 times too small, your network simply won't work.

One way to approach this is to do a naive grid search. That is, for each of these hyper-parameters you make a list of alternative values, and then you try all possible combinations of values. You can see that this is going to blow up: if you have more than a few hyper-parameters, you're going to end up with many more combinations than you can possibly try. It turns out that there's something considerably better than doing a naive grid search: we can just sample random combinations. That is, for each hyper-parameter we make a list of alternatives and then we pick one value at random from each list. The reason that's better is that some of the hyper-parameters won't have much effect while others will have a lot of effect, and what we don't want to do is exactly repeat the settings of the hyper-parameters that have a lot of effect while varying only the ones that don't. We don't learn much that way. In a grid search, you'll have several points along each axis that are identical in all of the other parameters, so if moving along that axis makes no difference, you've replicated the same experiment many times and learned nothing new about the other parameters.
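To make the contrast concrete, here is a minimal sketch of grid search versus random sampling of hyper-parameter combinations. It is purely illustrative: the hyper-parameter names and candidate values are assumptions for the example, not settings from the lecture.

```python
import itertools
import random

# Candidate values for each hyper-parameter (illustrative assumptions only).
grid = {
    "hidden_units":   [100, 200, 400, 800],
    "learning_rate":  [1e-4, 1e-3, 1e-2, 1e-1],
    "weight_penalty": [0.0, 1e-5, 1e-4, 1e-3],
    "dropout":        [False, True],
}

# Naive grid search: the number of combinations multiplies up quickly.
all_combinations = list(itertools.product(*grid.values()))
print(len(all_combinations))  # 4 * 4 * 4 * 2 = 128 full training runs

# Random search: draw each hyper-parameter independently, so trials that differ
# only in a parameter that turns out not to matter are not wastefully repeated.
def sample_random_setting():
    return {name: random.choice(values) for name, values in grid.items()}

random_trials = [sample_random_setting() for _ in range(20)]
```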
There's something you can do that's much better than random combinations, and basically it amounts to saying: let's use machine learning to simulate the graduate student who is trying to decide what the hyper-parameters should be. So instead of using random combinations, we look at the results we've got so far and try to predict what combinations are likely to work well. That is, we have to predict regions of the hyper-parameter space in which we expect to get good results. It's not sufficient just to say how well we expect to do; we also have to have an idea of the uncertainty. We might, for example, have a region where we expect to do about the same as we're currently doing, but where we might do much better. In that case, it would be worth going and exploring that region. It's even worth exploring regions where we expect to do worse, if there's a chance we might do a lot better. Now, we're going to assume that the amount of computation involved in evaluating one setting of the hyper-parameters is huge: it involves training a big neural network on a huge data set, and it might take several days on a big computer. Relative to that amount of work, building a model to predict how well a setting of the hyper-parameters will do, given all the settings we've experimented with so far, is much less work. So it's going to require much less computation to fit the predictive model to the results of the experiments we've seen so far than it does to run a single experiment.

So what kind of model are we going to use for predicting the results of future experiments? It turns out there's a kind of model I haven't talked about in this course, called a Gaussian process model. Basically, all these models do is assume that similar inputs give similar outputs. They don't have any more sophisticated prior than that, but they're very good at using that prior in an effective way. So if you don't know much about what effect you expect the hyper-parameters to have, a weak prior like that is probably the best you can do. Gaussian processes are able to learn, for each input dimension, what the appropriate scale is for measuring similarity. For example, if the number of hidden units could be 200 or it could be 300, the question is: are those similar numbers or very different numbers? Should we expect the results we get with 200 to be very similar to the results we get with 300, or should we expect them to be very different? If we don't know anything about neural nets, initially we have no idea, but we can look at the results of the experiments so far. And if experiments with 200 units and experiments with 300 units tend to give very similar answers, once you take into account the other differences between the experiments, then 200 is probably similar to 300. So we set a scale for that dimension such that you need differences much bigger than that before you expect to get very different results.

Now, it's important that Gaussian process models do more than just predict the expected outcome of a particular experiment, that is, how well the neural net that we train will do on a validation set. In addition to predicting a mean value for how well they expect the neural network to do, they predict a whole distribution; in particular, they predict the variance. They're called Gaussian processes because their predictions are Gaussian.
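As a rough illustration of what such a predictive model looks like in code, here is a sketch using scikit-learn's GaussianProcessRegressor. The observed settings, the log scales, and the kernel choice are all assumptions made for the example; the point is just that the model returns both a mean and a standard deviation for a proposed new setting, and that an anisotropic kernel can learn a separate similarity scale for each hyper-parameter dimension.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Results of experiments so far. Each row is one setting:
# [log10(number of hidden units), log10(learning rate)]; y is validation accuracy.
X_seen = np.array([[np.log10(200), -3.0],
                   [np.log10(300), -3.0],
                   [np.log10(200), -1.0],
                   [np.log10(800), -2.0]])
y_seen = np.array([0.91, 0.92, 0.55, 0.89])

# An anisotropic RBF kernel: fitting its parameters lets the GP learn, for each
# input dimension, the scale on which two settings count as "similar".
kernel = ConstantKernel(1.0) * RBF(length_scale=[1.0, 1.0])
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_seen, y_seen)

# Predict how well a proposed new setting (400 units, learning rate 1e-2)
# would do: a Gaussian, i.e. a mean and a standard deviation.
x_new = np.array([[np.log10(400), -2.0]])
mean, std = gp.predict(x_new, return_std=True)
print(mean, std)
```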
When they make a prediction for a new setting of the hyper-parameters that is close to several settings we've already run, so we know the answers, the prediction will tend to be fairly sharp, that is, it will have low variance. But when they make predictions for experiments whose hyper-parameters are very different from anything we've tried so far, those predictions will have very high variance.

So here's quite a good strategy for using Gaussian processes to decide what to try next. Remember, we have one kind of learning model, which is a big neural network that takes a long time to train, and we're trying to figure out a good setting of the hyper-parameters to try next. We have a different kind of machine learning algorithm, called a Gaussian process, that looks at the results of the experiments we've done so far and tries to predict, for some proposed new setting of the hyper-parameters, how well the neural network would do and also how uncertain that prediction is. So what we're going to do is keep track of the setting of the hyper-parameters that has worked best so far, that is, the single setting of all the hyper-parameters that gave us the neural net with the best performance so far. Now, when we run the next experiment, our best setting so far might be replaced by the new setting, because it gives a better-performing neural net, or it might stay the same. Since we only substitute the new experiment's setting if it is better than anything we've seen so far, our best setting so far can only improve. So here's a good strategy for which setting of the hyper-parameters to try next: we pick a setting such that the expected improvement over our best setting so far is big. We don't worry about the fact that we might do an experiment that leads to a really bad result, because if it gets a really bad result we won't replace our best so far with it, and we'll still have learned something. This is a phenomenon that managers of hedge funds know about. They effectively tell the client: if the fund goes up, I'll take 3% of your profits; if the fund goes down, you take the loss. That's a crazy thing for a client to agree to, because it gives the hedge fund manager a huge incentive to take big risks, since he has no significant downside. But for finding hyper-parameters that work well, it's a sensible strategy.

So consider these three predictions, A, B, and C. We're going to suppose that A, B, and C are different settings of the hyper-parameters that have not been tried, and the green Gaussians are the predictions of our Gaussian process model for how well each of those settings would do. For setting A, the mean is well below our current best so far and there's only moderate variance. For setting B, the mean is closer to our best so far, but since there isn't much variance, there really isn't much upside. For setting C, the mean is actually lower than for setting B, but because it has high variance, there's a big upside. We're going to take the area under Gaussian C that lies above the red line, and take the moment of that area about the red line; that's the quantity we want to maximize, and you can see that C has a much bigger moment than B or A. It may only have the same area above the line as B, but some of that area is much further above the line, so we might get a very big win if we try setting C. So that's the one our policy would tell us to pick: here A is the worst bet, B is intermediate, and C is the best bet. So how well does this work?
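The moment of the area above the red line is what is usually called the expected improvement. Here is a small sketch of that computation for three hypothetical candidates; the means, standard deviations, and current best score are made-up numbers chosen so that the ordering comes out as in the A, B, C example.

```python
from scipy.stats import norm

def expected_improvement(mean, std, best_so_far):
    # For maximisation: the mean of (value - best) over the part of the
    # predictive Gaussian that lies above the current best score.
    if std == 0.0:
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    return (mean - best_so_far) * norm.cdf(z) + std * norm.pdf(z)

best = 0.92  # the red line: best validation accuracy found so far
candidates = {"A": (0.84, 0.02),   # mean well below the best, modest variance
              "B": (0.90, 0.01),   # mean close to the best, but little upside
              "C": (0.88, 0.06)}   # lower mean than B, but a big upside
for name, (mu, sigma) in candidates.items():
    print(name, expected_improvement(mu, sigma, best))
# C comes out on top: much of its probability mass lies far above the red line.
```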
Well, if you've got the resources to run a lot of experiments, it's much better than a person at finding good combinations of the hyper-parameters. The policy I gave you so far is a strictly sequential policy that assumes it can see the results of all the experiments run so far, but there's no reason why you shouldn't make it a bit more complicated and run a whole bunch of experiments in parallel. Using a Gaussian process model to predict how well a particular setting of the hyper-parameters will do is sensible, because it's not the kind of task we're good at. It's not like vision or speech, and it's not clear that there's a lot of complicated structure to be found in the data. It may be that the only real structure is that things are smooth and that each dimension has its own scale. Also, a person can't keep in mind the results of 50 different experiments and see what they predict. If you're doing all this by hand, you might just fail to notice that all of your good results had very small learning rates and all of your really bad results had very big learning rates, because you're attending to lots of other things that you're varying. A Gaussian process model would not miss a trend like that.

One final reason why Gaussian process models are a very good way of setting hyper-parameters is that they're much less likely than a person to cheat. Typically, when we're doing research, we want to compare a new method that we thought of with some old or standard method, and there's a very strong tendency to work harder at finding good hyper-parameters for our new method than for the stupid old method. That's why, when you compare methods, you should really compare results obtained by different groups, where for each method the results are produced by the group that believes in that method. If we use Gaussian process models to search for good sets of hyper-parameters, they'll do just as hard a search for the type of model we don't believe in as for the type of model we do believe in.
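To tie the pieces together, here is a hedged sketch of the sequential policy described above: fit a Gaussian process to the experiments run so far, score a batch of candidate settings by expected improvement, and hand the winner to the expensive training run. The names train_and_validate and sample_candidates are hypothetical stand-ins (not from the lecture) for the neural-net training run and a random hyper-parameter sampler; the batch of candidates scored on each round is also where a parallel variant could farm out several settings at once.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def expected_improvement(mean, std, best):
    # Same "moment above the red line" helper as in the sketch above.
    if std == 0.0:
        return max(mean - best, 0.0)
    z = (mean - best) / std
    return (mean - best) * norm.cdf(z) + std * norm.pdf(z)

def bayes_opt(train_and_validate, sample_candidates, n_dims, n_rounds=30):
    """Sequentially pick hyper-parameter settings with large expected improvement.

    train_and_validate(x) -> validation score (the expensive neural-net run).
    sample_candidates(k)  -> list of k candidate settings (arrays of length n_dims).
    """
    X, y = [], []
    for _ in range(n_rounds):
        if len(X) < 3:
            # Too few results to fit a useful model yet: fall back to random search.
            x_next = sample_candidates(1)[0]
        else:
            kernel = ConstantKernel(1.0) * RBF(length_scale=[1.0] * n_dims)
            gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
            gp.fit(np.array(X), np.array(y))
            candidates = sample_candidates(500)
            means, stds = gp.predict(np.array(candidates), return_std=True)
            best = max(y)
            scores = [expected_improvement(m, s, best) for m, s in zip(means, stds)]
            x_next = candidates[int(np.argmax(scores))]
        X.append(x_next)
        y.append(train_and_validate(x_next))  # days of training, in the real case
    return X[int(np.argmax(y))]                # the best setting found
```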