In this video, I'm going to talk about some exciting recent work which I think will go a long way towards answering the question of how you set all those hyper-parameters in a neural network. This recent work uses a different kind of machine learning to help us decide what values to use for the hyper-parameters. In other words, it's using machine learning to replace the graduate student who fiddles around with all these different settings of the hyper-parameters to find out what works. It relies on a way of modeling smooth functions called Gaussian processes, which I had always thought of as inadequate for doing things like speech and vision, and I still think they are inadequate for that. But when you're in a domain where you don't have much prior knowledge, and the only thing you can really appeal to is that you expect similar inputs to have similar outputs, then Gaussian processes are ideal. And that's the domain we're in when we're fiddling around with vectors of hyper-parameters hoping to find a vector that works well. For example, the number of hidden units, the number of layers, the weight penalty, and whether or not to use dropout are all hyper-parameters, and different combinations of them work well together. So this is a very hard space to explore by hand, and when we explore it by hand it's very easy to fail to notice things. Gaussian processes are very good at noticing trends in the data, and they provide a very good way of finding good sets of hyper-parameters if you have enough computers.

One of the commonest reasons people give for not using neural networks is that it requires a lot of skill to set the hyper-parameters. This is actually a pretty good reason. If you don't have much experience, it's easy to get stuck using a completely wrong value for one of the hyper-parameters, and then nothing works. You have to set things like the number of layers, the number of units per layer, what types of units to use, the weight penalty, the learning rate, the momentum, and so on and so on. If you use a learning rate that's 100 times too big or 100 times too small, your network simply won't work.

One way to approach this is to do a naive grid search. That is, for each of these hyper-parameters you make a list of alternative values, and then you try all possible combinations of values. You can see that this is going to blow up: if you have more than a few hyper-parameters, you're going to end up with many more combinations than you can possibly try. It turns out that there's something considerably better than doing a naive grid search: we can just sample random combinations. That is, for each hyper-parameter we make a list of alternatives and then we pick one value at random from each list. The reason that's better is that some of the hyper-parameters won't have much effect while others will have a lot of effect, and what we don't want to do is exactly repeat the settings of the hyper-parameters that have a lot of effect while varying only the ones that don't. We don't learn much that way. In a grid search, you'll have several points along each axis that are identical in all of the other parameters, so if moving along that axis makes no difference, you've replicated the same experiment many times and learned nothing new about the other parameters.
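To make the contrast concrete, here is a minimal sketch of grid search versus random sampling of hyper-parameter combinations. It is purely illustrative: the hyper-parameter names and candidate values are assumptions for the example, not settings from the lecture.

```python
import itertools
import random

# Candidate values for each hyper-parameter (illustrative assumptions only).
grid = {
    "hidden_units":   [100, 200, 400, 800],
    "learning_rate":  [1e-4, 1e-3, 1e-2, 1e-1],
    "weight_penalty": [0.0, 1e-5, 1e-4, 1e-3],
    "dropout":        [False, True],
}

# Naive grid search: the number of combinations multiplies up quickly.
all_combinations = list(itertools.product(*grid.values()))
print(len(all_combinations))  # 4 * 4 * 4 * 2 = 128 full training runs

# Random search: draw each hyper-parameter independently, so trials that differ
# only in a parameter that turns out not to matter are not wastefully repeated.
def sample_random_setting():
    return {name: random.choice(values) for name, values in grid.items()}

random_trials = [sample_random_setting() for _ in range(20)]
```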
There's something you can do that's much better than random combinations, and basically it amounts to saying: let's use machine learning to simulate the graduate student who is trying to decide what the hyper-parameters should be. So instead of using random combinations, we look at the results we've got so far and try to predict what combinations are likely to work well. That is, we have to predict regions of the hyper-parameter space in which we expect to get good results. It's not sufficient just to say how well we expect to do; we also have to have an idea of the uncertainty. We might, for example, have a region where we expect to do about the same as we're currently doing, but where we might do much better. In that case, it would be worth going and exploring that region. It's even worth exploring regions where we expect to do worse, if there's a chance we might do a lot better. Now, we're going to assume that the amount of computation involved in evaluating one setting of the hyper-parameters is huge: it involves training a big neural network on a huge data set, and it might take several days on a big computer. Relative to that amount of work, building a model to predict how well a setting of the hyper-parameters will do, given all the settings we've experimented with so far, is much less work. So it's going to require much less computation to fit the predictive model to the results of the experiments we've seen so far than it does to run a single experiment.

So what kind of model are we going to use for predicting the results of future experiments? It turns out there's a kind of model I haven't talked about in this course, called a Gaussian process model. Basically, all these models do is assume that similar inputs give similar outputs. They don't have any more sophisticated prior than that, but they're very good at using that prior in an effective way. So if you don't know much about what effect you expect the hyper-parameters to have, a weak prior like that is probably the best you can do. Gaussian processes are able to learn, for each input dimension, what the appropriate scale is for measuring similarity. For example, if the number of hidden units could be 200 or it could be 300, the question is: are those similar numbers or very different numbers? Should we expect the results we get with 200 to be very similar to the results we get with 300, or should we expect them to be very different? If we don't know anything about neural nets, initially we have no idea, but we can look at the results of the experiments so far. And if experiments with 200 units and experiments with 300 units tend to give very similar answers, once you take into account the other differences between the experiments, then 200 is probably similar to 300. So we set a scale for that dimension such that you need differences much bigger than that before you expect to get very different results.

Now, it's important that Gaussian process models do more than just predict the expected outcome of a particular experiment, that is, how well the neural net that we train will do on a validation set. In addition to predicting a mean value for how well they expect the neural network to do, they predict a whole distribution; in particular, they predict the variance. They're called Gaussian processes because their predictions are Gaussian.
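As a rough illustration of what such a predictive model looks like in code, here is a sketch using scikit-learn's GaussianProcessRegressor. The observed settings, the log scales, and the kernel choice are all assumptions made for the example; the point is just that the model returns both a mean and a standard deviation for a proposed new setting, and that an anisotropic kernel can learn a separate similarity scale for each hyper-parameter dimension.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Results of experiments so far. Each row is one setting:
# [log10(number of hidden units), log10(learning rate)]; y is validation accuracy.
X_seen = np.array([[np.log10(200), -3.0],
                   [np.log10(300), -3.0],
                   [np.log10(200), -1.0],
                   [np.log10(800), -2.0]])
y_seen = np.array([0.91, 0.92, 0.55, 0.89])

# An anisotropic RBF kernel: fitting its parameters lets the GP learn, for each
# input dimension, the scale on which two settings count as "similar".
kernel = ConstantKernel(1.0) * RBF(length_scale=[1.0, 1.0])
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_seen, y_seen)

# Predict how well a proposed new setting (400 units, learning rate 1e-2)
# would do: a Gaussian, i.e. a mean and a standard deviation.
x_new = np.array([[np.log10(400), -2.0]])
mean, std = gp.predict(x_new, return_std=True)
print(mean, std)
```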
When they make a prediction for a new setting of the hyper-parameters that is close to several settings we've already run, so we know the answers, the prediction will tend to be fairly sharp, that is, it will have low variance. But when they make predictions for experiments whose hyper-parameters are very different from anything we've tried so far, those predictions will have very high variance.

So here's quite a good strategy for using Gaussian processes to decide what to try next. Remember, we have one kind of learning model, which is a big neural network that takes a long time to train, and we're trying to figure out a good setting of the hyper-parameters to try next. We have a different kind of machine learning algorithm, called a Gaussian process, that looks at the results of the experiments we've done so far and tries to predict, for some proposed new setting of the hyper-parameters, how well the neural network would do and also how uncertain that prediction is. So what we're going to do is keep track of the setting of the hyper-parameters that has worked best so far, that is, the single setting of all the hyper-parameters that gave us the neural net with the best performance so far. Now, when we run the next experiment, our best setting so far might be replaced by the new setting, because it gives a better-performing neural net, or it might stay the same. Since we only substitute the new experiment's setting if it is better than anything we've seen so far, our best setting so far can only improve. So here's a good strategy for which setting of the hyper-parameters to try next: we pick a setting such that the expected improvement over our best setting so far is big. We don't worry about the fact that we might do an experiment that leads to a really bad result, because if it gets a really bad result we won't replace our best so far with it, and we'll still have learned something. This is a phenomenon that managers of hedge funds know about. They effectively tell the client: if the fund goes up, I'll take 3% of your profits; if the fund goes down, you take the loss. That's a crazy thing for a client to agree to, because it gives the hedge fund manager a huge incentive to take big risks, since he has no significant downside. But for finding hyper-parameters that work well, it's a sensible strategy.

So consider these three predictions, A, B, and C. We're going to suppose that A, B, and C are different settings of the hyper-parameters that have not been tried, and the green Gaussians are the predictions of our Gaussian process model for how well each of those settings would do. For setting A, the mean is well below our current best so far and there's only moderate variance. For setting B, the mean is closer to our best so far, but since there isn't much variance, there really isn't much upside. For setting C, the mean is actually lower than for setting B, but because it has high variance, there's a big upside. We're going to take the area under Gaussian C that lies above the red line, and take the moment of that area about the red line; that's the quantity we want to maximize, and you can see that C has a much bigger moment than B or A. It may only have the same area above the line as B, but some of that area is much further above the line, so we might get a very big win if we try setting C. So that's the one our policy would tell us to pick: here A is the worst bet, B is intermediate, and C is the best bet. So how well does this work?
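The moment of the area above the red line is what is usually called the expected improvement. Here is a small sketch of that computation for three hypothetical candidates; the means, standard deviations, and current best score are made-up numbers chosen so that the ordering comes out as in the A, B, C example.

```python
from scipy.stats import norm

def expected_improvement(mean, std, best_so_far):
    # For maximisation: the mean of (value - best) over the part of the
    # predictive Gaussian that lies above the current best score.
    if std == 0.0:
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    return (mean - best_so_far) * norm.cdf(z) + std * norm.pdf(z)

best = 0.92  # the red line: best validation accuracy found so far
candidates = {"A": (0.84, 0.02),   # mean well below the best, modest variance
              "B": (0.90, 0.01),   # mean close to the best, but little upside
              "C": (0.88, 0.06)}   # lower mean than B, but a big upside
for name, (mu, sigma) in candidates.items():
    print(name, expected_improvement(mu, sigma, best))
# C comes out on top: much of its probability mass lies far above the red line.
```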
Well, if you've got the resources to run a lot of experiments, it's much better than a person at finding good combinations of the hyper-parameters. The policy I gave you so far is a strictly sequential policy that assumes it can see the results of all the experiments run so far, but there's no reason why you shouldn't make it a bit more complicated and run a whole bunch of experiments in parallel. Using a Gaussian process model to predict how well a particular setting of the hyper-parameters will do is sensible, because it's not the kind of task we're good at. It's not like vision or speech, and it's not clear that there's a lot of complicated structure to be found in the data. It may be that the only real structure is that things are smooth and that each dimension has its own scale. Also, a person can't keep in mind the results of 50 different experiments and see what they predict. If you're doing all this by hand, you might just fail to notice that all of your good results had very small learning rates and all of your really bad results had very big learning rates, because you're attending to lots of other things that you're varying. A Gaussian process model would not miss a trend like that.

One final reason why Gaussian process models are a very good way of setting hyper-parameters is that they're much less likely than a person to cheat. Typically, when we're doing research, we want to compare a new method that we thought of with some old or standard method, and there's a very strong tendency to work harder at finding good hyper-parameters for our new method than for the stupid old method. That's why, when you compare methods, you should really compare results obtained by different groups, where for each method the results are produced by the group that believes in that method. If we use Gaussian process models to search for good sets of hyper-parameters, they'll do just as hard a search for the type of model we don't believe in as for the type of model we do believe in.
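To tie the pieces together, here is a hedged sketch of the sequential policy described above: fit a Gaussian process to the experiments run so far, score a batch of candidate settings by expected improvement, and hand the winner to the expensive training run. The names train_and_validate and sample_candidates are hypothetical stand-ins (not from the lecture) for the neural-net training run and a random hyper-parameter sampler; the batch of candidates scored on each round is also where a parallel variant could farm out several settings at once.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def expected_improvement(mean, std, best):
    # Same "moment above the red line" helper as in the sketch above.
    if std == 0.0:
        return max(mean - best, 0.0)
    z = (mean - best) / std
    return (mean - best) * norm.cdf(z) + std * norm.pdf(z)

def bayes_opt(train_and_validate, sample_candidates, n_dims, n_rounds=30):
    """Sequentially pick hyper-parameter settings with large expected improvement.

    train_and_validate(x) -> validation score (the expensive neural-net run).
    sample_candidates(k)  -> list of k candidate settings (arrays of length n_dims).
    """
    X, y = [], []
    for _ in range(n_rounds):
        if len(X) < 3:
            # Too few results to fit a useful model yet: fall back to random search.
            x_next = sample_candidates(1)[0]
        else:
            kernel = ConstantKernel(1.0) * RBF(length_scale=[1.0] * n_dims)
            gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
            gp.fit(np.array(X), np.array(y))
            candidates = sample_candidates(500)
            means, stds = gp.predict(np.array(candidates), return_std=True)
            best = max(y)
            scores = [expected_improvement(m, s, best) for m, s in zip(means, stds)]
            x_next = candidates[int(np.argmax(scores))]
        X.append(x_next)
        y.append(train_and_validate(x_next))  # days of training, in the real case
    return X[int(np.argmax(y))]                # the best setting found
```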