In the previous videos, we put
together almost all
the pieces you need in order
to implement and train in your network.
There's just one last idea I
need to share with you, which
is the idea of random initialization.
When you're running an algorithm like
gradient descent or also the
advanced optimization algorithms, we
need to pick some initial value for the parameters theta.
So for the advanced optimization algorithm, you know,
it assumes that you will
pass it some initial value
for the parameters theta.
Now let's consider gradient descent.
For that, you know, we also need to initialize theta to something.
And then we can slowly take steps
go downhill, using graded descent,
to go downhill to minimize the function J of theta.
So what do we set the initial value of theta to?
Is it possible to set
the initial value of theta
to the vector of all zeroes.
Whereas this worked okay when we were using logistic regression.
Initializing all of your
parameters to zero actually
does not work when you're trading a neural network.
Consider training the following neural network.
And let's say we initialized all of the parameters in the network to zero.
And if you do that then
what that means is that
at the initialization this blue weight, that I'm covering blue
is going to equal to that weight.
So, they're both zero.
And this weight that I'm covering
in in red, is equal to that weight.
Which I'm covering it in red.
And also this weight, well
which I'm covering it in green
is going to be equal to the value of that weight.
And what that means is that both of your hidden units: a1 and a2
are going to be computing the same function
of your inputs.
And thus, you end up with
for everyone of your training your examples.
You end up with a(2)1 equals a(2)2.
and moreover because, I'm not
going to show this too much
detail, but because these out
going weights are the same you
can also show that the
delta values are also going to be the same.
So concretely, you end up
with delta 1 1,
delta 2 1, equals delta 2 2.
And if you work through the
map further, what you can
show is that the partial derivatives
with respect to your parameters will satisfy the following.
That the partial derivative
of the cost
function with respect to
writing out the derivatives respect to
these two blue weights neural network.
You'll find that these two partial
derivatives are going to be equal to each other.
And so, what this means, is
that even after say, one gradient descent update.
You're going to update, say this
first blue weight with, you know, learning rate times this.
And you're going to update the second
blue weight to a sum learning rate times this.
But what this means is
that even after one gradient
descent update, those two
blue weights, those two blue
color parameters will end
up the same as each other.
So they'll be some non-zero
value now, but this value
will be equal to that value.
And similarly, even after one gradient descent update.
This value will equal to that value.
There will be some non-zero values.
Just that the two red values will be equal to each other.
And similarly the two green
weights, they'll both change values
but they'll both end up the same value as each other.
So after each update, the parameters corresponding
to the inputs going to each
of the two hidden units identical.
That's just saying that the two
green weights must be sustained,
the two red weights must be
sustained, the two blue weights
are still the same and what
that means is that even after
one iteration of say, gradient
descent, you find that
your two hidden units are still
computing exactly the same function that the input.
So you still have this a(1)2 equals a(2)2.
And so you're back to this case.
And as keep running gradient descent.
The blue weights, the two blue weights will stay the same as each other.
The two red weights will stay the same as each other.
The two green weights will stay the same as each other.
And what this means
is that your neural network really
can't compute very interesting functions.
Imagine that you had
not only two hidden
units but imagine
that you had many many hidden units.
Then what this is saying is that
all of your hidden units are
computing the exact same
feature, all of your hidden units are computing all of the exact same function of the input.
And this is a highly redundant representation.
Because that means that your
final logistic regression unit, you know, really only gets to see one feature.
Because all of these are the same
and this prevents your neural network from learning something interesting.
In order to get around this
problem, the way we initialize
the parameters of a neural network
therefore, is with random initialization.
Concretely, the problem we
saw on the previous slide
is sometimes called the problem
of symmetric weights, that is if the weights all being the same.
And so this random initialization
is how we perform symmetry breaking.
So what we do is we
initialize each value of
theta to a random
number between minus epsilon and epsilon.
So this is a notation to
mean numbers between minus epsilon and plus epsilon.
So my weights on my
parameters are all going
to be randomly initialized between minus epsilon and plus epsilon.
The way I write code to do
this in octave, this I've said you know theta 1 to be equal to this.
So this rand 10 by 11.
That's how you compute
a random 10 by 11
dimensional matrix, and all
of the values are between 0 and 1.
So these are going to
be real numbers that take on
any continuous values between 0 and 1.
And so, if you take a
number between 0 and
1, multiply it by 2
times an epsilon, and
minus an epsilon, then you
end up with a number that's
between minus epsilon and plus epsilon.
And incidentally, this epsilon here
has nothing to do
with the epsilon that we were
using when we were doing gradient checking.
So when we were doing numerical gradient checking,
there we were adding some values of epsilon to theta.
This is, you know, an unrelated value of epsilon.
Which is why I am denoting
in it epsilon, just to distinguish
it from the value of epsilon we were using in gradient checking.
Absolutely, if you want to
initialize theta 2
to a random 1 by
11 matrix, you can do so using this piece of code here.
So, to summarize, to
train a neural network, what you
should do is randomly initialize the
weights to, you know, small
values close to 0, between
minus epsilon and plus epsilon,
say, and then implement
back-propagation; do gradient checking;
and use either gradient
descent or one of the
advanced optimization algorithms to try
to minimize J of theta
as a function of the
parameters theta starting from just
randomly chosen initial value for the parameters.
And by doing symmetry breaking, which is this process.
Hopefully, gradient descent or the
advanced optimization algorithms will be
able to find a good value of theta.