In the previous video, we talked about the backpropagation algorithm. To a lot of people seeing it for the first time, the first impression is often: wow, this is a very complicated algorithm, there are all these different steps, I'm not quite sure how they fit together, and it's kind of like a black box. In case that's how you are feeling about backpropagation, that's actually okay. Backpropagation is, unfortunately, a less mathematically clean, less mathematically simple algorithm than linear regression or logistic regression. I've used backpropagation pretty successfully for many years, and even today I sometimes feel like I don't have a very good sense of just what it's doing, or much intuition about what backpropagation is really computing. For those of you doing the programming exercises, those will at least mechanically step you through the different steps of how to implement backprop, so you will be able to get it to work for yourself. What I want to do in this video is look a little bit more at the mechanical steps of backpropagation and try to give you a little more intuition about what those steps are doing, to hopefully convince you that it is at least a reasonable algorithm. If even after this video backpropagation still seems very black box, with too many complicated steps and a little bit magical to you, that's also okay. Even though I have used backprop for many years, it is sometimes a difficult algorithm to understand. But hopefully this video will help a little bit.

In order to better understand backpropagation, let's take another, closer look at what forward propagation is doing. Here's a neural network with two input units (not counting the bias unit), two hidden units in this layer, two hidden units in the next layer, and finally one output unit; again, the counts 2, 2, 2 do not include the bias units on top. In order to illustrate forward propagation, I'm going to draw this network a little bit differently: I'll draw the nodes as very fat ellipses so that I can write text inside them. When performing forward propagation, we might have some particular example, say (x(i), y(i)), and it is this x(i) that we feed into the input layer, so x(i)1 and x(i)2 are the values we set the input layer to. When we forward propagate to the first hidden layer, we compute z(2)1 and z(2)2, the weighted sums of the inputs coming from the input units, and then we apply the sigmoid, or logistic, activation function to these z values, which gives us the activation values a(2)1 and a(2)2. Then we forward propagate again to get z(3)1, apply the activation function to get a(3)1, and do the same for the other unit in that layer, until we get z(4)1; applying the activation function to that gives us a(4)1, which is the final output value of the network.
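To make that forward pass concrete, here is a minimal NumPy sketch of forward propagation for the 2-2-2-1 network described above. The weight matrices Theta1, Theta2, and Theta3 are hypothetical placeholders (one row per unit of the next layer, with the bias weight in column 0); this is just an illustration of the computation, not the course's reference implementation.

```python
import numpy as np

def sigmoid(z):
    # Logistic activation function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Theta1, Theta2, Theta3):
    """Forward pass for the 2-2-2-1 network: x has 2 features,
    Theta1 and Theta2 are 2x3, Theta3 is 1x3 (bias weight in column 0)."""
    a1 = np.concatenate(([1.0], x))            # input layer plus bias unit
    z2 = Theta1 @ a1                           # z(2)1, z(2)2: weighted sums
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # a(2)1, a(2)2 plus bias unit
    z3 = Theta2 @ a2                           # z(3)1, z(3)2
    a3 = np.concatenate(([1.0], sigmoid(z3)))  # a(3)1, a(3)2 plus bias unit
    z4 = Theta3 @ a3                           # z(4)1
    a4 = sigmoid(z4)                           # a(4)1 = h_Theta(x), the output
    return a2, a3, a4
```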
Let's erase this arrow to give myself some space, and look at what this computation is really doing, focusing on this hidden unit. Let's say that this weight, shown in magenta there, is my weight Theta(2)10 (the exact indexing is not important), this weight here, which I'm highlighting in red, is Theta(2)11, and this weight here, which I'm drawing in cyan, is Theta(2)12. Then the way the network computes the value z(3)1 is: z(3)1 equals this magenta weight times this value, that's Theta(2)10 times 1, plus this red weight times this value, that's Theta(2)11 times a(2)1, plus this cyan weight times this value, that's Theta(2)12 times a(2)2. So that's forward propagation. And it turns out that, as we'll see later in this video, what backpropagation does is a process very similar to this, except that instead of the computations flowing from the left to the right of the network, the computations flow from the right to the left, using a very similar calculation; I'll say in two slides exactly what I mean by that.

To better understand what backpropagation is doing, let's look at the cost function. This is just the cost function we had for the case of only one output unit; if we have more than one output unit, we just add a summation over the output unit index, but with only one output unit this is the cost function. We do forward propagation and backpropagation on one example at a time, so let's focus on the single example (x(i), y(i)), focus on the case of having one output unit, so y(i) here is just a real number, and let's ignore regularization, so lambda equals zero and the final regularization term goes away. Now, if you look inside the summation, you find that the cost term associated with the i-th training example, that is, the cost associated with training example (x(i), y(i)), is given by the same log-based expression we used for logistic regression, applied to the network's output h(x(i)) and the label y(i). And what this cost function does is play a role similar to the squared error. So, rather than looking at this complicated expression, if you want you can think of cost(i) as being approximately the squared difference between the neural network's output and the actual value, roughly (h(x(i)) - y(i))^2. Just as in logistic regression, we actually prefer to use this slightly more complicated cost function involving the log, but for the purpose of intuition, feel free to think of the cost function as being more or less the squared error cost function. So this cost(i) measures how well the network is doing on correctly predicting example i, that is, how close the output is to the actually observed label y(i).

Now let's look at what backpropagation is doing. One useful intuition is that backpropagation is computing these delta(l)j terms, and we can think of them as the "error" of the activation value we got for unit j in the l-th layer. More formally, and this is maybe only for those of you familiar with calculus, what the delta terms actually are is this: delta(l)j is the partial derivative of the cost function with respect to z(l)j, the weighted sum of inputs that we compute for that unit. Concretely, the cost function is a function of the label y and of h(x), the output value of the neural network.
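As a small illustration of the two formulas just described, here is a hedged sketch that reuses the placeholder Theta matrices and the a2 vector (bias activation in position 0) from the forward-pass sketch above; the sign convention on the log cost is the usual logistic-regression one.

```python
import numpy as np

def cost_i(y_i, h_i):
    """Per-example log cost, the same form used for logistic regression.
    It plays a role similar to the squared error (h_i - y_i)**2."""
    return -(y_i * np.log(h_i) + (1.0 - y_i) * np.log(1.0 - h_i))

def z3_1(Theta2, a2):
    """Explicit weighted sum for the first unit of layer 3, as in the video:
    z(3)1 = Theta(2)10 * 1 + Theta(2)11 * a(2)1 + Theta(2)12 * a(2)2."""
    return Theta2[0, 0] * 1.0 + Theta2[0, 1] * a2[1] + Theta2[0, 2] * a2[2]
```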
And if we could go inside the neural network and just change those z(l)j values a little bit, that would affect the values the neural network computes and outputs, and so it would end up changing the cost function. Again, this is really only for those of you who are comfortable with partial derivatives: what these delta terms turn out to be is the partial derivatives of the cost function with respect to these intermediate terms that we're computing. So they are a measure of how much we would like to change the neural network's weights, in order to affect these intermediate values of the computation, so as to affect the final output of the neural network, h(x), and therefore the overall cost. In case this partial derivative intuition didn't make sense, don't worry about it; we can do the rest of this without really talking about partial derivatives. But let's look in more detail at what backpropagation is doing.

For the output layer, backpropagation first sets this delta term, delta(4)1, to y(i) minus a(4)1, if we're doing forward propagation and backpropagation on training example i. So it's really the error, the difference between the actual value of y and the value that was predicted, and that's how we compute delta(4)1. Next we're going to propagate these values backwards, in a way I'll explain in a second, and end up computing the delta terms of the previous layer: delta(3)1 and delta(3)2. Then we propagate this further backward and end up computing delta(2)1 and delta(2)2.

Now, the backpropagation calculation is a lot like running the forward propagation algorithm, but doing it backwards. Here's what I mean. Let's look at how we end up with this value of delta(2)2. Similar to forward propagation, let me label a couple of the weights: this weight, shown in cyan, let's say is Theta(2)12, and this weight down here, which I'll highlight in red, is, let's say, Theta(2)22. If we look at how delta(2)2 is computed for this node, it turns out that what we do is take this delta value and multiply it by this weight, and add it to that delta value multiplied by that weight. So it's really a weighted sum of these delta values, weighted by the corresponding edge strengths. Concretely, let me fill this in: delta(2)2 is equal to Theta(2)12, which is that cyan weight, times delta(3)1, plus Theta(2)22, the weight I have in red, times delta(3)2. So it is literally this cyan weight times this delta value, plus this red weight times that delta value, and that's how we wind up with this value of delta. And just as another example, let's look at this value: how did we get it? Well, it's a similar process. If this weight, which I'm going to highlight in green, is equal to, say, Theta(3)12, then delta(3)2 is equal to that green weight, Theta(3)12, times delta(4)1. And by the way, so far I've been writing the delta values only for the hidden units, excluding the bias units. Depending on how you define the backpropagation algorithm, or depending on how you implement it, you may end up implementing something that computes delta values for these bias units as well.
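Here is a hedged sketch of these backward delta computations for one training example, using the same placeholder weight shapes as the earlier forward-pass sketch (bias weight in column 0 of each Theta). It mirrors the weighted-sum intuition described above; the full algorithm from the previous video also multiplies each hidden-layer delta by the derivative of the activation function, a factor left out of this intuitive picture.

```python
def backward_deltas(y_i, a4, Theta2, Theta3):
    """Weighted-sum intuition for the delta terms of the 2-2-2-1 network.
    a4 is the network's output vector from forward propagation."""
    delta4_1 = y_i - a4[0]                     # delta(4)1 = y(i) - a(4)1
    # Propagate backwards to layer 3 (column 0 of Theta3 is the bias weight):
    delta3_1 = Theta3[0, 1] * delta4_1         # Theta(3)11 * delta(4)1
    delta3_2 = Theta3[0, 2] * delta4_1         # Theta(3)12 * delta(4)1
    # And backwards again to layer 2, e.g.
    # delta(2)2 = Theta(2)12 * delta(3)1 + Theta(2)22 * delta(3)2:
    delta2_1 = Theta2[0, 1] * delta3_1 + Theta2[1, 1] * delta3_2
    delta2_2 = Theta2[0, 2] * delta3_1 + Theta2[1, 2] * delta3_2
    return delta2_1, delta2_2, delta3_1, delta3_2, delta4_1
```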
The bias units always output the value plus one; they are just what they are, and there is no way for us to change that value. So, depending on your implementation of backprop, and in the way I usually implement it, I do end up computing these delta values, but we just discard them and don't use them, because they don't end up being part of the calculation needed to compute the derivatives. So, hopefully, that gives you a little bit of intuition about what backpropagation is doing. In case all of this still seems magical and black box, in a later video, the putting-it-together video, I'll try to give a little more intuition about what backpropagation is doing. But, unfortunately, this is a difficult algorithm to try to visualize and understand what it is really doing. Fortunately, many people have been using it very successfully for many years, and if you implement the algorithm, you will have a very effective learning algorithm, even though its inner workings can be harder to visualize.
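For what it's worth, here is a tiny illustration of that bookkeeping, assuming a hypothetical implementation that stores a delta entry for a layer's bias slot; the numbers are placeholders for illustration only.

```python
import numpy as np

# Hypothetical delta vector for layer 2 whose index 0 corresponds to the
# bias unit; the values are placeholders, not real computed deltas.
delta2_with_bias = np.array([0.31, -0.12, 0.47])

# As described above, the bias unit's delta is simply discarded, and only
# delta(2)1 and delta(2)2 are kept for the derivative computation.
delta2 = delta2_with_bias[1:]
```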