In the last video, we gave a mathematical definition of how to represent, or how to compute, the hypothesis used by a neural network. In this video, I'd like to show you how to actually carry out that computation efficiently, that is, show you a vectorized implementation. And second, and more importantly, I want to start giving you intuition about why these neural network representations might be a good idea and how they can help us learn complex nonlinear hypotheses.

Consider this neural network. Previously we said that the sequence of steps we need in order to compute the output of the hypothesis is these equations given on the left, where we compute the activation values of the three hidden units and then use those to compute the final output of our hypothesis h of x. Now, I'm going to define a few extra terms. So this term that I'm underlining here, I'm going to define that to be z superscript 2 subscript 1, so that we have that a(2)1, which is this term, is equal to g of z(2)1. And by the way, the superscript 2 in parentheses, on both the z and the a, means that these are values associated with layer 2, that is, with the hidden layer in the neural network. Now this term here I'm going to similarly define as z(2)2. And finally, this last term here that I'm underlining, let me define that as z(2)3, so that similarly we have a(2)3 equals g of z(2)3. So these z values are just a linear combination, a weighted linear combination, of the input values x0, x1, x2, x3 that go into a particular neuron.

Now, if you look at this block of numbers, you may notice that it looks suspiciously like a matrix-vector operation, the matrix-vector multiplication of Theta 1 times the vector x. Using this observation, we're going to be able to vectorize this computation of the neural network. Concretely, let's define the feature vector x as usual to be the vector of x0, x1, x2, x3, where x0 as usual is always equal to 1, and let's define z2 to be the vector of these z values, you know, z(2)1, z(2)2, z(2)3. And notice that z2 is a three-dimensional vector. We can now vectorize the computation of a(2)1, a(2)2, a(2)3 in two steps. We can compute z2 as Theta 1 times x, and that would give us this vector z2; and then a2 is g of z2, and just to be clear, z2 here is a three-dimensional vector and a2 is also a three-dimensional vector, and so this activation function g applies the sigmoid function element-wise to each of z2's elements. And by the way, to make our notation a little more consistent with what we'll do later, for this input layer we have the inputs x, but we can also think of x as the activations of the first layer. So if I define a1 to be equal to x, then a1 is a vector, and I can take this x here and replace it, writing z2 equals Theta 1 times a1, just by defining a1 to be the activations of my input layer.
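As a concrete illustration of the two vectorized steps just described, here is a minimal NumPy sketch. The specific numbers in Theta1 and the input vector are made up purely for illustration; only the structure of the computation, z2 = Theta1 times a1 followed by an element-wise sigmoid, matches the lecture.

```python
import numpy as np

def g(z):
    """Sigmoid activation, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters: 3 hidden units; inputs are x0 (bias), x1, x2, x3.
Theta1 = np.array([[0.1,  0.3, -0.5,  0.2],
                   [0.0, -0.2,  0.4,  0.1],
                   [0.5,  0.1,  0.1, -0.3]])   # shape (3, 4)

a1 = np.array([1.0, 2.0, 0.5, -1.0])           # a1 = x, with x0 = 1 as the bias unit

z2 = Theta1 @ a1    # z(2) = Theta(1) * a(1), a three-dimensional vector
a2 = g(z2)          # hidden-layer activations a(2)1, a(2)2, a(2)3
```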
Now, with what I've written so far, I've gotten myself the values for a1, a2, a3, and really I should put the superscripts there as well. But I need one more value: I also want this a(2)0, which corresponds to a bias unit in the hidden layer that feeds into the output there. Of course, there was a bias unit in the input layer too that, you know, I just didn't draw in here. To take care of this extra bias unit, what we're going to do is add an extra a(2)0 that's equal to 1, and after taking this step we now have that a2 is going to be a four-dimensional feature vector, because we just added this extra a(2)0 equal to 1 corresponding to the bias unit in the hidden layer. And finally, to compute the actual output of our hypothesis, we then simply need to compute z3. So z3 is equal to this term here that I'm underlining; this inner term there is z3, and z3 is Theta 2 times a2. And finally, my hypothesis output h of x is a3, that is, the activation of my one and only unit in the output layer, so that's just a real number. You can write it as a3 or as a(3)1, and it's g of z3.

This process of computing h of x is also called forward propagation, and it's called that because we start off with the activations of the input units, then we forward-propagate those to the hidden layer and compute the activations of the hidden layer, and then we forward-propagate those and compute the activations of the output layer. This process of computing the activations from the input layer to the hidden layer to the output layer is called forward propagation, and what we just did is work out a vectorized implementation of this procedure. So if you implement it using these equations that we have on the right, that gives you an efficient way of computing h of x.

This forward propagation view also helps us understand what neural networks might be doing and why they might help us learn interesting nonlinear hypotheses. Consider the following neural network, and let's say I cover up the left part of this picture for now. If you look at what's left in this picture, this looks a lot like logistic regression, where what we're doing is using that node, which is just a logistic regression unit, to make a prediction h of x. Concretely, what the hypothesis is outputting is: h of x is going to be equal to g, which is my sigmoid activation function, applied to Theta 0 times a0 (with a0 equal to 1) plus Theta 1 times a1 plus Theta 2 times a2 plus Theta 3 times a3, where the values a1, a2, a3 are those given by these three hidden units.

Now, to be consistent with my earlier notation, we'd actually need to, you know, fill in these superscript 2s here everywhere, and I also have a subscript 1 there because I have only one output unit. But if you focus on the blue parts of the notation, this looks awfully like the standard logistic regression model, except that I now have a capital Theta instead of a lowercase theta. And what this is doing is just logistic regression, but where the features fed into logistic regression are these values computed by the hidden layer. Just to say that again: what this neural network is doing is just like logistic regression, except that rather than using the original features x1, x2, x3, it is using these new features a1, a2, a3. Again, we'd put the superscripts there, you know, to be consistent with the notation. And the cool thing about this is that the features a1, a2, a3 are themselves learned as functions of the input. Concretely, the function mapping from layer 1 to layer 2 is determined by some other set of parameters, Theta 1.
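Putting the whole forward pass together, here is a hedged sketch of the vectorized procedure described above, including both bias units. The function name forward_propagate and the random example parameters are my own, added only to make the steps concrete; the last two lines are exactly the "logistic regression on learned features a2" view from the lecture.

```python
import numpy as np

def g(z):
    """Sigmoid activation, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Theta1, Theta2):
    """One forward pass through a network with a single hidden layer."""
    a1 = np.concatenate(([1.0], x))    # input activations, with bias unit x0 = 1
    z2 = Theta1 @ a1                   # z(2) = Theta(1) * a(1)
    a2 = g(z2)                         # hidden-layer activations
    a2 = np.concatenate(([1.0], a2))   # add the hidden-layer bias unit a(2)0 = 1
    z3 = Theta2 @ a2                   # z(3) = Theta(2) * a(2)
    a3 = g(z3)                         # h(x): logistic regression on the learned features a2
    return a3

# Example: 3 raw features, 3 hidden units, 1 output unit (random weights for illustration).
x = np.array([2.0, 0.5, -1.0])
Theta1 = np.random.randn(3, 4)   # maps layer 1 (3 inputs + bias) to 3 hidden units
Theta2 = np.random.randn(1, 4)   # maps layer 2 (3 hidden units + bias) to the output
print(forward_propagate(x, Theta1, Theta2))
```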
So it's as if the neural network, instead of being constrained to feed the features x1, x2, x3 into logistic regression, gets to learn its own features a1, a2, a3 to feed into the logistic regression. And as you can imagine, depending on what parameters it chooses for Theta 1, it can learn some pretty interesting and complex features, and therefore you can end up with a better hypothesis than if you were constrained to use the raw features x1, x2, x3, or if you were constrained to choose, say, polynomial terms of x1, x2, x3, and so on. Instead, this algorithm has the flexibility to try to learn whatever features it wants, using these a1, a2, a3, in order to feed into this last unit, which is essentially a logistic regression unit.

I realize this example is described at a somewhat high level, so I'm not sure if this intuition of the neural network, you know, having more complex features will quite make sense yet. But if it doesn't yet, in the next two videos I'm going to go through a specific example of how a neural network can use this hidden layer to compute more complex features to feed into the final output layer, and how that can learn more complex hypotheses. So, in case what I'm saying here doesn't quite make sense, stick with me for the next two videos, and hopefully after working through those examples this explanation will make a little bit more sense.

But just to point out: you can have neural networks with other types of diagrams as well, and the way that the neurons in a neural network are connected is called the architecture. So the term architecture refers to how the different neurons are connected to each other. This is an example of a different neural network architecture, and once again you may be able to get the intuition of how the second layer, where we have three hidden units, computes some complex function, maybe of the input layer; then the third layer can take the second layer's features and compute even more complex features, so that by the time you get to the output layer, layer four, you can have even more complex features than what you were able to compute in layer three, and so get very interesting nonlinear hypotheses. By the way, in a network like this, layer one is called the input layer, layer four is still our output layer, and this network has two hidden layers. So anything that's not an input layer or an output layer is called a hidden layer.

So, hopefully from this video you've gotten a sense of how the forward propagation step in a neural network works, where you start from the activations of the input layer and forward-propagate those to the first hidden layer, then the second hidden layer, and then finally the output layer. And you also saw how we can vectorize that computation. I realize that some of the intuitions in this video, about how, you know, the later layers are computing complex features of the earlier layers, may still be slightly abstract and kind of high level. So what I'd like to do in the next two videos is work through a detailed example of how a neural network can be used to compute nonlinear functions of the input, and I hope that will give you a good sense of the sorts of complex nonlinear hypotheses we can get out of neural networks.
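To make the multi-layer architecture discussed above concrete, here is a hedged sketch of forward propagation through an arbitrary number of layers: each layer's activations, with a bias unit prepended, become the inputs to the next layer. The function name and the specific layer sizes (3 inputs, 3 units in layer 2, 2 units in layer 3, 1 output) are assumptions made for illustration, not code or dimensions given in the course.

```python
import numpy as np

def g(z):
    """Sigmoid activation, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate_deep(x, thetas):
    """Forward propagation through any number of layers.

    `thetas` is a list of weight matrices [Theta1, Theta2, ...]; each layer's
    activations (with a bias unit prepended) feed the next layer.
    """
    a = np.asarray(x, dtype=float)
    for Theta in thetas:
        a = np.concatenate(([1.0], a))   # prepend this layer's bias unit
        a = g(Theta @ a)                 # compute the next layer's activations
    return a                             # output-layer activations, h(x)

# Assumed four-layer example: 3 inputs -> 3 hidden -> 2 hidden -> 1 output.
thetas = [np.random.randn(3, 4), np.random.randn(2, 4), np.random.randn(1, 3)]
print(forward_propagate_deep([2.0, 0.5, -1.0], thetas))
```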