Neural networks are one of the most powerful learning algorithms that we have today. In this and the next few videos, I'd like to start talking about a learning algorithm for fitting the parameters of a neural network given a training set. As with the discussion of most learning algorithms, we're going to begin by talking about the cost function for fitting the parameters of the network. I'm going to focus on the application of neural networks to classification problems.

So, suppose we have a network like that shown on the left, and suppose we have a training set of m training examples of $(x^{(i)}, y^{(i)})$ pairs. I'm going to use capital L to denote the total number of layers in the network, so for the network shown on the left we would have L = 4. And I'm going to use $s_l$ to denote the number of units, that is the number of neurons, not counting the bias unit, in layer l of the network. So, for example, we would have $s_1 = 3$ units in the input layer, $s_2 = 5$ units in my example, and the output layer $s_4$, which also equals $s_L$ because capital L is equal to 4, has 4 units.

We're going to consider two types of classification problems. The first is binary classification, where the labels y are either zero or one. In this case we would have one output unit: the neural network on top has four output units, but with binary classification we would have only a single output unit that computes h(x), and the output of the neural network would be a real number. So the number of units in the output layer, $s_L$, where L is again the index of the final layer, is going to be equal to one. To simplify notation later, I'm also going to set K = 1 in this case, so you can think of K as denoting the number of units in the output layer.

The second type of classification problem we'll consider is the multiclass classification problem, where we may have K distinct classes. In our earlier example, we had this representation for y if we have four classes; in that case we would have capital K output units, and our hypothesis will output vectors that are K dimensional. We will usually have K greater than or equal to three here, because if we had only two classes we wouldn't need the one-versus-all method; with just two classes a single output unit suffices.

Now, let's define the cost function for our neural network. The cost function we use is going to be a generalization of the one we used for logistic regression. For logistic regression, we minimized the cost function

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2,$$

where the regularization term is a sum from j = 1 through n, because we did not regularize the bias term $\theta_0$. For a neural network, our cost function is going to be a generalization of this: instead of having just one logistic regression output unit, we may have K of them.
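To make this concrete before we generalize, here is a minimal NumPy sketch of that regularized logistic regression cost. This isn't code from the lecture; the function name logistic_cost, the argument lam for $\lambda$, and the convention that X carries a leading column of ones are my own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, lam):
    """Regularized logistic regression cost J(theta).

    X is (m, n+1) with a leading column of ones, theta is (n+1,),
    y is (m,) of 0/1 labels, and lam is the regularization parameter.
    """
    m = len(y)
    h = sigmoid(X @ theta)                          # h_theta(x) for every example
    cost = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)  # j runs 1..n: theta_0 is not regularized
    return cost + reg
```

Note that theta[0] is excluded from the regularization sum, matching the convention described above.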
So here's our cost function. The neural network now outputs vectors in $\mathbb{R}^K$, where K might be equal to 1 if we have a binary classification problem. I'm going to use the notation $(h_\Theta(x))_i$ to denote the ith output; that is, $h_\Theta(x)$ is a K-dimensional vector, and the subscript i just selects the ith element of the vector output by my neural network. My cost function $J(\Theta)$ is now going to be the following:

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \Big[ y_k^{(i)} \log\big(h_\Theta(x^{(i)})\big)_k + \big(1 - y_k^{(i)}\big) \log\Big(1 - \big(h_\Theta(x^{(i)})\big)_k\Big) \Big] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{ji}^{(l)}\big)^2$$

This is minus one over m of a sum of a term similar to what we had in logistic regression, except that we also have a sum from k = 1 through K. That summation is basically a sum over my K output units: if the final layer of my neural network has four output units, then this is a sum from k = 1 through 4 of, basically, the logistic regression cost function, summed over each of my four output units in turn. You'll notice in particular that it applies to $y_k$ and $(h_\Theta(x))_k$, because we're basically taking the kth output unit and comparing it to the value of $y_k$, which is one of those 0/1 vectors indicating which class the example belongs to.

Finally, the second term here is the regularization term, similar to what we had for logistic regression. This summation term looks really complicated, but all it's doing is summing over the terms $\Theta_{ji}^{(l)}$ for all values of i, j, and l, except that we don't sum over the terms corresponding to the bias values, just as we didn't for logistic regression. Concretely, we don't sum over the terms where i is equal to zero. That's because when we compute the activation of a neuron, we have terms like $\Theta_{i0} a_0 + \Theta_{i1} a_1 + \dots$ (with x's there instead if this is the first hidden layer), and the terms with a zero in that position multiply into an $x_0$ or an $a_0$, which is kind of like a bias unit. By analogy to what we did for logistic regression, we won't sum over those terms in the regularization term, because we don't want to regularize them and shrink their values toward zero. But this is just one possible convention; even if you were to sum over i from 0 up to $s_l$, it would work about the same, and it doesn't make a big difference. The convention of not regularizing the bias terms is perhaps just slightly more common.

So, that's the cost function we're going to use to fit our neural network. In the next video, we'll start to talk about an algorithm for trying to optimize the cost function.
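As a concrete illustration, here is a short NumPy sketch of this cost function. It follows the formula above but makes assumptions of my own about data layout: Thetas is a list of weight matrices where $\Theta^{(l)}$ has shape $(s_{l+1}, s_l + 1)$ and column 0 multiplies the bias unit, and Y holds the 0/1 label vectors as rows. This is a sketch, not code from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Thetas, X, Y, lam):
    """Neural network cost J(Theta) for K output units.

    Thetas is a list of weight matrices, one per layer transition;
    Thetas[l] has shape (s_{l+1}, s_l + 1), with column 0 multiplying
    the bias unit. X is (m, n); Y is (m, K) of one-hot 0/1 rows.
    """
    m = X.shape[0]

    # Forward propagation: prepend the bias unit a_0 = 1 at each layer.
    A = X
    for Theta in Thetas:
        A = np.hstack([np.ones((m, 1)), A])
        A = sigmoid(A @ Theta.T)
    H = A                                            # (m, K) network outputs

    # Sum the logistic cost over all K output units and all m examples.
    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m

    # Regularization: sum of squared weights, skipping column 0 (the bias terms).
    reg = (lam / (2 * m)) * sum(np.sum(Theta[:, 1:] ** 2) for Theta in Thetas)
    return cost + reg
```

For example, for a four-layer network with layer sizes (3, 5, 5, 4), Thetas would hold three matrices of shapes 5×4, 5×6, and 4×6, and the regularization sum would skip the first column of each, matching the convention of not regularizing the bias terms.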