Neural networks are one of the most powerful learning algorithms that we have today. In this and the next few videos, I'd like to start talking about a learning algorithm for fitting the parameters of a neural network given a training set. As with the discussion of most learning algorithms, we're going to begin by talking about the cost function for fitting the parameters of the network. I'm going to focus on the application of neural networks to classification problems.

So, suppose we have a network like that shown on the left, and suppose we have a training set of m training examples of $(x^{(i)}, y^{(i)})$ pairs. I'm going to use capital L to denote the total number of layers in the network, so for the network shown on the left we would have L = 4. And I'm going to use $s_l$ to denote the number of units, that is the number of neurons, not counting the bias unit, in layer l of the network. So, for example, we would have $s_1 = 3$ units in the input layer, $s_2 = 5$ units in my example, and the output layer $s_4$, which also equals $s_L$ because capital L is equal to 4, has 4 units.

We're going to consider two types of classification problems. The first is binary classification, where the labels y are either zero or one. In this case we would have one output unit: the neural network on top has four output units, but with binary classification we would have only a single output unit that computes h(x), and the output of the neural network would be a real number. So the number of units in the output layer, $s_L$, where L is again the index of the final layer, is going to be equal to one. To simplify notation later, I'm also going to set K = 1 in this case, so you can think of K as denoting the number of units in the output layer.

The second type of classification problem we'll consider is the multiclass classification problem, where we may have K distinct classes. In our earlier example, we had this representation for y if we have four classes; in that case we would have capital K output units, and our hypothesis will output vectors that are K dimensional. We will usually have K greater than or equal to three here, because if we had only two classes we wouldn't need the one-versus-all method; with just two classes a single output unit suffices.

Now, let's define the cost function for our neural network. The cost function we use is going to be a generalization of the one we used for logistic regression. For logistic regression, we minimized the cost function

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2,$$

where the regularization term is a sum from j = 1 through n, because we did not regularize the bias term $\theta_0$. For a neural network, our cost function is going to be a generalization of this: instead of having just one logistic regression output unit, we may have K of them.
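To make this concrete before we generalize, here is a minimal NumPy sketch of that regularized logistic regression cost. This isn't code from the lecture; the function name logistic_cost, the argument lam for $\lambda$, and the convention that X carries a leading column of ones are my own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, lam):
    """Regularized logistic regression cost J(theta).

    X is (m, n+1) with a leading column of ones, theta is (n+1,),
    y is (m,) of 0/1 labels, and lam is the regularization parameter.
    """
    m = len(y)
    h = sigmoid(X @ theta)                          # h_theta(x) for every example
    cost = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)  # j runs 1..n: theta_0 is not regularized
    return cost + reg
```

Note that theta[0] is excluded from the regularization sum, matching the convention described above.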
So here's our cost function. The neural network now outputs vectors in $\mathbb{R}^K$, where K might be equal to 1 if we have a binary classification problem. I'm going to use the notation $(h_\Theta(x))_i$ to denote the ith output; that is, $h_\Theta(x)$ is a K-dimensional vector, and the subscript i just selects the ith element of the vector output by my neural network. My cost function $J(\Theta)$ is now going to be the following:

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \Big[ y_k^{(i)} \log\big(h_\Theta(x^{(i)})\big)_k + \big(1 - y_k^{(i)}\big) \log\Big(1 - \big(h_\Theta(x^{(i)})\big)_k\Big) \Big] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{ji}^{(l)}\big)^2$$

This is minus one over m of a sum of a term similar to what we had in logistic regression, except that we also have a sum from k = 1 through K. That summation is basically a sum over my K output units: if the final layer of my neural network has four output units, then this is a sum from k = 1 through 4 of, basically, the logistic regression cost function, summed over each of my four output units in turn. You'll notice in particular that it applies to $y_k$ and $(h_\Theta(x))_k$, because we're basically taking the kth output unit and comparing it to the value of $y_k$, which is one of those 0/1 vectors indicating which class the example belongs to.

Finally, the second term here is the regularization term, similar to what we had for logistic regression. This summation term looks really complicated, but all it's doing is summing over the terms $\Theta_{ji}^{(l)}$ for all values of i, j, and l, except that we don't sum over the terms corresponding to the bias values, just as we didn't for logistic regression. Concretely, we don't sum over the terms where i is equal to zero. That's because when we compute the activation of a neuron, we have terms like $\Theta_{i0} a_0 + \Theta_{i1} a_1 + \dots$ (with x's there instead if this is the first hidden layer), and the terms with a zero in that position multiply into an $x_0$ or an $a_0$, which is kind of like a bias unit. By analogy to what we did for logistic regression, we won't sum over those terms in the regularization term, because we don't want to regularize them and shrink their values toward zero. But this is just one possible convention; even if you were to sum over i from 0 up to $s_l$, it would work about the same, and it doesn't make a big difference. The convention of not regularizing the bias terms is perhaps just slightly more common.

So, that's the cost function we're going to use to fit our neural network. In the next video, we'll start to talk about an algorithm for trying to optimize the cost function.
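As a concrete illustration, here is a short NumPy sketch of this cost function. It follows the formula above but makes assumptions of my own about data layout: Thetas is a list of weight matrices where $\Theta^{(l)}$ has shape $(s_{l+1}, s_l + 1)$ and column 0 multiplies the bias unit, and Y holds the 0/1 label vectors as rows. This is a sketch, not code from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Thetas, X, Y, lam):
    """Neural network cost J(Theta) for K output units.

    Thetas is a list of weight matrices, one per layer transition;
    Thetas[l] has shape (s_{l+1}, s_l + 1), with column 0 multiplying
    the bias unit. X is (m, n); Y is (m, K) of one-hot 0/1 rows.
    """
    m = X.shape[0]

    # Forward propagation: prepend the bias unit a_0 = 1 at each layer.
    A = X
    for Theta in Thetas:
        A = np.hstack([np.ones((m, 1)), A])
        A = sigmoid(A @ Theta.T)
    H = A                                            # (m, K) network outputs

    # Sum the logistic cost over all K output units and all m examples.
    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m

    # Regularization: sum of squared weights, skipping column 0 (the bias terms).
    reg = (lam / (2 * m)) * sum(np.sum(Theta[:, 1:] ** 2) for Theta in Thetas)
    return cost + reg
```

For example, for a four-layer network with layer sizes (3, 5, 5, 4), Thetas would hold three matrices of shapes 5×4, 5×6, and 4×6, and the regularization sum would skip the first column of each, matching the convention of not regularizing the bias terms.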