So, it's taken us a lot of videos to get through the neural network learning algorithm. In this video, what I'd like to do is put all the pieces together, to give an overall summary, a bigger-picture view, of how the pieces fit together and of the overall process of implementing a neural network learning algorithm.

When training a neural network, the first thing you need to do is pick some network architecture, and by architecture I just mean the connectivity pattern between the neurons. So we might choose between, say, a neural network with three input units, five hidden units, and four output units; one with three input units, two hidden layers of five units each, and four output units; or one with three hidden layers of five units each and four output units. These choices of how many hidden units in each layer, and how many hidden layers, are architecture choices.

So, how do you make these choices? Well, first, the number of input units is pretty well defined: once you decide on the fixed set of features x, the number of input units is just the dimension of your features x(i). And if you are doing multiclass classification, the number of output units is determined by the number of classes in your classification problem. Just a reminder: if you have a multiclass classification problem where y takes on, say, values between 1 and 10, so that you have ten possible classes, then remember to rewrite your outputs y as vectors. So instead of class one, you recode it as the vector with a one in the first position and zeros everywhere else; for the second class, the vector with a one in the second position; and so on. So if one of your examples takes on the fifth class, that is, y equals 5, then what you show to your neural network is not actually the value y equals 5. Instead, at the output layer, which would have ten output units, you feed it the vector with a one in the fifth position and a bunch of zeros below it. So the choices of the number of input units and the number of output units are reasonably straightforward.
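To make that recoding concrete, here is a minimal sketch in Octave (the variable names are just for illustration, not from the lecture), assuming the labels y take integer values from 1 to 10:

```octave
% A minimal sketch of recoding integer class labels as one-hot vectors.
num_labels = 10;
y = [5; 1; 10];        % e.g. labels for three training examples

I = eye(num_labels);   % identity matrix: row k has a 1 in position k
Y = I(y, :);           % Y(i, :) is the vector fed to the output layer

% For y(1) = 5, Y(1, :) = [0 0 0 0 1 0 0 0 0 0]:
% a one in the fifth position and zeros everywhere else.
```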
As for the number of hidden units and the number of hidden layers, a reasonable default is to use a single hidden layer, so this type of neural network shown on the left, with just one hidden layer, is probably the most common. Or, if you use more than one hidden layer, a reasonable default is to have the same number of hidden units in every layer: here we have two hidden layers with five hidden units each, and here three hidden layers with five hidden units each. Either of these sorts of network architectures would be a perfectly reasonable default. As for the number of hidden units, usually the more hidden units the better; it's just that a lot of hidden units can become computationally expensive, but very often having more hidden units is a good thing. Usually the number of hidden units in each layer is comparable to the dimension of x, the number of features: anywhere from the same number of hidden units as input features up to maybe three or four times that. So having a number of hidden units that is comparable to, or somewhat bigger than, the number of input features is often a useful thing to do.

Hopefully this gives you one reasonable set of default choices for neural network architecture, and if you follow these guidelines you will probably get something that works well. But in a later set of videos, where I talk specifically about advice for how to apply these algorithms, I'll say a lot more about how to choose a neural network architecture; I actually have quite a lot I want to say later about making good choices for the number of hidden units, the number of hidden layers, and so on.

Next, here's what we need to implement in order to train a neural network. There are actually six steps; I have four on this slide and two more on the next slide. The first step is to set up the neural network and randomly initialize the values of the weights; we usually initialize the weights to small values near zero. Then we implement forward propagation, so that we can input any x to the neural network and compute h(x), which is the output vector of the y values. We then also implement code to compute the cost function J(theta). And next we implement back-prop, the back-propagation algorithm, to compute the partial derivatives of J(theta) with respect to the parameters.

Concretely, to implement back prop, we usually do it with a for-loop over the training examples. Some of you may have heard of advanced, and frankly very advanced, vectorization methods where you don't have a for-loop over the m training examples, but the first time you implement back prop there should almost certainly be a for-loop in your code, where you iterate over the examples: you do forward prop and back prop on the first example (x(1), y(1)), then in the second iteration of the for-loop you do forward propagation and back propagation on the second example, and so on until you get through the final example. So there should be a for-loop in your implementation of back prop, at least the first time you implement it. There are frankly somewhat complicated ways to do this without a for-loop, but I definitely do not recommend trying that much more complicated version the first time you implement back prop.

So, concretely, we have a for-loop over the m training examples, and inside the for-loop we perform forward prop and back prop using just that one example. What that means is that we take x(i) and feed it to the input layer, perform forward prop, perform back prop, and that gives us all of the activations and all of the delta terms for all of the layers and all of the units in the neural network. Then, still inside the for-loop (let me draw some curly braces to show the scope of the for-loop; this is Octave code of course, though it's really more pseudo-code, and the for-loop encompasses all of this), we compute the accumulation terms using the formula we gave earlier: Delta(l) := Delta(l) + delta(l+1) * (a(l))^T. And then finally, outside the for-loop, having computed these Delta accumulation terms, we have some other code that lets us compute the partial derivative terms, and these partial derivative terms have to take into account the regularization term lambda as well. Those formulas were given in the earlier video.
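Putting those four steps into code, here is a minimal sketch in Octave. The helper functions forwardProp and backProp, the particular layer sizes, and the variables X (the matrix of training examples), Y (the matrix of one-hot output vectors from the earlier sketch), and lambda are all illustrative assumptions, not code from the lecture:

```octave
% A minimal sketch of steps one through four; forwardProp and backProp
% are assumed helper functions, not lecture code.
epsilon_init = 0.01;                 % step 1: random initialization near zero
Theta1 = rand(5, 4)  * 2 * epsilon_init - epsilon_init;  % 3 inputs (+bias) -> 5 hidden
Theta2 = rand(10, 6) * 2 * epsilon_init - epsilon_init;  % 5 hidden (+bias) -> 10 outputs

Delta1 = zeros(size(Theta1));        % accumulators for the gradients
Delta2 = zeros(size(Theta2));
m = size(X, 1);                      % number of training examples

for i = 1:m
  % steps 2-3: feed x(i) to the input layer and compute the activations
  [a1, a2, a3] = forwardProp(X(i, :)', Theta1, Theta2);  % a3 = h(x(i))

  % step 4: back propagation gives the delta terms for this example
  [delta2, delta3] = backProp(a2, a3, Y(i, :)', Theta2);

  % accumulate: Delta(l) := Delta(l) + delta(l+1) * (a(l))^T
  Delta1 = Delta1 + delta2 * a1';
  Delta2 = Delta2 + delta3 * a2';
end

% outside the loop: partial derivatives, with the regularization term
% lambda (the bias columns are conventionally left unregularized)
Theta1_grad = Delta1 / m;
Theta1_grad(:, 2:end) = Theta1_grad(:, 2:end) + (lambda / m) * Theta1(:, 2:end);
Theta2_grad = Delta2 / m;
Theta2_grad(:, 2:end) = Theta2_grad(:, 2:end) + (lambda / m) * Theta2(:, 2:end);
```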
So, having done that, you now hopefully have code to compute these partial derivative terms. Next is step five: what I do then is use gradient checking to compare the partial derivatives computed using back propagation against the partial derivatives computed using numerical estimates of the derivatives. I do gradient checking to make sure that both of these give very similar values. Doing gradient checking reassures us that our implementation of back propagation is correct, and it is then very important that we disable gradient checking, because the gradient checking code is computationally very slow.
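To make step five concrete, here is a minimal sketch of gradient checking in Octave, assuming theta is the unrolled vector of all the network's parameters, grad is the gradient that back propagation produced, and costFunction is a handle that returns the cost J(theta); these names are illustrative, not from the lecture:

```octave
% A minimal sketch of gradient checking, assuming:
%   theta        - unrolled vector of all network parameters
%   grad         - the gradient computed by back propagation
%   costFunction - a handle @(t) returning the cost J(t)
EPSILON = 1e-4;
numgrad = zeros(size(theta));
for i = 1:numel(theta)
  thetaPlus  = theta;  thetaPlus(i)  = thetaPlus(i)  + EPSILON;
  thetaMinus = theta;  thetaMinus(i) = thetaMinus(i) - EPSILON;
  % two-sided numerical estimate of the i-th partial derivative
  numgrad(i) = (costFunction(thetaPlus) - costFunction(thetaMinus)) / (2 * EPSILON);
end
% if back prop is correct, this relative difference should be very small
disp(norm(numgrad - grad) / norm(numgrad + grad));
```

Once the two agree, remember to turn this check off before training: the loop calls the cost function twice per parameter, which is far too slow to run inside the optimization.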
And finally, step six: we use an optimization algorithm such as gradient descent, or one of the advanced optimization methods such as L-BFGS or conjugate gradient, as embodied in fminunc or other optimization routines. We use these together with back propagation; back propagation is the thing that computes the partial derivatives for us. So we know how to compute the cost function, we know how to compute the partial derivatives using back propagation, and we can use one of these optimization methods to try to minimize J(theta) as a function of the parameters theta.

By the way, for neural networks this cost function J(theta) is non-convex, so it can theoretically be susceptible to local minima, and in fact algorithms like gradient descent and the advanced optimization methods can, in theory, get stuck in local optima. But it turns out that in practice this is not usually a huge problem: even though we can't guarantee that these algorithms will find a global optimum, algorithms like gradient descent will usually do a very good job minimizing J(theta) and will get to a very good local minimum, even if it isn't the global optimum.

Finally, gradient descent for a neural network might still seem a little bit magical, so let me show one more figure to give some intuition about what gradient descent for a neural network is doing. This is similar to the figure I used earlier to explain gradient descent. We have some cost function, and a number of parameters in our neural network; here I've written down just two of the parameter values. In reality, of course, the parameters Theta1, Theta2, and so on are all matrices, so a neural network can have very high-dimensional parameters, but because of the limitations of what we can plot, I'm pretending we have only two parameters, even though we obviously have a lot more in practice.

Now, this cost function J(theta) measures how well the neural network fits the training data. If you take a point like this one, down here, where J(theta) is pretty low, that corresponds to a setting of the parameters theta where, for most of the training examples, the output of the hypothesis is pretty close to y(i), and that is what makes the cost function low. Whereas, in contrast, a point like that one corresponds to a setting where, for many training examples, the output of the neural network is far from the actual value y(i) observed in the training set. So points like that correspond to where the neural network is not fitting the training set well, whereas points with low values of the cost function correspond to where the neural network is fitting the training set well, because that is exactly what needs to be true for J(theta) to be small.

What gradient descent does is start from some random initial point, like that one over there, and repeatedly go downhill. Back propagation computes the direction of the gradient, and gradient descent takes little steps downhill until, hopefully, it gets to, in this case, a pretty good local optimum. So when you implement back propagation and use gradient descent or one of the advanced optimization methods, this picture explains what the algorithm is doing: it is trying to find a value of the parameters where the output values of the neural network closely match the values of the y(i)'s observed in your training set.

So, hopefully this gives you a better sense of how the many different pieces of neural network learning fit together. If, even after this video, you still feel like there are a lot of different pieces, and it's not entirely clear what some of them do or how they all come together, that's actually okay. Neural network learning and back propagation is a complicated algorithm. Even though I've seen the math behind back propagation for many years, and have used back propagation, I think very successfully, for many years, even today I sometimes feel like I don't have a great grasp of exactly what back propagation is doing, or what the optimization process of minimizing J(theta) looks like. This is an algorithm I feel I have a much less good handle on compared to, say, linear regression or logistic regression, which were mathematically and conceptually much simpler, much cleaner algorithms. So if you feel the same way, that's perfectly okay. But if you do implement back propagation, hopefully what you'll find is that this is one of the most powerful learning algorithms: if you implement back propagation and one of these optimization methods, you'll find that it can fit very complex, powerful, non-linear functions to your data, and this is one of the most effective learning algorithms we have today.
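To tie the whole recipe together, here is a hedged end-to-end sketch of steps one through six in Octave, assuming a nnCostFunction that returns both the cost J and the unrolled gradient (the function name and argument list are illustrative, not from the lecture):

```octave
% A hedged end-to-end sketch: train the network by minimizing J(theta)
% with fminunc, using back propagation to supply the gradient.
initial_nn_params = [Theta1(:); Theta2(:)];          % unroll the matrices

costFunction = @(p) nnCostFunction(p, X, Y, lambda); % assumed to return [J, grad]
options = optimset('GradObj', 'on', 'MaxIter', 50);  % we supply the gradient

[nn_params, cost] = fminunc(costFunction, initial_nn_params, options);

% reshape the learned parameters back into the weight matrices
Theta1 = reshape(nn_params(1:numel(Theta1)), size(Theta1));
Theta2 = reshape(nn_params(numel(Theta1)+1:end), size(Theta2));
```

A manual gradient descent update, theta := theta - alpha * grad in a loop, would work the same way; fminunc simply chooses the step sizes down the hill for you.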