We can continue our discussion with StackNet. StackNet is a scalable meta-modeling methodology that utilizes stacking to combine multiple models in a neural-network-like architecture of multiple levels. It is scalable because, within the same level, we can run all the models in parallel. It utilizes stacking because it makes use of the technique we mentioned before, where we split the data, make predictions on some hold-out data, and then use another model to train on those predictions (there is a small code sketch of this mechanic at the end of this discussion). And as we will see later on, this resembles a neural network a lot.

Now let us continue the naive example we gave before with the students and the teacher, in order to understand what, conceptually, it would mean in the real world to add another layer. In that example we had a teacher who was trying to combine the answers of different students, and she was outputting an estimate of 17 under certain assumptions. We can make this example more interesting by introducing one more meta-learner; let's call him Mr. RF, who is also a physics teacher. Mr. RF believes that LR should have a bigger contribution to the ensemble, because he has been doing private lessons with him and knows he couldn't be that far off. So he is able to see the data in slightly different ways, capitalize on different parts of these predictions, and make a different estimate. Whereas the teachers could work it out between themselves and take an average, we can instead introduce a higher authority, another layer of modeling: let's call him the headmaster, GBM, in order to make better predictions. GBM doesn't need to know the answers the students have given; the only thing he needs is the input from the teachers. And in this case he is more keen to trust his physics teacher, outputting a prediction of 16.2.

Why would this be of any use to people? Isn't it already complicated? Why would we ever want to try something so complicated? I'll give you the example of a competition where my team used four layers of stacking in order to win. We used two different sources of input data, we generated multiple models, mainly XGBoost and logistic regressions, and then we fed those into a four-layer architecture in order to get the top score. And although we could have escaped without using that fourth layer, we still needed up to level three in order to win. So you can understand the usefulness of deploying deep stacking. Another example is the Homesite competition, organized by Homesite Insurance, where again we created many different views of the data, so we had different transformations, we generated many models, and we fed those models into a three-level architecture. I think we didn't need the third layer there either; we probably could have escaped with only two levels, but again, deep stacking was necessary in order to win. So there is your answer: deep stacking on multiple levels really helps you win competitions.

In the spirit of fairness and openness, there has been some criticism of large ensembles: that maybe they don't have commercial value, that they are computationally expensive. I have to add three things on that. The first is that what is considered expensive today may not be expensive tomorrow. We have seen this, for example, with deep learning, where with the advent of GPUs training became a hundred times faster and deep learning has again become very, very popular. The second is that you don't always need to build very, very deep ensembles; even small ensembles can really help, so knowing how to do them can add value to businesses, again depending on different assumptions: how fast they need decisions, how much uplift you get from stacking (which may vary, sometimes more, sometimes less), and generally how much computing power they have. In some cases we can make an argument that even stacking on multiple layers is very useful. And the last point is that these are predictive modeling competitions, so it is a bit like the Olympics: it is nice to be able to see the theoretical best you can get, because this is how innovation takes over, this is how we move forward.
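To make that hold-out stacking mechanic concrete, here is a minimal sketch, assuming scikit-learn-style estimators. The data, the choice of base models, and the meta-model are purely illustrative stand-ins for the teachers and the headmaster, not the actual competition setups mentioned above.

```python
# A minimal sketch of the hold-out stacking described above, assuming
# scikit-learn-style estimators. All names and data here are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# toy data standing in for a real problem
X, y = np.random.rand(1000, 10), np.random.rand(1000)
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.5, random_state=0)

# level 0 ("the teachers", e.g. LR and Mr. RF): fit on the first part of the data
level0 = [LinearRegression(), RandomForestRegressor(random_state=0)]
for model in level0:
    model.fit(X_train, y_train)

# their predictions on the hold-out part become the meta features
meta_features = np.column_stack([model.predict(X_holdout) for model in level0])

# level 1 ("the headmaster", GBM): trained only on the teachers' predictions,
# never on the original inputs
meta_model = GradientBoostingRegressor(random_state=0)
meta_model.fit(meta_features, y_holdout)
```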
We can express StackNet as a neural network. Normally, in a neural network, we have an architecture of hidden units, each connected to the inputs in the form of a linear regression. So it actually looks pretty much like a linear regression: you have a set of coefficients plus a constant value, which in neural networks is called the bias, and this is how each hidden unit makes its prediction; the hidden units are then collected together to create the output. The concept of StackNet is not that much different. The only thing we want is not to be limited to that linear regression, to that perceptron; we want to be able to use any machine learning algorithm. Putting that aside, the architecture is fairly similar.

So how do we train this? In a typical neural network we use backpropagation. Here, in this context, that is not feasible, I mean in the context of trying to make this network work with any input model, because not all models are differentiable. This is why we use stacking. Stacking here is the way to link the output of a node, its prediction, with the target variable; it is also how the link is made from the input features to a node.

However, if you remember, the way stacking works is that you have some training data and you need to divide it into two parts. You use the first part, called train, to build a model and make predictions on the other part, called valid. Assuming that adding more layers gives us some uplift, if we wanted to do this again we would have to re-split the valid data into two parts, let's call them mini-train and mini-valid, and you can see the problem here. If we have really big data, this may not be an issue, but in situations where we don't have that much data, constantly re-splitting keeps shrinking the training data set. Ideally, we would like to avoid this, and that is why we use the K-fold paradigm.

Let's assume we have a training data set with four features, x0, x1, x2, x3, and the y variable, or target. If we use K-fold with K = 4 (K is a hyperparameter, so it is up to us what to put here), we would make four different parts out of this data set. Here I have given a different color to each of these parts. What we would do then, in order to commence training, is create a prediction vector that has the same number of rows as the training data but is for now empty. Then, for each of the folds, we take a subset of the training data; in this case we start with the red, yellow, and green parts. We train a model, then we take the blue part and make predictions on it, and we put these predictions in the corresponding locations of the prediction array, which was empty.
Now we are going to repeat the same process, always using this rotation. So we are now going to use the blue, the yellow, and the green parts to create a model, and we will keep the red part for prediction. Again, we take these predictions and put them into the corresponding part of the prediction array, and we repeat again with the yellow and then the green parts. Something I need to mention is that the K-fold split doesn't need to be sequential, as by date here; it would normally be shuffled. I did it this way in order to illustrate it better. Once we have finished and have generated predictions for the whole training data, we can use the whole training data to fit one last model and make predictions for the test data. Another way we could have done this is, for each of the four models, while making predictions for the held-out part, to also make predictions for the whole test data; after the four models we just take an average at the end, dividing the summed test predictions by four. I have found that the averaging way works better with neural networks, and the method where you use the whole training data to generate the test predictions works better with tree-based methods.

So, once we have finished the predictions for the test data, we can start again with another model. We generate another empty prediction vector, fill it in the same way, and stack it next to the previous one. We essentially repeat this until we have finished with all the models for the same layer. Then this becomes our new training data set, and we begin all over again if we have a new layer. This is generally the concept: in order to extend to many layers, we use this K-fold paradigm.

However, in neural networks we normally have the notion of epochs: iterations which help us recalibrate the weights between the nodes. Here we don't have this option, given the way stacking works, but we can introduce this ability to revisit the initial data through connections. The typical way to connect the nodes is the one we have already explored, where each node is connected only to the nodes of the directly previous layer. Another way is to say that a node is connected not only to the nodes of the directly previous layer, but to all nodes of any previous layer. To illustrate this better, if you remember the example with the headmaster who was using the predictions from the teachers, he could also have been using the predictions from the students at the same time. This actually can work quite well. You can also re-feed the initial data: not just the predictions, you can take your initial X data set and append it to your predictions. This can work really well if you haven't made many models, because that way you get the chance to revisit the initial training data and try to capture more information, and because meta-models are already present, the model tries to focus wherever it can extract any new information. So in this kind of situation it works quite well.
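As a rough illustration of the procedure just described, here is a sketch of one stacking layer built with out-of-fold predictions, assuming scikit-learn-style estimators. The function name, the choice of models, and the append_original option (the appending of the original features mentioned above) are illustrative assumptions, not the actual StackNet implementation.

```python
# A sketch of one stacking layer using the K-fold paradigm described above.
# Assumes scikit-learn-style estimators; names and models are illustrative.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

def stacking_layer(models, X_train, y_train, X_test, n_folds=4, append_original=False):
    """Out-of-fold predictions for train, full refit for test, for one layer."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)  # shuffled folds
    train_meta = np.zeros((X_train.shape[0], len(models)))  # the "empty" prediction array
    test_meta = np.zeros((X_test.shape[0], len(models)))

    for j, model in enumerate(models):
        # rotate over the folds: train on K-1 parts, predict the held-out part
        for train_idx, valid_idx in kf.split(X_train):
            model.fit(X_train[train_idx], y_train[train_idx])
            train_meta[valid_idx, j] = model.predict(X_train[valid_idx])
        # once predictions cover the whole training set, refit on all of it
        # and predict the test data (the tree-friendly variant described above)
        model.fit(X_train, y_train)
        test_meta[:, j] = model.predict(X_test)

    if append_original:
        # append the original features next to the predictions
        train_meta = np.hstack([train_meta, X_train])
        test_meta = np.hstack([test_meta, X_test])
    return train_meta, test_meta

# one possible two-layer setup on toy data
X_tr, y_tr = np.random.rand(800, 4), np.random.rand(800)
X_te = np.random.rand(200, 4)
layer1 = [Ridge(), RandomForestRegressor(random_state=0), GradientBoostingRegressor(random_state=0)]
meta_train, meta_test = stacking_layer(layer1, X_tr, y_tr, X_te, append_original=True)
final_model = Ridge().fit(meta_train, y_tr)       # the next layer trains on the stacked predictions
test_predictions = final_model.predict(meta_test)
```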
Also, this is very similar to the target encoding, or mean encoding, you have seen before, where you use some part of the data, let's say to encode a categorical column: given some cross-validation scheme, you generate hold-out estimates of the target variable and then insert them into your training data. Okay, you don't stack them, as in you don't create a new column; essentially you replace one column with hold-out estimates of your target variable, but it is very similar. You have created an estimate of the target variable and you are essentially inserting it into your training data, which is the same idea.
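For comparison, here is a minimal sketch of that out-of-fold mean encoding, assuming a pandas DataFrame; the column names are made up for the example. The rotation over folds is exactly the same as in the stacking sketch above, only the "model" is a per-category mean of the target.

```python
# A minimal sketch of out-of-fold mean (target) encoding, to show the parallel
# with stacking. Assumes pandas; the column names are illustrative.
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "category": np.random.choice(["a", "b", "c"], size=100),
    "target": np.random.rand(100),
})

encoded = pd.Series(np.nan, index=df.index)  # hold-out estimates of the target
kf = KFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, valid_idx in kf.split(df):
    # category means computed on the training folds only, applied to the
    # held-out fold, just like out-of-fold predictions in stacking
    fold_means = df.iloc[train_idx].groupby("category")["target"].mean()
    encoded.iloc[valid_idx] = df.iloc[valid_idx]["category"].map(fold_means).values

# replace the categorical column with its hold-out encoding
df["category"] = encoded.fillna(df["target"].mean())
```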