Continuing our discussion of ensemble methods, the next one up is stacking. Stacking is a very, very popular form of ensembling in predictive modeling competitions, and I believe in most competitions there is some form of stacking at the end in order to boost your performance as much as you can.

Going through the definition of stacking, it essentially means making several predictions with hold-out data sets, and then collecting or stacking these predictions to form a new data set, where you can fit a new model on this newly-formed data set of predictions.

I would like to take you through a very simple, I would say naive, example to show you how, conceptually, this can work. So far we have seen that you can use previous models' predictions to affect a new model, but always in relation to the input data. This is a new concept, because we are only going to use the predictions of some models in order to make a better model. So let's see how this could work in a real-life scenario.

Let's assume we have three kids, let's name them LR, SVM, and KNN, and they argue about a physics question. Each one believes the answer to the physics question is different: the first one says 13, the second 18, the third 11. They don't know how to resolve this disagreement, so they do the honorable thing and take an average, which in this case is 14. You can almost see the kids as different models here: they take input data, in this case the physics question, they process it based on historical information, and they output an estimate, a prediction. Have they done it optimally, though?

Another way to look at this is to say there was a teacher, Miss DL, who had seen this discussion and decided to step in. While she didn't hear the question, she does know the students quite well; she knows the strengths and weaknesses of each one, and how well they have done historically on physics questions. And from the range of values they have provided, she is able to give an estimate. Let's say that in this case she knows that SVM is really good at physics, and her father works in the department of Physics of Excellence, and therefore she should have a bigger contribution to this ensemble than every other kid, so the answer is 17. This is how a meta model works: it doesn't need to know the input data, it just knows how the models have done historically, in order to find the best way to combine them. And this can work quite well in practice.
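To make this concrete, here is a minimal sketch of the difference between the kids' plain average and the teacher's weighted combination. The weights are made up purely for illustration; in real stacking they would be learned by the meta model from hold-out performance rather than chosen by hand.

```python
import numpy as np

# Answers from the three "kids" (our base models): LR, SVM, KNN
preds = np.array([13.0, 18.0, 11.0])

# The kids' solution: a plain average that treats every model equally
simple_average = preds.mean()          # 14.0

# The teacher's solution: weight each model by how good it has been
# historically (these weights are purely illustrative)
weights = np.array([1/12, 10/12, 1/12])
weighted_average = weights @ preds     # 17.0
```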
So, let's go more into the methodology of stacking. Wolpert introduced stacking in 1992 as a meta modeling technique to combine different models. It consists of several steps. First, let's assume we have a train data set, and we divide it into two parts, a training part and a validation part. Then you take the training part and you train several models, and you make predictions for the second part, the validation data set. Then you collect all these predictions, or you stack these predictions, you form a new data set, and you use it as input to a new model. Normally we call this the meta model, and the models whose predictions we stack we call base models or base learners.

If you're still confused about stacking, consider the following animation. Let's assume we have three data sets A, B, and C. In this case, A will serve the role of the training data set, B will be the validation data set, and C will be the test data set where we want to make the final predictions. They all have a similar structure: four features and one target variable we are trying to predict.

We can choose an algorithm to train a model on data set A, and then we make predictions for B and C at the same time. Now we take these predictions and put them into new data sets: we create a data set called B1 to store the predictions for the validation data, and a data set called C1 to store the predictions for the test data. Then we repeat the process with another algorithm: again we fit it on the A data set, we make predictions on B and C at the same time, and we save these predictions into the newly-formed data sets. We essentially append them, we stack them next to each other; this is where stacking takes its name. And we can continue this even further with a third algorithm: again the same, fit on A, predict on B and C, save the predictions. What we do then is take the target variable of the B data set, or the validation data set, which we already know, fit a new model on B1 with that validation target, and then make predictions for C1. And this is how we combine different models with stacking, to hopefully make better predictions for the test or unobserved data.

Let us go through a simple example in Python, in order to understand better, as in in code, how this would work. It is quite simple, so even people not very experienced with Python can hopefully follow. The main logic is that we will use two base learners on some input data, a random forest and a linear regression, and then we will try to combine their results with a meta learner, which again will be a linear regression. Let's assume we have a train data set, a target variable for this data set, and a test data set. Maybe the code seems a bit intimidating, but we will go through it step by step.

What we do initially is take the train data set and split it into two parts, so we create a training and a valid data set, and we also split the target variable, creating ytraining and yvalid. Here we split 50/50, but we could have chosen a different ratio. Then we specify our base learners: model1 is the random forest in this case, and model2 is a linear regression. We fit both models using the training data and the training target, and we make predictions for the validation data with both models; at the same time we make predictions for the test data, again with both models. We save these as preds1 and preds2, and, for the test data, test_preds1 and test_preds2. Then we collect, or stack, the predictions and create two new data sets: one for validation, which we call stacked_predictions, consisting of preds1 and preds2, and one for the test predictions, called stacked_test_predictions, where we stack test_preds1 and test_preds2. Then we specify a meta learner, let's call it meta_model, which is a linear regression, and we fit this model on the predictions made on the validation data and the target of the validation data, which was our hold-out data set all this time. And then we can generate predictions for the test data by applying this model to stacked_test_predictions. This is how it works.
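The code being described would look roughly like the sketch below. The variable names follow the walkthrough above; I'm assuming the data sets train, y, and test already exist as arrays or DataFrames, and I'm using scikit-learn's RandomForestRegressor and LinearRegression as the base and meta learners.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Split the train data and its target 50/50 into a training and a validation part
training, valid, ytraining, yvalid = train_test_split(
    train, y, test_size=0.5, random_state=1)

# Specify the two base learners
model1 = RandomForestRegressor(random_state=1)
model2 = LinearRegression()

# Fit both base learners on the training part
model1.fit(training, ytraining)
model2.fit(training, ytraining)

# Predictions for the validation part ...
preds1 = model1.predict(valid)
preds2 = model2.predict(valid)

# ... and for the test data
test_preds1 = model1.predict(test)
test_preds2 = model2.predict(test)

# Stack the predictions column by column to form the new data sets
stacked_predictions = np.column_stack((preds1, preds2))
stacked_test_predictions = np.column_stack((test_preds1, test_preds2))

# Fit the meta learner on the stacked validation predictions and the validation target
meta_model = LinearRegression()
meta_model.fit(stacked_predictions, yvalid)

# Final predictions for the test data
final_predictions = meta_model.predict(stacked_test_predictions)
```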
Now, I think this is a good time to revisit an old example we used in the first session, about simple averaging. If you remember, we had one prediction that was doing quite well at predicting age when the age was less than 50, and another prediction that was doing quite well when the age was more than 50. And we did something tricky: we said if the age is less than 50 we'll use the first one, and if the age is more than 50 we'll use the other one. The reason this is tricky is that we used the target information to make this decision, when in the real world this is exactly what you are trying to predict and you don't know it. We did it in order to show the theoretical best we could get.

Taking the same predictions and applying stacking, this is what the end result actually looks like. As you can see, it has done pretty similarly. The only area where there is some error is around the threshold of 50, and that makes sense: because the meta model doesn't see the target variable, it is not able to identify this cut-off of 50 exactly. It tries to do it based only on the input models, and there is some overlap around this area. But you can see that stacking is able to identify this relationship and use it in order to make better predictions.

There are certain things you need to be mindful of when using stacking. One is when you have time-sensitive data, as in, let's say, time series: you need to formulate your stacking so that you respect time. What I mean is, when you create your train and validation data, you need to make certain that your train is in the past and your validation is in the future, and ideally your test data is also in the future. You need to respect this time element in order to make certain your model generalizes well.

The other thing to look at is that, obviously, single-model performance is important, but what is also very important is model diversity, how different the models are from each other. What new information does each model bring to the table? Because stacking, depending on the algorithms you use for it, can go quite deep into exploring relationships, it will find where a model is good and where a model is actually bad or fairly weak. So you don't need to worry too much about making all the models really strong; stacking can extract the juice from each prediction. Therefore, what you really need to focus on is: am I making a model that brings some new information, even though it is generally weak? And this is true; there have been many situations where I've had some quite weak models in my ensemble, I mean compared to the top performance, and nevertheless they were adding lots of value in stacking, because they were bringing in new information that the meta model could leverage.

Normally you introduce diversity in two ways. One is by choosing a different algorithm, which makes sense, because different algorithms capitalize on different relationships within the data. For example, a linear model will focus on a linear relationship, while a non-linear model can better capture non-linear relationships, so their predictions will come out a bit different. The other is that you can run the same model, but on different transformations of the input data, either fewer features or a completely different transformation. For example, in one data set you may treat categorical features with one-hot encoding; in another you may just use label encoding, and the result will probably be a model that is very different.
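As a small illustration of that second source of diversity, here is a sketch of the same algorithm trained on two different encodings of a categorical feature. The toy data and column names are invented for the example; the point is only that the two encodings give the model different views of the same information, and so produce two different base models to stack.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# A toy data set with one categorical feature (values are made up)
df = pd.DataFrame({"city": ["london", "paris", "rome", "paris", "london"],
                   "target": [10.0, 12.0, 9.0, 13.0, 11.0]})

# View 1: one-hot encoding - one binary column per category
X_onehot = pd.get_dummies(df[["city"]])

# View 2: label encoding - a single integer column of category codes
X_label = df[["city"]].apply(lambda col: col.astype("category").cat.codes)

# The same algorithm trained on the two views gives two different base models
model_a = LinearRegression().fit(X_onehot, df["target"])
model_b = LinearRegression().fit(X_label, df["target"])
```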
Generally, there is no limit to how many models you can stack, but you can expect a plateau after a certain number of models have been added. Initially, you will see a significant uplift in whatever metric you are testing on every time you add a model, but after some point the incremental uplift becomes fairly small. There is no way to know beforehand exactly at what number of models you will start plateauing; generally it is affected by how many features you have in your data, how much diversity you managed to introduce into your models, and quite often how many rows of data you have. So it is tough to know this in advance, but it is something to be mindful of: there is a point where adding more models does not add that much value.

And because the meta model only uses the predictions of other models, we can assume that those models have already done the deep work of scrutinizing the data, and therefore the meta model doesn't need to be so deep. Normally you have predictions which are correlated with the target, and the only thing the meta model needs to do is find a way to combine them, which is normally not that complicated. Therefore, the meta model is quite often generally simpler; if I were to express this in a random forest context, it would have a lower depth than the best one you found in your base models.

This was the end of the session; here we discussed stacking. In the next one, we will discuss a very interesting concept that extends stacking to multiple levels, called StackNet. So stay tuned.