Hello, everyone. This is Marios Michailidis, and today we'll continue our discussion of ensemble methods, specifically with a very popular form of ensembling called boosting. What is boosting? Boosting is a form of weighted averaging of models, where each model is built sequentially in a way that takes into account the previous models' performance. To understand this better, remember that earlier we discussed bagging, and we saw that we can average many different models, which are independent of each other, in order to get a better prediction. Boosting does something different. Every model we add sequentially to the ensemble takes into account how well the previous models have done, in order to make better predictions. There are two main types of boosting algorithms. One is based on weights, and the other is based on the residual error, and we will discuss both of them one by one.

For weight boosting, it's better to see an example in order to understand it. Let's say we have a tabular dataset with four features; let's call them x0, x1, x2, and x3, and we want to use these features to predict a target variable, y. What we are going to do in weight boosting is fit a model and generate predictions; let's call them pred. These predictions have a certain margin of error. We can calculate the absolute error, which is the absolute value of y minus our prediction. You can see there are predictions which are very, very far off, like row number five, but there are others, like row number six, where the model has actually done quite well. Based on this, we generate a new column, or a new vector, a weight column, and we say that this weight is 1 plus the absolute error. There are different ways to calculate this weight; I'm just giving you this as an example, but the overall principle is very similar. What you're going to do next is fit a new model using the same features and the same target variable, but you're also going to add this weight. What the weight says to the model is: I want you to put more significance on a certain row. You can almost interpret the weight as the number of times a certain row appears in the data. So if the weight was 2, this means that this row appears twice and therefore has a bigger contribution to the total error. You can keep repeating this process: you calculate a new error, you calculate new weights, and this is how you sequentially add models to the ensemble that take into account how well the previous models have done, focusing most on the cases where the previous models were most wrong.

There are certain parameters associated with this type of boosting. One is the learning rate. We can also call it shrinkage or eta; it has different names. If you recall, I described boosting as a form of weighted averaging, and this is true, because that is essentially what the learning rate does. What we say is: every new model we build, we shouldn't trust it 100 percent; we should trust it only a little bit. This ensures that no single model has too much contribution and ends up producing something that doesn't generalize well. So we don't over-trust one model, we trust many models a little bit, and this is very good for controlling overfitting.
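To make the re-weighting idea concrete, here is a minimal sketch of two rounds of weight-based boosting, assuming scikit-learn decision trees as the base model and the illustrative "1 plus absolute error" weighting rule from the example above. The function name, the choice of base model, and the simple weighted-average combination are my own illustrative assumptions; real algorithms such as AdaBoost use more principled weight and combination rules.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def weight_boost_two_rounds(X, y, shrinkage=0.1):
    """Two rounds of weight-based boosting, trusting the new model only a little."""
    # Round 1: fit a base model and measure how wrong it is on each row.
    model_1 = DecisionTreeRegressor(max_depth=3)
    model_1.fit(X, y)
    pred_1 = model_1.predict(X)
    abs_error = np.abs(y - pred_1)

    # Weighting rule from the example above: weight = 1 + absolute error,
    # so rows the first model got badly wrong count "more than once".
    weights = 1.0 + abs_error

    # Round 2: same features and target, but weighted toward the hard rows.
    model_2 = DecisionTreeRegressor(max_depth=3)
    model_2.fit(X, y, sample_weight=weights)
    pred_2 = model_2.predict(X)

    # Weighted average: trust the new model only a little (the learning rate idea).
    return (1.0 - shrinkage) * pred_1 + shrinkage * pred_2

# Example usage with a tiny synthetic dataset (hypothetical data):
# X = np.random.rand(200, 4); y = X[:, 0] + 0.1 * np.random.randn(200)
# blended_pred = weight_boost_two_rounds(X, y)
```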
The second parameter we look at is the number of estimators. This is quite important, and normally there is an inverse relationship with the learning rate: the more estimators we add to this type of ensemble, the smaller the learning rate we need to use. It is sometimes quite difficult to find the right values, and we do it with the help of cross-validation. Normally, we start with a fixed number of estimators, let's say 100, and then we try to find the optimal learning rate for these 100 estimators. Let's say, based on cross-validation performance, we find this to be 0.1. What we can do then is double the number of estimators to 200 and divide the learning rate by 2, so we put 0.05, and then we check performance again. The relationship may not be as linear as I described, and the best learning rate after doubling the estimators may be 0.04 or 0.06, but this is roughly the logic. This is how we work in order to increase the estimators and see whether more estimators give us better performance, without losing too much time trying to find the best learning rate from scratch every time.

Another thing we look at is the type of input model. In general, we can perform this kind of boosting with any type of estimator; the only condition is that it needs to accept weights in its fitting process, a weight that says how much we should rely on each row of our dataset. Then we have various boosting types. I roughly explained how we can use the weight as a means to focus on different rows, different cases the model has gotten wrong, but there are different ways to express this. For example, there are certain boosting algorithms that do not care about the margin of error; they only care whether the classification was correct or not. So there are different variations. One I really like is AdaBoost, and there is a very good implementation in sklearn where you can choose any input algorithm; I think it's really good. Another one I really like is normally used only with logistic regression, and there is a very good implementation in Weka for Java if you want to try it.

Now, let's move on to the other type of boosting, which has been the most successful. I believe that in almost any predictive modeling competition that did not involve image classification or video, this has been the dominant type of algorithm, the one that has actually won most of these challenges. So this type of boosting has been extremely successful. But what is it? I'll try to give you a similar example in order to explain the concept. Let's say we have the same dataset again, the same features, and we are again trying to predict a y variable. We fit a model and we make predictions. What we do next is calculate the error of these predictions, but this time not in absolute terms, because we are interested in the direction of the error. Then we take this error and make it the new y variable, so the error now becomes the new target variable, and we use the same features in order to predict this error. It's an interesting concept. If we wanted, let's say, to make a prediction for row number one, we would take our initial prediction and then add the new prediction, which is based on the error of the first prediction. So initially we predicted 0.75, and then the second model predicted an error of 0.2. To make a final prediction, we add one to the other: 0.75 plus 0.2 equals 0.95. If you recall, the target for this row was 1, so using two models we were able to get closer to the actual answer. This form of boosting works really, really well to minimize the error.
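Here is a minimal sketch of that residual idea, again assuming scikit-learn decision trees as the base model; the function name and the choice of two rounds are illustrative assumptions, not a reference implementation.

```python
from sklearn.tree import DecisionTreeRegressor

def residual_boost_two_rounds(X, y):
    """Two rounds of residual-based boosting: the second model predicts the
    signed error of the first, and the two predictions are simply added."""
    # Round 1: fit on the original target and predict.
    model_1 = DecisionTreeRegressor(max_depth=3)
    model_1.fit(X, y)
    pred_1 = model_1.predict(X)

    # The signed error becomes the new target; we keep the sign because the
    # direction of the error matters here (no absolute value).
    residual = y - pred_1

    # Round 2: same features, but the residual is now the target.
    model_2 = DecisionTreeRegressor(max_depth=3)
    model_2.fit(X, residual)
    pred_2 = model_2.predict(X)

    # Final prediction, e.g. 0.75 + 0.20 = 0.95 for the row discussed above.
    return pred_1 + pred_2
```

With the learning rate discussed next, the last line would become something like `pred_1 + learning_rate * pred_2`, so each new model only nudges the prediction part of the way toward the error.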
There are, again, certain parameters associated with this type of boosting. The first is again the learning rate, and it works pretty much as I explained before. What you need to take into account is how it is applied. Let's say we have a learning rate of 0.1. In the previous example, where the second model predicted 0.2, what we say is: I want to move my prediction in that direction by only 10 percent. Ten percent of 0.2 is 0.02, so this is how much we move toward the prediction of the error. This is a good way to control overfitting; again, we ensure we don't over-rely on one model.

Again, how many estimators you use is quite important. Normally, more is better, but you need to offset this with the right learning rate. You need to make certain that every model has the right contribution: if you intend to use many, then you need to make sure each model has a very, very small contribution. Again, you decide these parameters based on cross-validation, and the logic is very similar to what I explained before. Another thing that works really well is taking a subset of the rows or a subset of the columns when you build each model. Actually, there is no reason why we couldn't use this with the previous algorithm as well, but in practice it is more common with this type of boosting, and it generally works quite well. As for the input model, I have seen this method work really well with decision trees, but theoretically you can put in anything you want.

Again, there are various boosting types. I think the two most common, or most successful right now in a predictive modeling context, are the gradient-based one, which is essentially what I explained to you: you predict the error, and you don't move 100 percent in that direction if you apply a learning rate. The other very interesting one, which I have actually found very efficient, especially in classification problems, is DART. DART imposes a dropout mechanism in order to control the contribution of the trees. This is a concept borrowed from deep learning, where you say: every time I add a new estimator, I am not relying on all the previous estimators, but only on a subset of them. Just to give you an example, let's say we have a dropout rate of 20 percent. So far we have built 10 trees, or 10 models, and we try to build an 11th one. What we do is randomly exclude two of those trees when we generate the prediction used to build that 11th tree, or 11th model. By randomly excluding some models, by introducing this kind of randomness, it works as a form of regularization, and therefore it helps a lot in building a model that generalizes well to unseen data.

Because this type of boosting algorithm has been so successful, there have been many implementations that try to improve on different parts of it. One really successful implementation, especially in the competitive predictive modeling world, is XGBoost. It is very scalable, it supports many loss functions, and at the same time it is available in all the major programming languages for data science.
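Since XGBoost exposes most of the knobs just discussed, here is a hedged sketch of how they might be wired together through its scikit-learn style interface. Parameter names follow recent xgboost versions, the values are placeholders rather than recommendations, and `rate_drop` is only relevant when the DART booster is selected.

```python
# A sketch of a boosted-trees setup in XGBoost's scikit-learn style API,
# wiring together the parameters discussed above (values are placeholders).
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=200,      # number of sequential trees in the ensemble
    learning_rate=0.05,    # shrinkage / eta: trust each new tree only a little
    subsample=0.8,         # row subsampling for each tree
    colsample_bytree=0.8,  # column subsampling for each tree
    booster="dart",        # "gbtree" for plain gradient boosting, "dart" for dropout
    rate_drop=0.2,         # DART only: fraction of earlier trees dropped per round
)
# model.fit(X_train, y_train)  # then tune learning_rate / n_estimators with cross-validation
```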
Another good implementation is LightGBM. As the name suggests, it is very light and very fast. It is also available in many programming languages and supports many loss functions. Another interesting case is the Gradient Boosting Machine from H2O. What's really interesting about this implementation is that it can handle categorical variables out of the box, and it comes with a rich set of parameters, so you can control the modeling process quite thoroughly. Another interesting case, which is also fairly new, is CatBoost. What's really good about it is that it comes with a strong initial set of parameters, so you don't need to spend as much time tuning, which, as I mentioned before, can be quite a time-consuming process. It can also handle categorical variables out of the box. Finally, I really like the Gradient Boosting Machine implementation in scikit-learn, because you can put any scikit-learn estimator as the base. This is the end of this video. In the next session, we will discuss stacking, which is also very popular, so stay tuned.