Hello everyone, this is Marios Michailidis, and this is the first video in a series on ensemble methods for machine learning. To tell you a bit about me, I work as a Research Data Scientist at H2O.ai. In fact, my PhD is about ensemble methods, and I used to be ranked number one on Kaggle; ensemble methods greatly helped me reach that spot. So you might find the course interesting.

So what is ensemble modelling? With this term, we refer to combining many different machine learning models in order to get a more powerful prediction. Later on we will see examples of this happening: we combine different models and we do get better predictions. There are various ensemble methods. Here we'll discuss a few, those that we encounter quite often in predictive modelling competitions, and they tend to be, in general, quite competitive. We will start with simple averaging methods, then go to weighted averaging methods, and we will also examine conditional averaging. Then we will move to some more typical ones like bagging, or the very, very popular boosting, then stacking and StackNet, which is the result of my research. But as I said, this will be a series of videos, and we will initially start with the averaging methods.

So, to help you understand a bit more about the averaging methods, let's take an example. Let's say we have a variable called age, as in age in years, and we try to predict it. We have a model that yields predictions for age. Let's assume that the relationship between the two, the actual age and our prediction, looks as in the graph. You can see that the model boasts quite a high R-squared value of 0.91, but it doesn't do equally well over the whole range of values. When age is less than 50, the model actually does quite well. But when age is more than 50, you can see that the average error is higher.

Now let's take another example. Let's assume we have a second model that also tries to predict age, but this one looks like that. As you can see, this model does quite well when age is higher than 50, but not so well when age is less than 50; nevertheless, it again scores 0.91. So we have two models with similar predictive power, but they look quite different. It's quite obvious that they do better in different parts of the distribution of age.

So what will happen if we try to combine these two with a simple averaging method, in other words, just take (model 1 + model 2) / 2? The end result will look as in the new graph. The R-squared has moved to 0.95, which is a considerable improvement versus the 0.91 we had before, and as you can see, on average the points tend to be closer to reality, so the average error is smaller. However, the combined model doesn't do as well as the individual models in the areas where each of them was doing really well; it only does better on average. This is something we need to understand: there is potentially a better way to combine these models.

We could try to take a weighted average. Say we take 70% of the first model's prediction and 30% of the second model's prediction, in other words, model 1 × 0.7 + model 2 × 0.3, and the end result would look as in the graph. You can see the R-squared is no better, and that makes sense, because the models have quite similar predictive power and it doesn't make sense to rely more on one of them.
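To make the two combinations concrete, here is a minimal sketch in Python. The data is not from the video: the synthetic age target, the noise levels, and the names pred1 and pred2 are all assumptions, chosen only so that model 1 is stronger below 50 and model 2 is stronger above 50, as in the graphs described above.

```python
# A minimal sketch of simple and weighted averaging. The synthetic
# "age" target and the two model predictions are assumptions made
# so the example can run; they are not the data from the video.
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
age = rng.uniform(10, 90, size=1000)          # true target: age in years

# Model 1: small errors below age 50, large errors above.
pred1 = age + rng.normal(0.0, np.where(age < 50, 2.0, 12.0))
# Model 2: large errors below age 50, small errors above.
pred2 = age + rng.normal(0.0, np.where(age < 50, 12.0, 2.0))

print("model 1 R^2:", r2_score(age, pred1))
print("model 2 R^2:", r2_score(age, pred2))

# Simple average: (model 1 + model 2) / 2
simple = (pred1 + pred2) / 2
print("simple average R^2:", r2_score(age, simple))

# Weighted average: model 1 * 0.7 + model 2 * 0.3
weighted = 0.7 * pred1 + 0.3 * pred2
print("weighted average R^2:", r2_score(age, weighted))
```

With these (assumed) noise levels the simple average scores noticeably higher than either model alone, while the 70/30 weighting behaves as described above: it is no better than the simple average, just closer to model 1.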
And it is also quite clear that the result looks more like model 1, because it has better predictions when age is less than 50 and worse predictions when age is more than 50. As a theoretical exercise, what is the best we could get out of these two models? We know we have one model that scores really well when age is less than 50, and another that scores really well when age is more than 50. So ideally, we would like to get to something like that. This is how we leverage the two models in the best possible way, with a simple conditioning rule: if age is less than 50, use the first model, otherwise use the other. We will see later on that there are ensemble methods that are very good at finding these relationships between two or more predictions and the target variable. But this will be a topic for another discussion. Here we discussed simple averaging methods. Hopefully you found it useful, and stay here for the next session to come. Thank you very much.
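For completeness, here is the conditional combination as a sketch, reusing the hypothetical age, pred1, and pred2 arrays from the snippet above. Note that it conditions on the true age, which is exactly why it is only a theoretical best: at prediction time the target is unknown, and the ensemble methods discussed later have to learn such splits from the predictions themselves.

```python
# Theoretical best for this example: pick whichever model is strong
# in each region. Conditioning on the true age is only possible in
# hindsight, so this is an upper bound, not a usable model.
conditional = np.where(age < 50, pred1, pred2)
print("conditional R^2:", r2_score(age, conditional))
```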