Continuing our discussion of ensemble methods, next up is stacking. Stacking is a very, very popular form of ensembling in predictive modeling competitions, and I believe that in most competitions there is some form of stacking at the end, in order to boost your performance as much as you can.

Going through the definition of stacking: it essentially means making several predictions with hold-out data sets, and then collecting, or stacking, these predictions to form a new data set, where you can fit a new model on this newly-formed data set of predictions.

I would like to take you through a very simple, I would say naive, example to show you how, conceptually, this can work. We have so far seen that you can use previous models' predictions to affect a new model, but always in relation to the input data. This is a new concept, because we are only going to use the predictions of some models in order to make a better model.

So let's see how this could work in a real-life scenario. Let's assume we have three kids; let's name them LR, SVM, and KNN, and they argue about a physics question. Each one believes the answer to the question is different: the first one says 13, the second 18, the third 11, and they don't know how to resolve this disagreement. So they do the honorable thing: they take an average, which in this case is 14.

You can almost see that the kids are different models here. They take input data, in this case the physics question, they process it based on historical information, and they output an estimate, a prediction. Have they done it optimally, though?

Another way to look at this is to say there was a teacher, Miss DL, who had seen this discussion, and she decided to step in. While she didn't hear the question, she does know the students quite well; she knows the strengths and weaknesses of each one, and she knows how well they have done historically on physics questions. And from the range of values they have provided, she is able to give an estimate. Let's say that in this context, she knows that SVM is really good at physics, and her father works in the Department of Physics of Excellence. Therefore, SVM should have a bigger contribution to this ensemble than every other kid, and so the answer is 17.

And this is how a meta model works: it doesn't need to know the input data. It just knows how the models have done historically, in order to find the best way to combine them. And this can work quite well in practice.
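To see the arithmetic of that analogy, here is a minimal sketch in Python; the weights are hypothetical, chosen only to reproduce the numbers in the story, whereas in real stacking a meta model would learn them from historical performance:

```python
import numpy as np

# The three kids' answers to the physics question.
preds = np.array([13.0, 18.0, 11.0])  # LR, SVM, KNN

# The kids' solution: a plain average.
print(preds.mean())                   # 14.0

# Miss DL's solution: weight each kid by historical skill.
# Hypothetical weights, with SVM dominating because she is
# known to be good at physics.
weights = np.array([0.2, 0.8, 0.0])
print(weights @ preds)                # 0.2*13 + 0.8*18 + 0.0*11 = 17.0
```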
So, let's go more into the methodology of stacking. Wolpert introduced stacking in 1992 as a meta modeling technique for combining different models. It consists of several steps.

First, let's assume we have a train data set, and we divide it into two parts: a training part and a validation part. Then you take the training part and train several models on it, and you make predictions for the second part, the validation data set. Then you collect, or stack, all these predictions to form a new data set, and you use this as input to a new model. Normally we call this new model the meta model, and the models we trained before it we call base models or base learners.

If you're still confused about stacking, consider the following animation. Let's assume we have three data sets: A, B, and C. In this case, A will serve the role of the training data set, B will be the validation data set, and C will be the test data set where we want to make the final predictions. They all have a similar structure: four features, and one target variable we try to predict.

So we can choose an algorithm to train a model on data set A, and then we make predictions for B and C at the same time. Now we take these predictions and put them into new data sets: a data set B1 to store the predictions for the validation data, and a data set C1 to store the predictions for the test data.

Then we repeat the process with another algorithm. Again, we fit it on data set A, we make predictions on B and C at the same time, and we save these predictions into the newly-formed data sets. We essentially append them; we stack them next to each other, and this is where stacking takes its name. And we can continue this even further with a third algorithm: again the same, fit on A, predict on B and C, save the predictions.

What we do then is take the target variable for the B data set, the validation data set, which we already know. We fit a new model on B1 against the target of the validation data, and then we make predictions on C1. And this is how we combine different models with stacking, to hopefully make better predictions for the test, or unobserved, data.
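Here is a minimal sketch of that procedure as a loop, assuming scikit-learn-style models and numpy arrays; the names XA, yA (data set A), XB, yB (data set B), and XC (data set C) are hypothetical:

```python
import numpy as np

def stack(base_models, meta_model, XA, yA, XB, yB, XC):
    """Fit base models on A, stack their predictions on B and C,
    then fit the meta model on B1 and predict on C1."""
    B1_cols, C1_cols = [], []
    for model in base_models:
        model.fit(XA, yA)                  # fit on A
        B1_cols.append(model.predict(XB))  # predict on B
        C1_cols.append(model.predict(XC))  # predict on C
    B1 = np.column_stack(B1_cols)          # stacked validation predictions
    C1 = np.column_stack(C1_cols)          # stacked test predictions
    meta_model.fit(B1, yB)                 # meta model learns to combine them
    return meta_model.predict(C1)          # final predictions for the test data
```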
Let us now go through a simple example in Python, in order to better understand, in code, how this would work. It is quite simple, so hopefully even people not very experienced with Python can follow it.

The main logic is that we will use two base learners on some input data, a random forest and a linear regression, and then we will try to combine their results using a meta learner, which again will be a linear regression. Let's assume we again have a train data set, a target variable for this data set, and a test data set.

Maybe the code seems a bit intimidating, but we will go through it step by step. What we do initially is take the train data set and split it into two parts, so we create a training and a valid data set out of it, and we also split the target variable, creating ytraining and yvalid. Here we split 50-50, but we could have chosen a different proportion.

Then we specify our base learners: model1 is the random forest in this case, and model2 is a linear regression. We then fit both models using the training data and the training target, and we make predictions for the validation data with both models; at the same time, we make predictions for the test data with both models. We save these as preds1 and preds2 and, for the test data, test_preds1 and test_preds2.

Then we collect, or stack, the predictions to create two new data sets: one for the validation predictions, called stacked_predictions, which consists of preds1 and preds2; and one for the test predictions, called stacked_test_predictions, where we stack test_preds1 and test_preds2.

Then we specify a meta learner, let's call it meta_model, which is a linear regression. We fit this model on the predictions made on the validation data, against the target of the validation data, which has been our hold-out set all this time. And then we can generate predictions for the test data by applying this model to stacked_test_predictions. This is how it works.
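Putting those steps together, a sketch of the whole example might look like the following; the random data at the top is a hypothetical stand-in for the train, target, and test inputs described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the train data, its target, and the test data.
rng = np.random.RandomState(0)
train, y, test = rng.rand(200, 4), rng.rand(200), rng.rand(100, 4)

# Split the train set and its target 50-50 into training and valid parts.
training, valid, ytraining, yvalid = train_test_split(
    train, y, test_size=0.5, random_state=0)

# The two base learners.
model1 = RandomForestRegressor(random_state=0)
model2 = LinearRegression()

# Fit both base learners on the training part.
model1.fit(training, ytraining)
model2.fit(training, ytraining)

# Make predictions for the validation data and the test data.
preds1 = model1.predict(valid)
preds2 = model2.predict(valid)
test_preds1 = model1.predict(test)
test_preds2 = model2.predict(test)

# Stack the predictions column-wise to form the new data sets.
stacked_predictions = np.column_stack((preds1, preds2))
stacked_test_predictions = np.column_stack((test_preds1, test_preds2))

# Fit the meta learner on the stacked validation predictions and
# generate the final predictions for the test data.
meta_model = LinearRegression()
meta_model.fit(stacked_predictions, yvalid)
final_predictions = meta_model.predict(stacked_test_predictions)
```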
Now, I think this is a good time to revisit an old example we used in the first session, about simple averaging. If you remember, we had one prediction that was doing quite well at predicting age when age was less than 50, and another prediction that was doing quite well when age was more than 50. And we did something tricky: we said if age is less than 50, we will use the first one, and if age is more than 50, we will use the other one. The reason this is tricky is that we used the target information to make this decision, whereas in the real world this is what you are trying to predict; you don't know it. We did it in order to show the theoretical best we could get.

Taking the same predictions and applying stacking, this is what the end result actually looks like. As you can see, it has done pretty similarly. The only area where there is some error is around the threshold of 50. And that makes sense: because the meta model doesn't see the target variable, it is not able to identify this cutoff at 50 exactly. It tries to do it based only on the input models, so there is some overlap around this area. But you can see that stacking is able to identify this relationship and use it to make better predictions.

There are certain things you need to be mindful of when using stacking. One is when you have time-sensitive data, as in, let's say, time series: you need to formulate your stacking so that you respect time. What I mean is, when you create your train and validation data, you need to make certain that your train data is in the past and your validation data is in the future, and ideally your test data is also in the future. You need to respect this time element in order to make certain your model generalizes well.
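For such data, the 50-50 random split from the earlier example would leak future information into the base models. A minimal sketch of a time-respecting split, assuming a pandas data frame with a hypothetical 'date' column:

```python
import numpy as np
import pandas as pd

# Hypothetical time-stamped data set.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=400, freq='D'),
    'feature': rng.rand(400),
    'target': rng.rand(400),
}).sort_values('date')

# Split chronologically, never randomly: the base models fit on the past,
# the meta model trains on a later window, and the test data lies
# furthest in the future.
n = len(df)
training = df.iloc[: n // 2]           # past: fit the base models here
valid = df.iloc[n // 2 : 3 * n // 4]   # later: meta-model training data
test = df.iloc[3 * n // 4 :]           # latest: final predictions
```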
The other thing you need to look at is, obviously, single model performance; that is important. But what is also very important is model diversity: how different each model is from the others, and what new information each model brings to the table. Stacking, depending on the algorithms you use for it, can go quite deep into exploring relationships. It will find when a model is good and when a model is actually bad or fairly weak. So you don't need to worry too much about making all the models really strong; stacking can actually extract the juice from each prediction. Therefore, what you really need to focus on is: am I making a model that brings some new information, even though it is generally weak? And this is true; there have been many situations where I have had some quite weak models in my ensemble, compared to the top performers, and nevertheless they were adding a lot of value in stacking, because they were bringing in new information that the meta model could leverage.

Normally, you introduce diversity in two ways. One is by choosing a different algorithm, which makes sense: different algorithms capitalize on different relationships within the data. For example, a linear model will focus on linear relationships, while a non-linear model can better capture non-linear relationships, so their predictions will come out a bit different. The other is that you can run the same model, but on different transformations of the input data: either fewer features or a completely different transformation. For example, in one data set you may treat the categorical features with one-hot encoding; in another, you may just use label encoding, and the result will probably be a model that is very different.
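A minimal sketch of that second source of diversity, assuming a pandas data frame with a hypothetical categorical column 'city':

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data with one categorical and one numeric feature.
df = pd.DataFrame({'city': ['a', 'b', 'c', 'a', 'b', 'c'] * 20,
                   'num': range(120)})
y = [i % 7 for i in range(120)]

# View 1: one-hot encode the categorical feature.
X_onehot = pd.get_dummies(df, columns=['city'])

# View 2: label encode the same feature.
X_label = df.copy()
X_label['city'] = X_label['city'].astype('category').cat.codes

# The same algorithm trained on the two views yields two different,
# hopefully diverse, base learners for the stack.
model_a = RandomForestRegressor(random_state=0).fit(X_onehot, y)
model_b = RandomForestRegressor(random_state=0).fit(X_label, y)
```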
Generally, there is no limit to how many models you can stack, but you can expect performance to plateau after a certain number of models have been added. Initially, you will see a significant uplift in whatever metric you are testing on every time you add a model, but after some point the incremental uplift becomes fairly small. There is no way to know in advance exactly at how many models the plateau will start, but generally it is affected by how many features you have in your data, how much diversity you managed to introduce into your models, and quite often how many rows of data you have. So it is tough to know beforehand, but it is something to be mindful of: there is a point where adding more models does not add much value.

Also, because the meta model will only use the predictions of other models, we can assume that the base models have already done the deep work of scrutinizing the data, and therefore the meta model doesn't need to be so deep. Normally, you have predictions which are correlated with the target, and the only thing the meta model needs to do is find a way to combine them, and that is normally not so complicated. Therefore, quite often the meta model is generally simpler; if I were to express this in a random forest context, it would have a lower depth than the best one you found in your base models.

This was the end of the session; here we discussed stacking. In the next one, we will discuss a very interesting concept that extends stacking to multiple levels, called StackNet. So stay tuned.