Now, I'll give you a few tips that are going to help you make a good ensemble, or at least help you get started. As I mentioned before when I talked about stacking, what's quite important is to introduce diversity, and one way to do this is through the algorithms you use. An architecture that I have found quite useful is to always make two or three gradient boosted trees. What I tend to do here is either try different implementations, or make one with big depth, one with medium depth, and one with low depth, and then tune the other parameters so that they have as similar performance as possible, hoping that the end result will be models which are fairly diverse from each other.

Another thing that I like is using neural nets, either from Keras or PyTorch. Again, what I try to do is make one which is fairly deep, say three hidden layers, one which is in the middle, like two hidden layers, and one that has, let's say, only one hidden layer. Again, I try to diversify, I try to make them slightly different, in order to get new information. Then I use a few ExtraTrees or Random Forests; most of the time they work quite well and normally add value. I also add a few linear models, like ridge regression, and I also like the linear support vector machine from scikit-learn. KNN models tend to add quite nice value in many problems, which is surprising, because if you look at them individually they rarely have good performance compared to, say, XGBoost, yet they generally add quite some value in a meta-modeling context. I personally like factorization machines, I find them quite useful, especially libFM, which factorizes all pairwise interactions. And if your data permits, so if your data is not too big, I also find support vector machines with some sort of non-linear kernel, like RBF, useful; they tend to work quite well, especially in regression.
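As a rough illustration of this kind of first layer, here is a minimal sketch using scikit-learn. The model names, hyper-parameter values, and the assumption of a regression task are all illustrative, not a fixed recipe.

```python
# A minimal sketch of a diverse first-layer "library" of models, as described
# above. Assumes a generic tabular regression task; the hyper-parameter values
# are illustrative placeholders.
from sklearn.ensemble import (GradientBoostingRegressor, ExtraTreesRegressor,
                              RandomForestRegressor)
from sklearn.linear_model import Ridge
from sklearn.svm import LinearSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

first_layer = {
    # Two or three gradient boosted trees with different depths; the other
    # parameters are tuned so their scores end up roughly similar.
    "gbm_deep":    GradientBoostingRegressor(max_depth=7, n_estimators=200, learning_rate=0.02),
    "gbm_medium":  GradientBoostingRegressor(max_depth=5, n_estimators=300, learning_rate=0.03),
    "gbm_shallow": GradientBoostingRegressor(max_depth=3, n_estimators=500, learning_rate=0.05),
    # Neural nets of different depths: three, two, and one hidden layer.
    "nn_deep":     MLPRegressor(hidden_layer_sizes=(64, 64, 64), max_iter=500),
    "nn_medium":   MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500),
    "nn_shallow":  MLPRegressor(hidden_layer_sizes=(64,), max_iter=500),
    # Tree ensembles, linear models, a linear SVM and a KNN for extra diversity.
    "extra_trees":   ExtraTreesRegressor(n_estimators=300),
    "random_forest": RandomForestRegressor(n_estimators=300),
    "ridge":         Ridge(alpha=1.0),
    "linear_svm":    LinearSVR(C=0.1),
    "knn":           KNeighborsRegressor(n_neighbors=32),
}

# Each of these models would later be trained out-of-fold and its predictions
# fed as features to the next (meta) layer.
```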
The other way you can introduce diversity is with the transformations you apply to your input data. You can actually run exactly the same model on slightly different versions of the input data, and that is enough to generate diversity. So, what I look at for categorical features is one-hot encoding, label encoding (that is, replacing the categorical values with an index), likelihood or target encoding, or replacing categories with frequencies or counts. For numerical features, I may take care of outliers or not, I bin the variables into different ranges, I use derived, smoothed versions of the variables, or percentiles, or scaling; these are different ways I change my numerical features for different models. Then I also explore interactions, like column one multiplied by column two, or column one plus column two, and I can go up to three or four levels to explore all possible interactions. Another way to explore interactions is with group-by statements, where you say: given all categories of a categorical feature, compute the average of another variable, for example. In certain situations this works quite well. The other thing I would do is try unsupervised techniques like k-means, SVD, or PCA on the numerical features. Again, this tends to add value quite often.

Now, on every subsequent layer, what you need to keep in mind is that you need to make your algorithms smaller, shallower, more constrained. What does this mean? In gradient boosted trees, it means you need to use very small depth, like two or three. For linear models, you need to use high regularization. ExtraTrees, just don't make them too big; they tend to work quite well here. Shallow neural networks: again, I normally use one layer, maximum two layers, with not that many hidden neurons. We can try KNN with Bray-Curtis distance, or sometimes it's actually better to brute-force the best linear weights we could use: in a cross-validation, try to find the best linear combination of the models, as in the sketch below.
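Here is a minimal sketch of that brute-force search for blending weights, assuming you already have an array of out-of-fold predictions (`oof_preds`, one column per model) and the true target `y`; the names and the grid step are illustrative.

```python
# Brute-force a linear blend of a handful of models on out-of-fold predictions.
# `oof_preds` is an (n_samples, n_models) array, `y` the true target.
import itertools
import numpy as np
from sklearn.metrics import mean_squared_error

def best_linear_weights(oof_preds, y, step=0.05):
    """Grid-search non-negative weights that sum to 1 (fine for a few models)."""
    n_models = oof_preds.shape[1]
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    best_score, best_weights = np.inf, None
    for combo in itertools.product(grid, repeat=n_models):
        if abs(sum(combo) - 1.0) > 1e-6:   # keep only weights summing to 1
            continue
        blend = oof_preds @ np.array(combo)
        score = mean_squared_error(y, blend)
        if score < best_score:
            best_score, best_weights = score, combo
    return best_weights, best_score

# Example call (random data, just to show the signature):
# weights, score = best_linear_weights(np.random.rand(100, 3), np.random.rand(100))
```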
We can also deploy different feature engineering at this subsequent level. One thing we could do is create pairwise differences between the models' predictions. This can help because the predictions tend to be quite correlated, so when you create the differences you essentially force the meta-model to focus on what each new model brings. You can also create row-wise statistics, like averages or standard deviations across all the models' predictions. This is almost like building an ensemble by hand: you create some ensemble features yourself. You can also deploy standard feature selection techniques: any technique you would use to find which features are important can be used here to find which models are important, and to exclude the ones that are not.

A rule which, empirically, let's say as a rule of thumb, I've found to work quite well, not with 100 percent confidence but as a general idea, is that for every seven and a half models in one layer, we add one model in the subsequent layer. So if we have seven models, we'll have one meta-model; if we have fifteen models, we will have two meta-models, and so on.

What we need to be very mindful of is that, even though we use this hold-out mechanism, we still might introduce leakage. The way we can control this is by selecting the right K, the K I mentioned in the cross-validation. When we select a very high value there, each model uses more training data when it makes its predictions, and therefore the meta-features might not generalize very well; at the same time, it exhausts more of the information in the training data. So again, there is a bias-variance trade-off you are trying to balance. There is no easy way to spot a mistake here. Normally, you have a test data set, and if you see an improvement in your cross-validation that you don't see on your test data, then you need to go back and reduce the number of K folds. Hopefully this will generalize better; at least, that's an approach that has worked in practice.
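As a rough sketch of where this K enters, here is how out-of-fold predictions for one first-layer model might be produced; `model`, `X`, and `y` are placeholders for any scikit-learn-style estimator and a numpy dataset.

```python
# Out-of-fold predictions for the meta-level: each training row is predicted
# by a model that never saw it, and K controls the trade-off discussed above.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def out_of_fold_predictions(model, X, y, k=5, seed=0):
    """Return one out-of-fold prediction per training row for a single model."""
    oof = np.zeros(len(X))
    folds = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, valid_idx in folds.split(X):
        m = clone(model)
        m.fit(X[train_idx], y[train_idx])          # fit on K-1 folds
        oof[valid_idx] = m.predict(X[valid_idx])   # predict the held-out fold
    return oof

# Larger k means each fold's model sees more training data; if cross-validation
# gains stop showing up on the test set, reducing k is one way to pull back
# the leakage.
```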
I'll list a few pieces of software you can use for stacking. One is StackNet, which is a product of my research; you can give it a shot if you want. Another thing you can try is Stacked Ensembles from H2O. There is also a newer tool called Xcessiv, in Python, where you can also create quite diverse and large ensembles.

A few more things to know about StackNet, if you want to use it, is that it now supports many of the common machine learning tools we use, like XGBoost, LightGBM, and H2O. So you can pretty much have all the great tools available in order to build a strong ensemble. An interesting addition, which we didn't have much chance to discuss but is nevertheless quite interesting, is that you can run classifiers in a regression problem, and vice versa. This means that instead of predicting age, I could be predicting whether this person will live more than 50 years or not. This tends to work quite well, because it makes the model focus on certain areas, and the meta-model is able to utilize this information to make better predictions for age. I've found it useful quite often, so this is something you should explore.

Generally, the software already has many top-10 solutions behind it, and not just by me, so it has been tested. In the examples section, I think there is a really interesting example with a very popular Kaggle competition which was hosted by Amazon. This example uses StackNet, and you can see how you can get into the top 10. In principle, this is a very nice competition: it doesn't have very big data, nor very small, you can try lots of transformations, especially with the categorical data, and it is a very good place to start.

The other thing I wanted to say is that StackNet also has an educational flavor; I made it as such, I had this focus in mind. So, if you go to the parameters section, where it lists all the different tools, you can find sections that tell you which parameters are the most important. This is based on my experience. For example, in XGBoost, num_round is important, eta is important. You can take this information and use it even outside StackNet.
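For instance, here is a minimal sketch of setting those two parameters with the xgboost Python package directly; the data and the specific values are placeholders, not recommendations.

```python
# Illustrative use of the two xgboost parameters called out above (num_round
# and eta) from Python, on throwaway random data.
import numpy as np
import xgboost as xgb

X = np.random.rand(200, 5)
y = np.random.rand(200)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "eta": 0.05,                       # learning rate: smaller eta usually needs more rounds
    "max_depth": 4,
    "objective": "reg:squarederror",
}
# num_boost_round is the "num_round" referred to above.
booster = xgb.train(params, dtrain, num_boost_round=500)
```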
The parameters are generally the same whether you use these libraries inside StackNet or directly from Python, so if you don't know where to start, you can look there, see which parameters are important, and focus on them in order to try and get a good result.

Before we close this session, there are a few things I'd like to tell you. Go out there and apply what you've learned. There is no such thing as learning only in theory; the best way to learn is to bleed on the battlefield. Choose a competition. You may start with some that are labelled as tutorials, and then you can move on to the real competitions. This is how you learn; you need to get practical experience, obviously. Don't be demoralized if you see there's still a gap between you and the top people, because it does take some time to adjust. You need to learn the dynamics, you need to understand how you need to work, where to optimize your work, how to maximize the intensity of your effort. So, it takes a bit of time, but you'll get there. My main point is: don't get disappointed, you'll definitely get there.

Something that has always helped me is to save my code. Let's say that by the end of a competition it has worked reasonably well; that is good, because I can then take this code into the next competition and try to improve it. This helps to gradually build a much stronger pipeline, and at the same time it saves time.

Something else that has helped me personally is to seek collaborations. Generally, I think there are two elements to it. One is that you definitely improve your result, because every person sees the problem from a different angle and is able to extract different information; therefore, when you join forces the score is better. But it's also more fun, and since you're doing this anyway, you might as well enjoy it.
The other thing I need to highlight is that, generally, you need to stay connected with the forums, the code, and the kernels, because there might be tips, there might be some cutting-edge solutions that come out, and they can significantly shift the leaderboard. So generally, you need to keep reading and have that in mind.

This is the last video in a series where we have examined ensemble methods. We looked at simple averaging, then we went on to bagging, boosting, stacking, and multi-layer stacking. Hopefully, you found this useful. Thank you for bearing with me all this time; I also enjoyed it. Now go out there, make us proud, and who knows? The next top person on the leaderboard might be you.