Hello everyone. This is Marios Michailidis, and we will continue our discussion of ensemble methods. Previously, we saw some simple averaging methods. This time, we'll discuss bagging, which is a very popular and efficient form of ensembling.

What is bagging? Bagging refers to averaging slightly different versions of the same model as a means to improve predictive power. A common and quite successful application of bagging is the random forest, where you run many different versions of decision trees in order to get a better prediction.
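As a concrete illustration of that idea, here is a minimal sketch, assuming scikit-learn and a synthetic dataset (the data, variable names, and parameter values are illustrative assumptions, not from the lecture), comparing a single decision tree with a random forest that averages many randomized trees.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Illustrative synthetic data, purely for the sketch
X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# A single decision tree: one version of the model
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# A random forest: many slightly different (randomized) trees, averaged
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

print("single tree  :", accuracy_score(y_test, tree.predict(X_test)))
print("random forest:", accuracy_score(y_test, forest.predict(X_test)))
```

On data like this, the averaged forest typically scores better and more stably than the single tree, which is exactly the effect bagging is after.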
Why should we consider bagging? Generally, in the modeling process there are two main sources of error: errors due to bias, often referred to as underfitting, and errors due to variance, often referred to as overfitting. To understand this better, I'll give you two opposite examples, one with high bias and low variance, and one with high variance and low bias.

Let's take an example of high bias and low variance. Say we have a person who is young, less than 30 years old, we know this person is quite rich, and we are trying to predict whether he will buy an expensive car, let's say a racing car. Our model has high bias if it says, "this person is young, so I don't think he is going to buy an expensive car." What the model has done here is fail to explore the deeper relationships within the data: it doesn't matter that this person is young if he has a lot of money when it comes to buying a car. It hasn't explored other relationships; in other words, it has underfitted. However, this also comes with low variance, because the relationship it did capture, that a young person generally doesn't buy an expensive car, is generally true, so we would expect this rule to generalize well enough to unseen data. Therefore, the variance is low in this example.

Now let's look at the other way around, an example with high variance and low bias. Let's assume we have a person. His name is John, he lives in a green house, he has brown eyes, and we want to predict whether he will buy a car. A model that has gone this deep to find such relationships actually has low bias, because it has really explored a lot of information in the training data. However, it is making the mistake of assuming that every person with these characteristics is going to buy a car; it generalizes something that it shouldn't. In other words, it has over-exhausted the information in the training data, and the patterns it found are not significant. So here we have high variance but low bias.

If we were to visualize the relationship between prediction error and model complexity, it would look like this: when we begin training the model, the error on the training data gets reduced, and the same happens on the test data, because the predictions are simple and easily generalizable. However, after a point, any further improvements in the training error are no longer reflected in the test data. This is the point where the model starts over-exhausting the information and creates predictions that do not generalize.

This is where bagging comes into play and offers its utmost value. By making slightly different, or let's say randomized, versions of the same model, we ensure that the predictions do not have very high variance: they are generally more generalizable, and we don't over-exhaust the information in the training data. At the same time, as we saw before, when you average slightly different models you are generally able to get better predictions, and with, say, 10 models we are still able to capture quite significant information about the training data. This is why bagging tends to work quite well, and personally I always use bagging. When I say "I fit a model," I have actually fit a bagged version of that model, so in practice several slightly different models.

Which parameters are associated with bagging? The first is the seed. Many algorithms have some randomized procedures, so by changing the seed you ensure the models are built slightly differently. At the same time, you can run each model on fewer rows, or you could use bootstrapping. Bootstrapping is different from row subsampling in the sense that you create an artificial dataset by sampling rows with replacement, so a given row of the training data might appear, let's say, three or four times; you create a random dataset from the training data.
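To make that distinction concrete, here is a small self-contained sketch using NumPy; the toy arrays and variable names are purely illustrative assumptions, not from the lecture.

```python
import numpy as np

# Toy training data, purely for illustration
X_train = np.arange(20).reshape(10, 2)   # 10 rows, 2 features
y_train = np.arange(10) % 2

rng = np.random.RandomState(1)
n_rows = X_train.shape[0]

# Row subsampling: draw a fraction of the rows WITHOUT replacement,
# so each selected row appears at most once.
sub_idx = rng.choice(n_rows, size=int(0.8 * n_rows), replace=False)
X_sub, y_sub = X_train[sub_idx], y_train[sub_idx]

# Bootstrapping: draw n_rows rows WITH replacement, keeping the original size,
# so some rows may appear two, three, or four times and others not at all.
boot_idx = rng.choice(n_rows, size=n_rows, replace=True)
X_boot, y_boot = X_train[boot_idx], y_train[boot_idx]
```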
A different form of randomness can be introduced with shuffling. There are some algorithms which are sensitive to the order of the data, so by changing the order you ensure that the models become quite different. Another way is taking a random sample of columns, so you build models on different features, or different variables, of the data. Then you have model-specific parameters; for example, with a linear model you could build, let's say, 10 different logistic regressions with slightly different regularization parameters.

Obviously, you can also control the number of models you include in your ensemble, or, as we call them in this case, bags. Normally we put a value of more than 10 here, but in principle adding more bags doesn't hurt you; it makes the results better, although after some point performance starts plateauing. So there is a cost-benefit trade-off with time, but in principle more bags is generally better. Optionally, you can also apply parallelism: bagging models are independent of each other, which means you can build many of them at the same time and make full use of your computational power.

Now we can see an example of bagging, but before I do that, just to let you know that the bagging estimators scikit-learn has in Python are actually quite cool, so I recommend them. This is typically about 15 lines of code that I use quite often. They seem really simple, but they are actually quite efficient. Assuming you have a training and a test dataset and a target variable, you first specify some bagging parameters: which model am I going to use? A random forest. How many bags am I going to run? 10. What will be my seed? 1. Then you create an empty object that will hold the predictions, and you run a loop for as many bags as you have specified. In each iteration of the loop you repeat the same steps: you change the seed, you fit the model, you make predictions on the test data, and you save those predictions. At the end, you just take the average of these predictions. A sketch of this loop is shown below.
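Here is a minimal sketch of that loop, assuming scikit-learn and NumPy; the synthetic dataset, the variable names (X_train, X_test, y_train, bagged_prediction), and the forest's settings are illustrative assumptions rather than the lecture's own code, while the model choice, the 10 bags, and the seed of 1 follow the description above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Illustrative data; in practice you would use your own train/test sets
X, y = make_regression(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=1)

# Bagging parameters, as described in the lecture
model = RandomForestRegressor(n_estimators=100)  # the model to bag
bags = 10                                        # how many bags to run
seed = 1                                         # the base seed

# Empty object that will hold the bagged predictions
bagged_prediction = np.zeros(X_test.shape[0])

for n in range(bags):
    model.set_params(random_state=seed + n)  # change the seed for each bag
    model.fit(X_train, y_train)              # fit the model
    preds = model.predict(X_test)            # predict on the test data
    bagged_prediction += preds               # save (accumulate) the predictions

# Average the predictions over all bags
bagged_prediction /= bags
```

For comparison, the scikit-learn bagging estimators mentioned above, BaggingRegressor and BaggingClassifier, wrap essentially the same idea: they train many randomized copies of a base estimator and average their predictions, with parameters such as n_estimators, max_samples, and max_features.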
This is the end of the session. In this session, we discussed bagging as a popular form of ensembling. We saw how bagging relates to variance and bias, and we also went through an example of how to use it. Thank you very much. In the next session we will describe boosting, which is also very popular, so stay tuned and have a good day.