Hi. In this video, we will cover basic approaches to feature preprocessing and feature generation for numeric features. We will understand how model choice impacts feature preprocessing, we will identify the preprocessing methods that are used most often, and we will discuss feature generation and go through several examples.

Let's start with preprocessing. The first thing you need to know about handling numeric features is that there are models which do and don't depend on feature scale. For now, we will broadly divide all models into tree-based models and non-tree-based models. For example, a decision tree classifier tries to find the most useful split for each feature, so it won't change its behavior or its predictions if we multiply a feature by a constant and retrain the model. On the other side, there are models which do depend on these kinds of transformations: models based on nearest neighbors, linear models, and neural networks.

Let's consider the following example. We have a binary classification task with two features. The objects in the picture belong to different classes: the red circle to class zero, the blue cross to class one, and finally, the class of the green object is unknown. Here, we will use a one-nearest-neighbor model to predict the class of the green object. We will measure distance using the squared distance, that is, the squared L2 (Euclidean) metric. Now, if we calculate the distances to the red circle and to the blue cross, we will see that our model will predict class one for the green object, because the blue cross of class one is much closer than the red circle. But if we multiply the first feature by 10, the red circle will become the closest object, and we will get the opposite prediction.
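To make this concrete, here is a minimal sketch of that effect; the point coordinates are made up for illustration and are not the exact ones from the slide:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: two labeled objects and one unknown point (coordinates invented).
X_train = np.array([[1.0, 5.0],   # "red circle", class 0
                    [3.0, 1.0]])  # "blue cross", class 1
y_train = np.array([0, 1])
x_new = np.array([[1.0, 1.0]])    # "green object"

knn = KNeighborsClassifier(n_neighbors=1)
print(knn.fit(X_train, y_train).predict(x_new))                    # -> [1]

# Multiply the first feature by 10 and retrain: distances change,
# the nearest neighbor changes, and so does the prediction.
scale = np.array([10.0, 1.0])
print(knn.fit(X_train * scale, y_train).predict(x_new * scale))    # -> [0]
```

The only thing that changed between the two fits is the scale of the first feature, yet the predicted class flips.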
Let's now consider two extreme cases. What will happen if we multiply the first feature by zero, or by one million? If the feature is multiplied by zero, then every object will have a feature value of zero, which results in KNN ignoring that feature. On the opposite side, if the feature is multiplied by one million, the slightest differences in that feature's values will impact the prediction, and this will result in KNN favoring that feature over all others.

Great, but what about other models? Linear models also experience difficulties with differently scaled features. First, we want regularization to be applied to linear model coefficients for all features in equal amounts, but in fact the regularization impact turns out to be proportional to feature scale. And second, gradient descent methods can go crazy without proper scaling. For the same reasons, neural networks are similar to linear models in their requirements for feature preprocessing.

It is important to understand that different feature scalings result in different model quality. In this sense, scaling is just another hyperparameter you need to optimize. The easiest way to do this is to rescale all features to the same scale. For example, to make the minimum of a feature equal to zero and the maximum equal to one, you can proceed in two steps: first, we subtract the minimum value, and second, we divide the result by the difference between the maximum and the minimum. This can be done with MinMaxScaler from sklearn.

Let's illustrate this with an example. We apply MinMaxScaler to two features from the Titanic dataset, Age and SibSp. Looking at the histograms, we see that the features have different scales: Age is between 0 and 80, while SibSp is between 0 and 8. Let's apply MinMaxScaling and see what it does. Indeed, we see that after this transformation, both the Age and SibSp features were successfully converted to the same value range of [0, 1]. Note that the distributions of values which we observe in the histograms didn't change.

To give you another example, we can apply a scaler named StandardScaler from sklearn, which basically first subtracts the mean value from the feature, and then divides the result by the feature's standard deviation. In this way, we get a standardized distribution with a mean of zero and a standard deviation of one. After either the MinMaxScaling or the StandardScaling transformation, feature impacts on non-tree-based models will be roughly similar.
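As a minimal sketch of both scalers (the CSV path is a placeholder, and the column names follow the Titanic dataset mentioned above):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Placeholder path; assumes a Titanic-style CSV with Age and SibSp columns.
df = pd.read_csv('train.csv')
features = df[['Age', 'SibSp']].dropna()   # drop missing ages for simplicity

# MinMaxScaler: (x - min) / (max - min), maps each feature to [0, 1].
minmax_scaled = MinMaxScaler().fit_transform(features)

# StandardScaler: (x - mean) / std, gives zero mean and unit standard deviation.
standard_scaled = StandardScaler().fit_transform(features)

print(minmax_scaled.min(axis=0), minmax_scaled.max(axis=0))        # ~[0 0] and [1 1]
print(standard_scaled.mean(axis=0), standard_scaled.std(axis=0))   # ~[0 0] and [1 1]
```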
Even more, if we want to use KNN, we can go one step further and recall that the bigger a feature is, the more important it will be for KNN. So, we can optimize the scaling parameter to boost features which seem more important to us and see if this helps.

When we work with linear models, there is another important point that influences model training results. I'm talking about outliers. For example, in this plot, we have one feature, X, and a target variable, Y. If we fit a simple linear model, its predictions can look just like the red line. But if we have one outlier with the X feature equal to some huge value, the predictions of the linear model will look more like the purple line. The same holds not only for feature values, but also for target values. For example, let's imagine we have a model trained on data with target values between zero and one. Let's think about what happens if we add a new sample to the training data with a target value of 1,000. When we retrain the model, it will predict abnormally high values. Obviously, we have to fix this somehow.

To protect linear models from outliers, we can clip feature values between two chosen values, a lower bound and an upper bound. We can choose them as some percentiles of that feature, for example, the 1st and 99th percentiles. This procedure of clipping is well known in financial data, and it is called winsorization.

Let's take a look at this histogram for an example. We see that the majority of feature values are between 0 and 400, but there is a number of outliers with values around -1,000. They can make life a lot harder for our nice and simple linear model. Let's clip this feature's value range. To do so, first we will calculate the lower bound and upper bound values as the feature's values at the 1st and 99th percentiles. After we clip the feature's values, we can see that the feature's distribution looks fine, and we hope that now this feature will be more useful for our model.
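A minimal sketch of this clipping (winsorization), with a synthetic feature standing in for the one from the histogram:

```python
import numpy as np

# Synthetic feature with a few extreme negative outliers, for illustration only.
rng = np.random.default_rng(0)
feature = np.concatenate([rng.uniform(0, 400, size=1000),
                          rng.uniform(-1000, -900, size=10)])

# Lower and upper bounds taken as the 1st and 99th percentiles.
lower_bound, upper_bound = np.percentile(feature, [1, 99])

# Clip (winsorize) the feature to that range.
clipped = np.clip(feature, lower_bound, upper_bound)

print(feature.min(), feature.max())   # before clipping: extreme values present
print(clipped.min(), clipped.max())   # after clipping: within the percentile bounds
```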
Another effective preprocessing for numeric features is the rank transformation. Basically, it sets the spaces between properly sorted values to be equal. This transformation, for example, can be a better option than MinMaxScaler if we have outliers, because the rank transformation moves the outliers closer to the other objects.

Let's understand rank using this example. If we apply rank to a sorted array, it will just change the values to their indices. Now, if we apply rank to a not-sorted array, it will sort this array, define a mapping between values and indices in the sorted array, and apply this mapping to the initial array. Linear models, KNN, and neural networks can benefit from this kind of transformation if we have no time to handle outliers manually. Rank can be imported as the rankdata function from scipy. One more important note about the rank transformation: to apply it to the test data, you need to store the created mapping from feature values to their rank values. Or, alternatively, you can concatenate train and test data before applying the rank transformation.

There is one more example of numeric feature preprocessing which often helps non-tree-based models and especially neural networks. You can apply a log transformation to your data, or, as another possibility, you can extract the square root of the data. Both of these transformations can be useful because they drive too-big values closer to the feature's average value. Along with this, the values near zero become a bit more distinguishable. Despite their simplicity, one of these transformations can improve your neural network's results significantly.
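Here is a minimal sketch of these three transformations on a made-up array with one large outlier; scipy.stats.rankdata, np.log1p, and np.sqrt are standard calls, the data itself is just for illustration:

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([1.0, 4.0, 2.0, 1000.0])   # made-up values with one huge outlier

# Rank transformation: equal spacing between sorted values, outlier pulled in.
print(rankdata(x))     # -> [1. 3. 2. 4.]

# Log transform, log(1 + x): squeezes large values toward the rest.
print(np.log1p(x))

# Square-root transform: a milder way to shrink large values.
print(np.sqrt(x))
```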
Another important point, which holds true for all preprocessings, is that sometimes it is beneficial to train a model on concatenated data frames produced by different preprocessings, or to mix models trained on differently preprocessed data. Again, linear models, KNN, and neural networks can benefit hugely from this.

To this end, we have discussed numeric feature preprocessing, how model choice impacts feature preprocessing, and what the most commonly used preprocessing methods are. Let's now move on to feature generation.

Feature generation is the process of creating new features using knowledge about the features and the task. It helps us by making model training simpler and more effective. Sometimes we can engineer these features using prior knowledge and logic. Sometimes we have to dig into the data, create and check hypotheses, and use the derived knowledge and our intuition to come up with new features. Here, we will discuss feature generation with prior knowledge, but as it turns out, an ability to dig into the data and derive insights is what makes a good competitor a great one. We will thoroughly analyze and illustrate this skill in the next lessons on exploratory data analysis.

For now, let's discuss examples of feature generation for numeric features. First, let's start with a simple one. If we have the columns Real Estate price and Real Estate squared area in the dataset, we can quickly add one more feature: price per square meter. Easy, and this seems quite reasonable. Or, let me give you another quick example from the Forest Cover Type Prediction dataset. If we have the horizontal distance to a water source and the vertical difference in heights between the point and the water source, we may as well add a combined feature indicating the direct distance to the water from this point.

Among other things, it is useful to know that additions, multiplications, divisions, and other feature interactions can be of help not only for linear models. For example, although gradient boosted decision trees are a very powerful model, they still experience difficulties with approximating multiplications and divisions, and adding such features explicitly can lead to a more robust model with a smaller number of trees.

The third example of feature generation for numeric features is also very interesting. Sometimes, if we have prices of products as a feature, we can add a new feature indicating the fractional part of these prices. For example, if some product costs 2.49, the fractional part of its price is 0.49. This feature can help the model utilize the differences in people's perception of these prices.
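A minimal sketch of these three generated features; the data frame and the column names (price, area, water_horizontal, water_vertical) are hypothetical stand-ins for the datasets described above:

```python
import numpy as np
import pandas as pd

# Hypothetical columns standing in for the examples above.
df = pd.DataFrame({
    'price': [249999.0, 2.49, 10.0],
    'area': [50.0, 1.0, 1.0],
    'water_horizontal': [30.0, 40.0, 0.0],
    'water_vertical': [40.0, 30.0, 10.0],
})

# Price per square meter.
df['price_per_m2'] = df['price'] / df['area']

# Direct (straight-line) distance to water from its horizontal and vertical components.
df['water_direct'] = np.sqrt(df['water_horizontal'] ** 2 + df['water_vertical'] ** 2)

# Fractional part of the price, e.g. 2.49 -> 0.49.
df['price_fraction'] = df['price'] % 1

print(df)
```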
Also, we can find similar patterns in tasks which require distinguishing between a human and a robot. For example, if we have some kind of financial data, like auctions, we could observe that people tend to set round numbers as prices, while bots place values like 0.935 followed by many more digits, a very long number. Or, if we are trying to find spambots on social networks, we can be sure that no human ever reads messages with an exact interval of one second.

Great, these three examples should have given you the idea that creativity and data understanding are the keys to productive feature generation.

All right, let's sum this up. In this video, we have discussed numeric features. First, the impact of feature preprocessing is different for different models: tree-based models don't depend on feature scaling, while non-tree-based models usually do. Second, we can treat scaling as an important hyperparameter in cases when the choice of scaling impacts prediction quality. And at last, we should remember that feature generation is powered by an understanding of the data. Remember this lesson, and this knowledge will surely help you in your next competition.