Hi. In this video, we will cover basic approaches to feature preprocessing and feature generation for numeric features. We will understand how model choice impacts feature preprocessing, we will identify the preprocessing methods that are used most often, and we will discuss feature generation and go through several examples.

Let's start with preprocessing. The first thing you need to know about handling numeric features is that there are models which do and don't depend on feature scale. For now, we will broadly divide all models into tree-based models and non-tree-based models. For example, a decision tree classifier tries to find the most useful split for each feature, so it won't change its behavior or its predictions if we multiply a feature by a constant and retrain the model. On the other side, there are models which do depend on these kinds of transformations: models based on nearest neighbors, linear models, and neural networks.

Let's consider the following example. We have a binary classification task with two features. The objects in the picture belong to different classes: the red circle to class zero, the blue cross to class one, and finally, the class of the green object is unknown. Here, we will use a one-nearest-neighbor model to predict the class of the green object. We will measure distance using the squared distance, that is, the squared L2 (Euclidean) metric. Now, if we calculate the distances to the red circle and to the blue cross, we will see that our model will predict class one for the green object, because the blue cross of class one is much closer than the red circle. But if we multiply the first feature by 10, the red circle will become the closest object, and we will get the opposite prediction.
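To make this concrete, here is a minimal sketch of that effect; the point coordinates are made up for illustration and are not the exact ones from the slide:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: two labeled objects and one unknown point (coordinates invented).
X_train = np.array([[1.0, 5.0],   # "red circle", class 0
                    [3.0, 1.0]])  # "blue cross", class 1
y_train = np.array([0, 1])
x_new = np.array([[1.0, 1.0]])    # "green object"

knn = KNeighborsClassifier(n_neighbors=1)
print(knn.fit(X_train, y_train).predict(x_new))                    # -> [1]

# Multiply the first feature by 10 and retrain: distances change,
# the nearest neighbor changes, and so does the prediction.
scale = np.array([10.0, 1.0])
print(knn.fit(X_train * scale, y_train).predict(x_new * scale))    # -> [0]
```

The only thing that changed between the two fits is the scale of the first feature, yet the predicted class flips.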
Let's now consider two extreme cases. What will happen if we multiply the first feature by zero, or by one million? If the feature is multiplied by zero, then every object will have a feature value of zero, which results in KNN ignoring that feature. On the opposite side, if the feature is multiplied by one million, the slightest differences in that feature's values will impact the prediction, and this will result in KNN favoring that feature over all others.

Great, but what about other models? Linear models also experience difficulties with differently scaled features. First, we want regularization to be applied to linear model coefficients for all features in equal amounts, but in fact the regularization impact turns out to be proportional to feature scale. And second, gradient descent methods can go crazy without proper scaling. For the same reasons, neural networks are similar to linear models in their requirements for feature preprocessing.

It is important to understand that different feature scalings result in different model quality. In this sense, scaling is just another hyperparameter you need to optimize. The easiest way to do this is to rescale all features to the same scale. For example, to make the minimum of a feature equal to zero and the maximum equal to one, you can proceed in two steps: first, we subtract the minimum value, and second, we divide the result by the difference between the maximum and the minimum. This can be done with MinMaxScaler from sklearn.

Let's illustrate this with an example. We apply MinMaxScaler to two features from the Titanic dataset, Age and SibSp. Looking at the histograms, we see that the features have different scales: Age is between 0 and 80, while SibSp is between 0 and 8. Let's apply MinMaxScaling and see what it does. Indeed, we see that after this transformation, both the Age and SibSp features were successfully converted to the same value range of [0, 1]. Note that the distributions of values which we observe in the histograms didn't change.

To give you another example, we can apply a scaler named StandardScaler from sklearn, which basically first subtracts the mean value from the feature, and then divides the result by the feature's standard deviation. In this way, we get a standardized distribution with a mean of zero and a standard deviation of one. After either the MinMaxScaling or the StandardScaling transformation, feature impacts on non-tree-based models will be roughly similar.
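As a minimal sketch of both scalers (the CSV path is a placeholder, and the column names follow the Titanic dataset mentioned above):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Placeholder path; assumes a Titanic-style CSV with Age and SibSp columns.
df = pd.read_csv('train.csv')
features = df[['Age', 'SibSp']].dropna()   # drop missing ages for simplicity

# MinMaxScaler: (x - min) / (max - min), maps each feature to [0, 1].
minmax_scaled = MinMaxScaler().fit_transform(features)

# StandardScaler: (x - mean) / std, gives zero mean and unit standard deviation.
standard_scaled = StandardScaler().fit_transform(features)

print(minmax_scaled.min(axis=0), minmax_scaled.max(axis=0))        # ~[0 0] and [1 1]
print(standard_scaled.mean(axis=0), standard_scaled.std(axis=0))   # ~[0 0] and [1 1]
```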
Even more, if we want to use KNN, we can go one step further and recall that the bigger a feature is, the more important it will be for KNN. So, we can optimize the scaling parameter to boost features which seem more important to us and see if this helps.

When we work with linear models, there is another important point that influences model training results. I'm talking about outliers. For example, in this plot, we have one feature, X, and a target variable, Y. If we fit a simple linear model, its predictions can look just like the red line. But if we have one outlier with the X feature equal to some huge value, the predictions of the linear model will look more like the purple line. The same holds not only for feature values, but also for target values. For example, let's imagine we have a model trained on data with target values between zero and one. Let's think about what happens if we add a new sample to the training data with a target value of 1,000. When we retrain the model, it will predict abnormally high values. Obviously, we have to fix this somehow.

To protect linear models from outliers, we can clip feature values between two chosen values, a lower bound and an upper bound. We can choose them as some percentiles of that feature, for example, the 1st and 99th percentiles. This procedure of clipping is well known in financial data, and it is called winsorization.

Let's take a look at this histogram for an example. We see that the majority of feature values are between 0 and 400, but there is a number of outliers with values around -1,000. They can make life a lot harder for our nice and simple linear model. Let's clip this feature's value range. To do so, first we will calculate the lower bound and upper bound values as the feature's values at the 1st and 99th percentiles. After we clip the feature's values, we can see that the feature's distribution looks fine, and we hope that now this feature will be more useful for our model.
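A minimal sketch of this clipping (winsorization), with a synthetic feature standing in for the one from the histogram:

```python
import numpy as np

# Synthetic feature with a few extreme negative outliers, for illustration only.
rng = np.random.default_rng(0)
feature = np.concatenate([rng.uniform(0, 400, size=1000),
                          rng.uniform(-1000, -900, size=10)])

# Lower and upper bounds taken as the 1st and 99th percentiles.
lower_bound, upper_bound = np.percentile(feature, [1, 99])

# Clip (winsorize) the feature to that range.
clipped = np.clip(feature, lower_bound, upper_bound)

print(feature.min(), feature.max())   # before clipping: extreme values present
print(clipped.min(), clipped.max())   # after clipping: within the percentile bounds
```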
Another effective preprocessing for numeric features is the rank transformation. Basically, it sets the spaces between properly sorted values to be equal. This transformation, for example, can be a better option than MinMaxScaler if we have outliers, because the rank transformation moves the outliers closer to the other objects.

Let's understand rank using this example. If we apply rank to a sorted array, it will just change the values to their indices. Now, if we apply rank to a not-sorted array, it will sort this array, define a mapping between values and indices in the sorted array, and apply this mapping to the initial array. Linear models, KNN, and neural networks can benefit from this kind of transformation if we have no time to handle outliers manually. Rank can be imported as the rankdata function from scipy. One more important note about the rank transformation: to apply it to the test data, you need to store the created mapping from feature values to their rank values. Or, alternatively, you can concatenate train and test data before applying the rank transformation.

There is one more example of numeric feature preprocessing which often helps non-tree-based models and especially neural networks. You can apply a log transformation to your data, or, as another possibility, you can extract the square root of the data. Both of these transformations can be useful because they drive too-big values closer to the feature's average value. Along with this, the values near zero become a bit more distinguishable. Despite their simplicity, one of these transformations can improve your neural network's results significantly.
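Here is a minimal sketch of these three transformations on a made-up array with one large outlier; scipy.stats.rankdata, np.log1p, and np.sqrt are standard calls, the data itself is just for illustration:

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([1.0, 4.0, 2.0, 1000.0])   # made-up values with one huge outlier

# Rank transformation: equal spacing between sorted values, outlier pulled in.
print(rankdata(x))     # -> [1. 3. 2. 4.]

# Log transform, log(1 + x): squeezes large values toward the rest.
print(np.log1p(x))

# Square-root transform: a milder way to shrink large values.
print(np.sqrt(x))
```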
Another important point, which holds true for all preprocessings, is that sometimes it is beneficial to train a model on concatenated data frames produced by different preprocessings, or to mix models trained on differently preprocessed data. Again, linear models, KNN, and neural networks can benefit hugely from this.

To this end, we have discussed numeric feature preprocessing, how model choice impacts feature preprocessing, and what the most commonly used preprocessing methods are. Let's now move on to feature generation.

Feature generation is the process of creating new features using knowledge about the features and the task. It helps us by making model training simpler and more effective. Sometimes we can engineer these features using prior knowledge and logic. Sometimes we have to dig into the data, create and check hypotheses, and use the derived knowledge and our intuition to come up with new features. Here, we will discuss feature generation with prior knowledge, but as it turns out, an ability to dig into the data and derive insights is what makes a good competitor a great one. We will thoroughly analyze and illustrate this skill in the next lessons on exploratory data analysis.

For now, let's discuss examples of feature generation for numeric features. First, let's start with a simple one. If we have the columns Real Estate price and Real Estate squared area in the dataset, we can quickly add one more feature: price per square meter. Easy, and this seems quite reasonable. Or, let me give you another quick example from the Forest Cover Type Prediction dataset. If we have the horizontal distance to a water source and the vertical difference in heights between the point and the water source, we may as well add a combined feature indicating the direct distance to the water from this point.

Among other things, it is useful to know that additions, multiplications, divisions, and other feature interactions can be of help not only for linear models. For example, although gradient boosted decision trees are a very powerful model, they still experience difficulties with approximating multiplications and divisions, and adding such features explicitly can lead to a more robust model with a smaller number of trees.

The third example of feature generation for numeric features is also very interesting. Sometimes, if we have prices of products as a feature, we can add a new feature indicating the fractional part of these prices. For example, if some product costs 2.49, the fractional part of its price is 0.49. This feature can help the model utilize the differences in people's perception of these prices.
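A minimal sketch of these three generated features; the data frame and the column names (price, area, water_horizontal, water_vertical) are hypothetical stand-ins for the datasets described above:

```python
import numpy as np
import pandas as pd

# Hypothetical columns standing in for the examples above.
df = pd.DataFrame({
    'price': [249999.0, 2.49, 10.0],
    'area': [50.0, 1.0, 1.0],
    'water_horizontal': [30.0, 40.0, 0.0],
    'water_vertical': [40.0, 30.0, 10.0],
})

# Price per square meter.
df['price_per_m2'] = df['price'] / df['area']

# Direct (straight-line) distance to water from its horizontal and vertical components.
df['water_direct'] = np.sqrt(df['water_horizontal'] ** 2 + df['water_vertical'] ** 2)

# Fractional part of the price, e.g. 2.49 -> 0.49.
df['price_fraction'] = df['price'] % 1

print(df)
```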
Also, we can find similar patterns in tasks which require distinguishing between a human and a robot. For example, if we have some kind of financial data, like auctions, we could observe that people tend to set round numbers as prices, while bots place values like 0.935 followed by many more digits, a very long number. Or, if we are trying to find spambots on social networks, we can be sure that no human ever reads messages with an exact interval of one second.

Great, these three examples should have given you the idea that creativity and data understanding are the keys to productive feature generation.

All right, let's sum this up. In this video, we have discussed numeric features. First, the impact of feature preprocessing is different for different models: tree-based models don't depend on feature scaling, while non-tree-based models usually do. Second, we can treat scaling as an important hyperparameter in cases when the choice of scaling impacts prediction quality. And at last, we should remember that feature generation is powered by an understanding of the data. Remember this lesson, and this knowledge will surely help you in your next competition.