Hello everyone. This is Marios. Today I would like to show you the pipeline, or the approach, I have used to tackle more than 100 machine learning competitions on Kaggle, which obviously has helped me do quite well. Before I start, let me state that I'm not claiming this is the best pipeline out there; it's just the one I use. You might find some parts of it useful.

So roughly, the pipeline is as you see it on the screen. This is a summary, and we will go through it in more detail later on. But briefly: I spend the first day understanding the problem and making the necessary preparations to deal with it. Then maybe one or two days understanding a bit about the data: what my features are, what I have available, and other dynamics of the data, which will lead me to define a good cross-validation strategy; we will see later why this is important. Then, once I have specified the cross-validation strategy, I spend all the days until three or four days before the end of the competition iterating, doing different feature engineering and applying different machine learning models.

Now, something I need to highlight is that when I start this process, I do it by myself, shut off from the outside world. I close my ears and just focus on how I would tackle this problem. That's because I don't want to be affected by what the others are doing, because I might be able to find something that others will not. I mean, I might take a completely different approach, and this always gives me a gain when I later combine with the rest of the people, for example through merges or when I use other people's kernels. So I think this is important, because it gives you the chance to develop an intuitive approach to the data, and then also to leverage the fact that other people have different approaches, so you will get more diverse results. And in the last three to four days, I start exploring different ways to combine all the models I have built, in order to get the best results.
Now, if you have seen me in competitions, you might have noticed that in the last two or three days I make a rapid jump on the leaderboard, and that's exactly because I leave ensembling for the end. I normally don't do it before then; I have confidence that it will work, and I spend more time on feature engineering and modeling up until this point.

So, let's take all these steps one by one. Initially, I try to understand the problem. First of all, what type of problem is it? Is it image classification, so trying to find what object is present in an image? Is it sound classification, like which type of bird appears in a sound file? Is it text classification, like who has written a specific text, or what the text is about? Is it an optimization problem, like, given some constraints, how can I get from point A to point B? Is it a tabular dataset, that is, data you can represent in Excel, for example, with rows and columns and various types of features, categorical or numerical? Is it a time series problem; is time important? All these questions are very, very important, and that's why I look at the dataset and try to understand it, because the answers define in many ways what resources I will need, where I need to look, and what kind of hardware and software I will need. I also do this sort of preparation while checking the volume of the data, how much there is, because again, this defines what preparations I need to make in order to solve the problem.

Once I understand what type of problem it is, I need to reserve hardware to solve it. In many cases I can get away without using GPUs; just a few CPUs will do the trick. But in problems like image or sound classification, where you generally need to use deep learning, you definitely need to invest a lot in GPU, RAM, and disk space. That's why this screening is important: it makes me understand what type of machine I will need to solve the problem, and whether I have that processing power available at this point.
Once this has been specified, and I know how many CPUs and GPUs, how much RAM, and how much disk space I'm going to need, I need to prepare the software. Different software is suited to different types of problems. Keras and TensorFlow are obviously really good when solving image classification, sound classification, or text problems, but you can pretty much use them in any other problem as well. Then, if you use Python, you most probably need scikit-learn, XGBoost, and LightGBM; these are the pinnacle of machine learning right now.

And how do I set this up? Normally I create either an Anaconda environment or a virtual environment in general, and I have a different one for each competition, because it's easy to set up: you just create it, download the necessary packages, and then you're good to go. This is a good, clean way to keep everything tidy and to really know what you used and what you found useful in a particular competition. It's also a good reference for later on, when you have to do this again: you can find an environment that has worked well for this type of problem and possibly reuse it.

Another question I ask at this point is: what is the metric I'm being tested on? Is it a regression problem or a classification problem? Is it root mean squared error, or mean absolute error? I ask these questions because I try to find out whether there is any similar competition, with a similar type of data, that I may have dealt with in the past, because this will make the preparation much, much better: I'll go backwards, find what I had used in the past, and capitalize on it; so, reuse it and improve it. Even if I don't have something myself, I can find other similar competitions, or explanations of this type of problem on the web, and see what people have used, in order to integrate it into my approach.
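A minimal sketch of computing two of the metrics mentioned above, so your own experiments are scored the same way the leaderboard scores you; this assumes scikit-learn and uses made-up prediction arrays:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Hypothetical arrays standing in for a competition's ground truth and predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean squared error
mae = mean_absolute_error(y_true, y_pred)           # mean absolute error

print(f"RMSE: {rmse:.4f}")
print(f"MAE:  {mae:.4f}")
```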
So, this is what it means to understand the problem at this point. It's mostly about doing this screening in order to understand what type of preparation I need to do, and then actually doing that preparation, so that I can solve the problem competitively in terms of hardware, software, and other resources, including past experience with these types of problems.

Then I spend the next one or two days doing some exploratory data analysis. The first thing I do, assuming a tabular dataset, is look at all my features in the training and the test data and see how consistent they are. I tend to plot distributions and try to find any discrepancies: is this variable in the training dataset very different from the same variable in the test set? If there are discrepancies or differences, that is something I have to deal with; maybe I need to remove these variables or scale them in a specific way. In any case, big discrepancies can cause problems for the model, so that's why I spend some time here and do some plotting in order to detect these differences. (Minimal sketches of these checks appear at the end of this part.)

The other thing I do is plot features versus the target variable, and possibly versus time, if time is available. This helps me understand the effect of time, how important time or date is in this dataset, and at the same time which are the most predictive inputs, the most predictive variables. This is important because it generally gives me intuition about the problem. How exactly it helps is not always clear: sometimes it may help me define a cross-validation strategy, or help me create some really good features, but in general this kind of knowledge really helps me understand the problem.

I also tend to create cross-tabs, for example between the categorical variables and the target variable, and to compute predictability metrics like information value and chi-square, in order to see what's useful and whether I can make hypotheses about the data: whether I understand the data and how the features relate to the target variable. The more understanding I build at this point, the more likely it is to lead to better features and better models applied to this data.

Also, while I do this, I like to bin numerical features into bands in order to understand whether there are nonlinearities. By nonlinearities I mean, for example, that when the value of a feature is low the target variable is high, and then as the value increases the target variable decreases; in other words, whether there are unusual relationships, trends, patterns, or correlations between the features and the target variable. This helps me see how best to handle them later on, and gives me an intuition about which types of models would work better.
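To make the train/test consistency check concrete, here is a minimal sketch (my own illustration, with synthetic data standing in for a real train and test column). It overlays the two distributions and uses a two-sample Kolmogorov-Smirnov test as a rough numeric summary of the discrepancy:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Hypothetical feature whose distribution has drifted between train and test.
train_col = rng.normal(loc=0.0, size=5000)
test_col = rng.normal(loc=0.3, size=5000)

# Overlay the two distributions to eyeball any discrepancy.
plt.hist(train_col, bins=50, alpha=0.5, density=True, label="train")
plt.hist(test_col, bins=50, alpha=0.5, density=True, label="test")
plt.legend()
plt.show()

# Two-sample Kolmogorov-Smirnov test: a large statistic (tiny p-value)
# flags a feature that behaves differently in train and test.
stat, p_value = ks_2samp(train_col, test_col)
print(f"KS statistic={stat:.3f}, p-value={p_value:.3g}")
```

Features that look very different across the two sets are the candidates for removal or rescaling mentioned above.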
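The cross-tab and chi-square idea can be sketched in a similar way; again the data here is made up, with a hypothetical categorical column and a binary target:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
# Hypothetical categorical feature and binary target.
df = pd.DataFrame({
    "city": rng.choice(["a", "b", "c"], size=2000),
    "target": rng.integers(0, 2, size=2000),
})

# Cross-tab of category versus target class counts.
table = pd.crosstab(df["city"], df["target"])
print(table)

# Chi-square test of independence: a small p-value suggests the category
# frequencies differ across target classes, i.e. the feature carries signal.
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p-value={p_value:.3g}")
```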
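And for the binning check, one possible sketch: cut a numeric feature into quantile bands and look at the mean target per band, so that rise-then-fall patterns like the one simulated below become visible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=5000)                 # hypothetical numeric feature
y = -(x ** 2) + rng.normal(scale=0.5, size=5000)  # deliberately nonlinear target

df = pd.DataFrame({"feature": x, "target": y})

# Cut the feature into 10 quantile bands and inspect the mean target per band.
# A non-monotonic profile signals a nonlinearity that linear models would
# miss but tree models or binned features can capture.
df["band"] = pd.qcut(df["feature"], q=10)
print(df.groupby("band", observed=True)["target"].mean())
```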
Once I have understood the data to some extent, it's time for me to define a cross-validation strategy. I think this is a really important step, and there have been competitions where people were able to win just because they found the best way to validate, that is, to create a good cross-validation strategy. By cross-validation strategy, I mean creating a validation approach that best resembles what you're being tested on. If you manage to create this internally, then you can build many different models and create many different features, and for anything you do, you can have confidence about whether it's working or not, provided you've built the cross-validation strategy in a way that is consistent with what you're being tested on. Consistency is the key word here.

The first thing I ask is: is time important in this data? Do I have a feature called date or time? If time is important, I need to switch to time-based validation: always have past data predicting future data, and even the intervals need to be similar to those of the test data. So if the test data is three months in the future, I need to build my training and validation sets to account for this time interval; my validation data always needs to be three months in the future compared to the training data. You need to be consistent in order to get results that match what you are being tested on as closely as possible.
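To make this concrete, here is a minimal sketch of such a time-based split, assuming a pandas DataFrame with a date column; the synthetic data and the three-month gap simply mirror the example above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical dataset: two years of daily rows with a date column.
df = pd.DataFrame({
    "date": pd.date_range("2016-01-01", periods=730, freq="D"),
    "feature": rng.normal(size=730),
    "target": rng.normal(size=730),
})

# If the real test set sits three months after the training period, mirror
# that gap internally: train on the past, validate on the following three months.
cutoff = df["date"].max() - pd.DateOffset(months=3)
train_fold = df[df["date"] <= cutoff]
valid_fold = df[df["date"] > cutoff]

print(len(train_fold), "training rows,", len(valid_fold), "validation rows")
```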
The other thing I ask is: are there different entities between the train and the test data? Imagine you have certain customers in the training data and different customers in the test data. Ideally, you need to formulate your cross-validation strategy so that in the validation data you always have customers that are different from those in the training data; otherwise you are not really testing in a fair way, and your validation method would not be consistent with the test data. Obviously, if you try to predict a customer while you have that same customer in your training data, the prediction is biased compared to the test data, where you don't have this information available. These are the kinds of questions you need to ask yourself at this point: am I building a validation which is really consistent with what I am being tested on?

Another case that often arises is that the training and the test data have been split completely at random: someone just shuffled the data, took a random part and put it in the training set, and left the rest for the test set. In that case, any random type of cross-validation can work, for example a simple random k-fold.

There are also cases where you may have to use a combination of all the above: you have strong temporal elements, at the same time you have different entities, so different customers to predict for the past and the future, and at the same time there is a random element too. You might need to incorporate all of them to make a good strategy.

What I do in practice is often start with a random validation and see how it fares against the test leaderboard: how consistent the result is with what I have internally, and whether improvements in my validation lead to improvements on the leaderboard. If that doesn't happen, I investigate more deeply and try to understand why. It may be that the time element is very strong and I need to take it into account, or that there are different entities between the train and test data. I ask these kinds of questions in order to formulate a better validation strategy.
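One common way to express the entity idea in code is scikit-learn's GroupKFold, which keeps each group (here, a hypothetical customer id) entirely in either the training or the validation fold; a plain shuffled KFold covers the purely random case discussed above. A minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # hypothetical feature matrix
y = rng.integers(0, 2, size=1000)              # hypothetical binary target
customers = rng.integers(0, 100, size=1000)    # hypothetical customer ids

# Grouped validation: each customer lands entirely in train or validation,
# mimicking a test set made of unseen customers.
for train_idx, valid_idx in GroupKFold(n_splits=5).split(X, y, groups=customers):
    # Sanity check: no customer leaks across the split.
    assert set(customers[train_idx]).isdisjoint(customers[valid_idx])

# Purely random split: fine when train and test were sampled randomly.
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    pass  # fit and evaluate a model here
```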
Once the validation strategy has been defined, I start creating many different features. I'm sorry for bombarding you with loads of information in one slide, but I wanted it to be standalone: it is meant to show you the different types of feature engineering you can use for different types of problems, along with suggestions for past competitions to look up that were quite representative of each type. You can ignore the details for now and look at it later.

The main point is that different problems require different feature engineering, and when I say feature engineering, I include data cleaning and preparation as well: how you handle missing values, and the features you generate out of all this. The thing is, every problem has its own corpus of techniques used to derive or create new features. It's not easy to know everything; sometimes it's too much, and I don't remember it all myself. So what I tend to do is go back to similar competitions and see what people are using, or what people have used in the past, and incorporate it into my code. If I have dealt with this or a similar problem in the past, I look at my code to see what I did then, while still looking for ways to improve it. I think that's the best way to be able to handle any problem.

The good thing is that a lot of the feature engineering can be automated. You have probably already seen that, but as long as your cross-validation strategy is consistent with the test data and reliable, you can potentially try all sorts of transformations and see how they work in your validation environment. If they work well, you can be confident that this type of feature engineering is useful and use it for further modeling; if not, you discard it and try something else. Also, the combinations of what you can do in terms of feature engineering can be quite vast for different types of problems, so obviously time is a factor here, and scalability too. You need to use your resources well in order to search as much as you can and get the best outcome. This is what I do: normally, if I have more time to do this feature engineering in a competition, I tend to do better, because I explore more things.
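As a rough illustration of what this automation can look like (my own sketch, not a prescription from the lecture), the loop below tries a few candidate transformations of a synthetic feature and keeps each one only if it improves the cross-validated score:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(1000, 3))) + 0.1   # hypothetical positive features
y = np.log(X[:, 0]) + rng.normal(scale=0.1, size=1000)

# Candidate transformations to try on the first column.
candidates = {"log": np.log, "sqrt": np.sqrt, "square": np.square}

model = RandomForestRegressor(n_estimators=100, random_state=0)
baseline = cross_val_score(model, X, y, cv=5).mean()
print(f"baseline CV score: {baseline:.4f}")

for name, fn in candidates.items():
    # Append the transformed column and re-score with the same CV scheme;
    # keep the new feature only if the validation score improves.
    X_new = np.column_stack([X, fn(X[:, 0])])
    score = cross_val_score(model, X_new, y, cv=5).mean()
    verdict = "keep" if score > baseline else "discard"
    print(f"{name}: CV score {score:.4f} -> {verdict}")
```

The key design point, as stressed above, is that the same reliable validation scheme scores every candidate, so the search can run unattended without fooling itself.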