1
00:00:00,330 --> 00:00:06,490
And the modeling is pretty much the same story.

2
00:00:06,490 --> 00:00:11,690
So, each type of problem has its own type of model that works best.

3
00:00:11,690 --> 00:00:14,515
Now, I don't want to go through that list again,

4
00:00:14,515 --> 00:00:18,680
I put it here so that you can use it for reference.

5
00:00:18,680 --> 00:00:25,240
But, again, the way you work this out is you look for literature,

6
00:00:25,240 --> 00:00:29,240
you search other previous competitions that were

7
00:00:29,240 --> 00:00:33,380
similar and you try to find which type of problem,

8
00:00:33,380 --> 00:00:37,585
which type of model works best for this type of problem.

9
00:00:37,585 --> 00:00:42,610
And it's no surprise that for a typical dataset,

10
00:00:42,610 --> 00:00:44,550
when I say typical dataset I mean,

11
00:00:44,550 --> 00:00:49,140
a tabular dataset, gradient boosting machines in the form of [inaudible] tend to

12
00:00:49,140 --> 00:00:57,705
work best, whereas for problems like image classification or sound classification,

13
00:00:57,705 --> 00:01:04,925
deep learning in the form of convolutional neural networks tend to work better.

14
00:01:04,925 --> 00:01:11,055
So, this is roughly what you need to know.

15
00:01:11,055 --> 00:01:14,280
New techniques are being developed so,

16
00:01:14,280 --> 00:01:18,640
I think your best chance here or what I have used in order to

17
00:01:18,640 --> 00:01:23,190
do well in the past was knowing what tends to work well with each problem,

18
00:01:23,190 --> 00:01:28,945
and going backwards and trying to find other code or

19
00:01:28,945 --> 00:01:32,010
other implementations on similar problems in order

20
00:01:32,010 --> 00:01:36,900
to integrate it with mine and try to get a better result.

21
00:01:36,900 --> 00:01:40,900
I should mention that each of

22
00:01:40,900 --> 00:01:44,920
the previous models needs to be tuned, sometimes differently.

23
00:01:44,920 --> 00:01:47,150
So you need to spend time within

24
00:01:47,150 --> 00:01:51,540
this cross-validation strategy in order to find the best parameters,

25
00:01:51,540 --> 00:01:56,770
and then we move onto Ensembling.

26
00:01:56,770 --> 00:02:00,830
Every time you apply

27
00:02:00,830 --> 00:02:03,340
your cross-validation procedure with

28
00:02:03,340 --> 00:02:06,575
a different feature engineering and a different type of model,

29
00:02:06,575 --> 00:02:10,015
each time, you save two types of predictions,

30
00:02:10,015 --> 00:02:15,655
you save predictions for the validation data and you save predictions for the test data.
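
The saving step described above can be sketched roughly as follows. This is a minimal illustration, not the speaker's code: the fold assignment, the function names and the stand-in "model" (which just predicts the training mean) are all assumptions made for the example.

```python
from statistics import mean


def fit_mean_model(X_train, y_train):
    """Hypothetical stand-in model: always predicts the training mean."""
    m = mean(y_train)
    return lambda rows: [m for _ in rows]


def cross_val_predictions(X, y, X_test, n_folds=3, fit=fit_mean_model):
    """Return (validation predictions, test predictions) for one model."""
    n = len(X)
    oof = [None] * n                # one out-of-fold prediction per training row
    test_pred = [0.0] * len(X_test)  # test predictions averaged over the folds
    for fold in range(n_folds):
        # Simple interleaved fold assignment (an assumption for this sketch).
        valid_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        # Predictions for rows the model has not seen: the validation file.
        for i, p in zip(valid_idx, predict([X[i] for i in valid_idx])):
            oof[i] = p
        # Predictions for the test rows: the test file.
        for j, p in enumerate(predict(X_test)):
            test_pred[j] += p / n_folds
    return oof, test_pred


X = [[0], [1], [2], [3], [4], [5]]
y = [1, 2, 3, 4, 5, 6]
X_test = [[6], [7]]
oof, test_pred = cross_val_predictions(X, y, X_test)
print(oof, test_pred)
```

Each model and feature set would produce one such pair of prediction lists, which is what gets saved and later combined.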

31
00:02:15,655 --> 00:02:20,810
So now that you have saved all these predictions and by the way this is the point

32
00:02:20,810 --> 00:02:25,780
where, if you collaborate with others, they tend to send you their predictions,

33
00:02:25,780 --> 00:02:29,150
and you'll be surprised that sometimes collaboration is just this.

34
00:02:29,150 --> 00:02:36,130
So people just sending these prediction files for the validation and the test data.

35
00:02:36,130 --> 00:02:40,190
So now you can find the best way to combine

36
00:02:40,190 --> 00:02:44,300
these models in order to get the best results.

37
00:02:44,300 --> 00:02:47,720
And since you already have predictions for the validation data,

38
00:02:47,720 --> 00:02:51,300
you know the target variable for the validation data,

39
00:02:51,300 --> 00:02:55,120
so you can explore different ways to combine them.

40
00:02:55,120 --> 00:02:57,735
The methods could be simple,

41
00:02:57,735 --> 00:02:59,800
could be an average,

42
00:02:59,800 --> 00:03:05,990
or a weighted average, or it can go up to multilayer stacking in general.
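
As a rough sketch of the simplest case, here is a weighted average of two models where the weight is chosen on the validation predictions and then reused on the test predictions. The metric (RMSE), the grid of candidate weights and all names are assumptions for illustration only.

```python
def rmse(y_true, y_pred):
    """Root mean squared error (assumed metric for this sketch)."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def blend(pred_a, pred_b, w):
    """Weighted average of two prediction lists; w is the weight of model A."""
    return [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]


def best_blend_weight(y_valid, valid_a, valid_b, steps=20):
    """Try weights 0, 1/steps, ..., 1 and keep the one with the lowest validation error."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: rmse(y_valid, blend(valid_a, valid_b, w)))


# Toy data: model A overshoots by 1, model B undershoots by 1.
y_valid = [1.0, 2.0, 3.0, 4.0]
valid_a = [2.0, 3.0, 4.0, 5.0]
valid_b = [0.0, 1.0, 2.0, 3.0]
w = best_blend_weight(y_valid, valid_a, valid_b)

# The weight found on validation data is then applied to the saved test predictions.
test_a = [6.0, 7.0]
test_b = [4.0, 5.0]
print(w, blend(test_a, test_b, w))
```

A plain average is the special case `w = 0.5`; stacking replaces the fixed weights with a model trained on the validation predictions.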

43
00:03:05,990 --> 00:03:10,455
Generally, what you need to know is that from my experience,

44
00:03:10,455 --> 00:03:15,045
smaller data requires simpler ensemble techniques like averaging.

45
00:03:15,045 --> 00:03:19,450
And also what tends to help is to look at the correlation between predictions.

46
00:03:19,450 --> 00:03:22,880
So find models that work well,

47
00:03:22,880 --> 00:03:25,060
but they tend to be quite diverse.

48
00:03:25,060 --> 00:03:28,385
So, when you use Pearson correlation,

49
00:03:28,385 --> 00:03:30,480
the correlation is not very high.

50
00:03:30,480 --> 00:03:34,735
That means they are likely to bring new information,

51
00:03:34,735 --> 00:03:39,530
and so when you combine you get the most out of it.
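
One way to check this diversity is to compute the correlation between every pair of saved prediction files. The sketch below uses Pearson correlation as one common choice; the model names and numbers are made up for the example.

```python
def pearson(a, b):
    """Pearson correlation between two equally long prediction lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def correlation_pairs(preds):
    """Correlation for every pair of models; preds maps model name -> predictions."""
    names = sorted(preds)
    return {
        (m, n): pearson(preds[m], preds[n])
        for i, m in enumerate(names)
        for n in names[i + 1:]
    }


# Hypothetical validation predictions from three models.
preds = {
    "gbm": [0.1, 0.5, 0.9, 0.3],
    "gbm_v2": [0.12, 0.52, 0.88, 0.31],
    "nn": [0.3, 0.4, 0.7, 0.6],
}
for pair, corr in sorted(correlation_pairs(preds).items(), key=lambda kv: kv[1]):
    print(pair, round(corr, 3))
```

Pairs at the top of the printed list are the least correlated, so they are the most likely to add new information when combined.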

52
00:03:39,530 --> 00:03:42,735
But if you have bigger data,

53
00:03:42,735 --> 00:03:45,810
you can pretty much try all sorts of things.

54
00:03:45,810 --> 00:03:48,110
What I like to think is that,

55
00:03:48,110 --> 00:03:50,030
when you have really big data,

56
00:03:50,030 --> 00:03:54,380
the stacking process repeats the modeling process.

57
00:03:54,380 --> 00:03:57,830
By that, I mean that you have

58
00:03:57,830 --> 00:04:02,305
a new set of features; this time they are predictions of models,

59
00:04:02,305 --> 00:04:06,825
but you can apply the same process you have used before.

60
00:04:06,825 --> 00:04:08,945
So you can do feature engineering,

61
00:04:08,945 --> 00:04:15,820
you can create new features or you can remove the features/predictions that

62
00:04:15,820 --> 00:04:19,340
you no longer need and you can use this in

63
00:04:19,340 --> 00:04:24,150
order to improve the results for your validation data.

64
00:04:24,150 --> 00:04:27,870
This process can be quite exhaustive,

65
00:04:27,870 --> 00:04:33,075
but well, again, it can be automated to some extent.

66
00:04:33,075 --> 00:04:35,965
So, the more time you have here,

67
00:04:35,965 --> 00:04:39,315
most probably the better you will do.

68
00:04:39,315 --> 00:04:41,680
But from my experience, 2 or

69
00:04:41,680 --> 00:04:47,680
3 days is good in order to get the best out of all the models you have built, and it depends

70
00:04:47,680 --> 00:04:49,785
obviously on the volume of data or

71
00:04:49,785 --> 00:04:55,170
volume of predictions you have generated up until this point.

72
00:04:55,170 --> 00:05:04,455
At this point I would like to share a few thoughts about collaboration.

73
00:05:04,455 --> 00:05:08,475
Many people have asked me this and I think this is a good point to share.

74
00:05:08,475 --> 00:05:12,670
These ideas have greatly helped me to do well in competitions.

75
00:05:12,670 --> 00:05:16,070
The first thing is that it makes things more fun.

76
00:05:16,070 --> 00:05:17,570
I mean you are not alone,

77
00:05:17,570 --> 00:05:21,355
you're with other people and that's always more energizing,

78
00:05:21,355 --> 00:05:24,300
it's always more interesting, it's more fun,

79
00:05:24,300 --> 00:05:28,115
you can communicate with the others through tools like Skype,

80
00:05:28,115 --> 00:05:33,900
and yeah, I think it's more collaborative; as the word says, it is better.

81
00:05:33,900 --> 00:05:36,100
You learn more.

82
00:05:36,100 --> 00:05:38,810
I mean you can be really good,

83
00:05:38,810 --> 00:05:43,565
but, you know, you always always learn from others.

84
00:05:43,565 --> 00:05:46,660
No way to know everything yourself.

85
00:05:46,660 --> 00:05:50,600
So it's really good to be able to share points with other people,

86
00:05:50,600 --> 00:05:53,990
see what they do, learn from them and become

87
00:05:53,990 --> 00:05:59,085
better and grow as a data scientist, as a modeler.

88
00:05:59,085 --> 00:06:06,630
From my experience you score far better than trying to solve a problem alone,

89
00:06:06,630 --> 00:06:12,200
and I think this happens mainly for two reasons.

90
00:06:12,200 --> 00:06:14,895
There are more but these are the main two.

91
00:06:14,895 --> 00:06:17,900
First you can cover more ground because,

92
00:06:17,900 --> 00:06:22,320
you can say, you can focus on ensembling,

93
00:06:22,320 --> 00:06:26,510
I will focus on feature engineering or you will focus on

94
00:06:26,510 --> 00:06:31,075
tuning this type of model and I will focus on another type of model.

95
00:06:31,075 --> 00:06:33,275
So, you can generally cover more ground.

96
00:06:33,275 --> 00:06:37,530
You can divide tasks and you can search,

97
00:06:37,530 --> 00:06:42,835
you can cover more ground in terms of the possible things you can try in a competition.

98
00:06:42,835 --> 00:06:48,105
The second thing is that every person sees the problem from different angles.

99
00:06:48,105 --> 00:06:54,140
So, that's very likely to generate more diverse predictions.

100
00:06:54,140 --> 00:06:58,210
So something we do is although we kind of

101
00:06:58,210 --> 00:07:03,435
define the strategy together when we form teams,

102
00:07:03,435 --> 00:07:06,330
then we would like to work for maybe

103
00:07:06,330 --> 00:07:10,220
one week separately without discussing with one another,

104
00:07:10,220 --> 00:07:13,915
because this helps to create diversity.

105
00:07:13,915 --> 00:07:17,615
Otherwise, if we over-discuss this,

106
00:07:17,615 --> 00:07:20,895
we might generate pretty much the same things.

107
00:07:20,895 --> 00:07:23,145
So, in other words,

108
00:07:23,145 --> 00:07:28,740
our solutions might be too correlated to add more value.

109
00:07:28,740 --> 00:07:33,150
So, this is a good way in order to

110
00:07:33,150 --> 00:07:38,815
leverage the different mindset each person has in solving these problems.

111
00:07:38,815 --> 00:07:40,370
So, for one week,

112
00:07:40,370 --> 00:07:45,575
each one works separately and then after some point,

113
00:07:45,575 --> 00:07:50,565
we start combining or work more closely.

114
00:07:50,565 --> 00:07:57,915
I would advise people to start collaborating after getting some experience,

115
00:07:57,915 --> 00:08:03,940
and I say here two or three competitions just because Kaggle has some rules.

116
00:08:03,940 --> 00:08:07,170
Sometimes, it is easy to make mistakes.

117
00:08:07,170 --> 00:08:11,050
I think it's better to understand the environment,

118
00:08:11,050 --> 00:08:14,840
the competition environment well before

119
00:08:14,840 --> 00:08:18,585
exploring these options in order to make certain that,

120
00:08:18,585 --> 00:08:21,135
no mistakes are made,

121
00:08:21,135 --> 00:08:25,210
no violation of the rules.

122
00:08:25,210 --> 00:08:29,680
Sometimes new people tend to make these mistakes.

123
00:08:29,680 --> 00:08:35,405
So, it's good to have this experience prior to trying to collaborate.

124
00:08:35,405 --> 00:08:41,310
I advise people to start forming teams with people around their rank

125
00:08:41,310 --> 00:08:44,350
because sometimes it is frustrating when you

126
00:08:44,350 --> 00:08:47,770
join a high-ranked or, I would say, a very experienced team.

127
00:08:47,770 --> 00:08:50,100
It's better to say experience than rank,

128
00:08:50,100 --> 00:08:53,280
because you don't know sometimes how to contribute,

129
00:08:53,280 --> 00:09:02,285
you still don't understand all the competition dynamics and it might stall your progress,

130
00:09:02,285 --> 00:09:06,750
if you join a team and you're not able to contribute.

131
00:09:06,750 --> 00:09:10,790
So, I think it's better to, in most cases,

132
00:09:10,790 --> 00:09:18,600
to try and find people around your rank or around your experience and grow together.

133
00:09:18,600 --> 00:09:27,200
This way is the best form of collaboration I think.

134
00:09:27,200 --> 00:09:32,295
Another tip for collaborating is to try to collaborate

135
00:09:32,295 --> 00:09:34,510
with people that are likely to take

136
00:09:34,510 --> 00:09:37,695
diverse approaches or different approaches than yourself.

137
00:09:37,695 --> 00:09:42,035
You learn more this way and it is more likely that when you combine,

138
00:09:42,035 --> 00:09:43,570
you will get a better score.

139
00:09:43,570 --> 00:09:49,175
So, search for people who are sort of famous

140
00:09:49,175 --> 00:09:55,345
for doing certain things well, in order to get the most out of it,

141
00:09:55,345 --> 00:10:01,755
to learn more from each other and get better results on the leaderboard.

142
00:10:01,755 --> 00:10:10,120
About selecting submissions, I have employed a strategy that many people have done.

143
00:10:10,120 --> 00:10:14,570
So normally, I select the best submissions I see in

144
00:10:14,570 --> 00:10:20,490
my internal results and the one that works best on the leaderboard.

145
00:10:20,490 --> 00:10:23,750
At the same time, I also look for correlations.

146
00:10:23,750 --> 00:10:25,750
So, if two submissions,

147
00:10:25,750 --> 00:10:28,305
they tend to be the same pretty much.

148
00:10:28,305 --> 00:10:31,150
So, the one that was the best submission locally,

149
00:10:31,150 --> 00:10:33,175
was also the best on the leaderboard,

150
00:10:33,175 --> 00:10:38,060
I try to find

151
00:10:38,060 --> 00:10:44,300
other submissions that still work well but they are likely to be quite diverse.

152
00:10:44,300 --> 00:10:50,690
So, they have low correlations with my best submission because this way, I might capture,

153
00:10:50,690 --> 00:10:54,760
I might be lucky,

154
00:10:54,760 --> 00:11:03,500
it may be a special type of test dataset, and just by having a diverse submission,

155
00:11:03,500 --> 00:11:06,315
I might be lucky to get a good score.

156
00:11:06,315 --> 00:11:10,330
So that's the main idea about this.
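
The selection rule described above could be sketched like this. It is only an illustration under stated assumptions: higher scores are better, two final submissions are allowed, "still works well" means within a fixed tolerance of the best local score, and all names and numbers are hypothetical.

```python
def pearson(a, b):
    """Pearson correlation between two equally long prediction lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def pick_two(local, public, preds, tolerance=0.01):
    """Pick two submissions: best locally, plus best on the leaderboard or a diverse one."""
    best_local = max(local, key=local.get)
    best_public = max(public, key=public.get)
    if best_local != best_public:
        return best_local, best_public
    # The same submission wins on both, so look for a diverse runner-up
    # that still scores within `tolerance` of the best local score.
    good = [s for s in local
            if s != best_local and local[s] >= local[best_local] - tolerance]
    if not good:
        return best_local, None
    diverse = min(good, key=lambda s: pearson(preds[s], preds[best_local]))
    return best_local, diverse


local = {"a": 0.90, "b": 0.895, "c": 0.893, "d": 0.80}
public = {"a": 0.91, "b": 0.90, "c": 0.89, "d": 0.82}
preds = {
    "a": [0.1, 0.5, 0.9, 0.3],
    "b": [0.12, 0.52, 0.88, 0.31],
    "c": [0.3, 0.4, 0.7, 0.6],
    "d": [0.9, 0.1, 0.2, 0.3],
}
print(pick_two(local, public, preds))
```

In this toy data, submission "b" is almost a copy of "a", so the less correlated "c" is the one paired with the best submission.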

157
00:11:10,330 --> 00:11:18,110
Some tips I would like to share now in general about competitive modeling,

158
00:11:18,110 --> 00:11:21,500
online modeling and in Kaggle specifically.

159
00:11:21,500 --> 00:11:24,110
In these challenges, you never lose.

160
00:11:24,110 --> 00:11:27,605
[inaudible] lose, yes you may not win prize money.

161
00:11:27,605 --> 00:11:29,100
Out of 5000 people,

162
00:11:29,100 --> 00:11:30,890
sometimes it's difficult to be,

163
00:11:30,890 --> 00:11:35,510
almost impossible to be in the top three or four that

164
00:11:35,510 --> 00:11:39,890
give prizes, but you always gain in terms of knowledge,

165
00:11:39,890 --> 00:11:41,515
in terms of experience.

166
00:11:41,515 --> 00:11:45,800
You get to collaborate with other people who are talented in the field,

167
00:11:45,800 --> 00:11:52,625
you get to add it to your CV that you tried to solve this particular problem,

168
00:11:52,625 --> 00:11:57,140
and I can tell you there has been some criticism here;

169
00:11:57,140 --> 00:12:01,550
people doubt that doing these competitions boosts your employability, but

170
00:12:01,550 --> 00:12:06,285
I can tell you that I know many examples, and not only those

171
00:12:06,285 --> 00:12:09,760
who really reached the top of Kaggle, like Master

172
00:12:09,760 --> 00:12:13,460
and Grandmaster, where just by having this kind of experience,

173
00:12:13,460 --> 00:12:16,830
they have been able to find very decent jobs, even if they had

174
00:12:16,830 --> 00:12:20,935
completely different backgrounds from data science.

175
00:12:20,935 --> 00:12:22,970
So, I can tell you it matters.

176
00:12:22,970 --> 00:12:27,280
So, any time you spend here,

177
00:12:27,280 --> 00:12:30,915
it's definitely a win for you.

178
00:12:30,915 --> 00:12:35,980
I don't see how you can lose by competing in these challenges.

179
00:12:35,980 --> 00:12:38,700
I mean, if this is something you like, right.

180
00:12:38,700 --> 00:12:43,610
The whole predictive modeling, data science thing.

181
00:12:43,610 --> 00:12:46,225
Coffee tends to help,

182
00:12:46,225 --> 00:12:49,355
because you tend to spend longer hours.

183
00:12:49,355 --> 00:12:52,780
I tend to do this especially late at night.

184
00:12:52,780 --> 00:12:56,060
So it's definitely something to consider, or to be

185
00:12:56,060 --> 00:12:59,900
honest, any other beverage will do; it depends on what you like.

186
00:12:59,900 --> 00:13:03,310
I see it a bit like a game and I

187
00:13:03,310 --> 00:13:06,695
advise you to do the same because if you see it like a game,

188
00:13:06,695 --> 00:13:09,380
you never need to work for it.

189
00:13:09,380 --> 00:13:11,720
If you know what I mean.

190
00:13:11,720 --> 00:13:14,800
So it looks a bit like an RPG.

191
00:13:14,800 --> 00:13:19,075
In some way, you have some tools or weapons.

192
00:13:19,075 --> 00:13:22,645
These are all the algorithms and feature engineering techniques you can use.

193
00:13:22,645 --> 00:13:26,440
And then you have this score leaderboard and you try to beat

194
00:13:26,440 --> 00:13:31,140
all the bad guys and to beat the score and rise above them.

195
00:13:31,140 --> 00:13:33,685
So in a way it does look like a game.

196
00:13:33,685 --> 00:13:36,990
You know you try to use all the tools,

197
00:13:36,990 --> 00:13:40,675
all the skills that you have to try to beat the score.

198
00:13:40,675 --> 00:13:44,210
So, I think if you see it like a game it really helps you.

199
00:13:44,210 --> 00:13:50,660
You don't get tired and you enjoy the process more.

200
00:13:50,660 --> 00:13:53,490
I do advise you to take a break though,

201
00:13:53,490 --> 00:13:56,930
from my experience you may spend long hours hitting

202
00:13:56,930 --> 00:14:00,155
on it and that's not good for your body.

203
00:14:00,155 --> 00:14:06,450
You definitely need to take some breaks and do some physical exercise.

204
00:14:06,450 --> 00:14:08,225
Go out for a walk.

205
00:14:08,225 --> 00:14:11,685
I think it can help. Most of the time,

206
00:14:11,685 --> 00:14:16,330
resting your mind this way can actually help you do better.

207
00:14:16,330 --> 00:14:19,445
You have a more rested head, clearer thinking.

208
00:14:19,445 --> 00:14:21,620
So, I definitely advise you to do this,

209
00:14:21,620 --> 00:14:23,080
generally don't overdo it.

210
00:14:23,080 --> 00:14:27,900
I have overnighted in the past but I advise you not to do the same.

211
00:14:27,900 --> 00:14:32,060
And another thing that I would like to

212
00:14:32,060 --> 00:14:36,045
highlight is that the Kaggle community is great.

213
00:14:36,045 --> 00:14:40,790
It is one of the most open and helpful communities

214
00:14:40,790 --> 00:14:44,840
I have experienced in any social context,

215
00:14:44,840 --> 00:14:52,860
maybe apart from charities, but if you have a question and you post it on

216
00:14:52,860 --> 00:14:56,670
the forums or other associated channels like Slack,

217
00:14:56,670 --> 00:15:01,185
people are always willing to help you. That's great,

218
00:15:01,185 --> 00:15:04,900
because there are so many people out there and most probably they

219
00:15:04,900 --> 00:15:08,975
know the answer or they can help you for a particular problem.

220
00:15:08,975 --> 00:15:10,705
And this is invaluable.

221
00:15:10,705 --> 00:15:15,500
So many times I have really made use of this,

222
00:15:15,500 --> 00:15:20,420
of this option and it really helps.

223
00:15:20,420 --> 00:15:26,580
You know, this kind of mentality was there even before the scene was gamified.

224
00:15:26,580 --> 00:15:30,930
When I say gamified, now you get points by

225
00:15:30,930 --> 00:15:36,350
sharing, in a way, by sharing code or participating in discussions.

226
00:15:36,350 --> 00:15:41,400
But in the past, people were doing it without really getting something out of it.

227
00:15:41,400 --> 00:15:44,480
It may be the open source mentality of

228
00:15:44,480 --> 00:15:49,650
data science, or the fact that many people participating are researchers.

229
00:15:49,650 --> 00:15:52,990
I don't know but it really is a field

230
00:15:52,990 --> 00:15:58,565
where sharing and helping others seem to be really important.

231
00:15:58,565 --> 00:16:06,820
So, I do advise you to consider this and don't be afraid to ask in these forums.

232
00:16:06,820 --> 00:16:09,850
Another thing that I do advise,

233
00:16:09,850 --> 00:16:15,195
is that after the competition has ended irrespective of how well or not you've done,

234
00:16:15,195 --> 00:16:19,135
is to go and look for other people and what they have done.

235
00:16:19,135 --> 00:16:22,520
Normally, there are threads where people share their approaches,

236
00:16:22,520 --> 00:16:26,085
sometimes they share the whole approach with code too, sometimes they just

237
00:16:26,085 --> 00:16:29,720
give tips and you know

238
00:16:29,720 --> 00:16:33,050
this is where you can upgrade your tools

239
00:16:33,050 --> 00:16:37,305
and you can see what other people have done and make improvements.

240
00:16:37,305 --> 00:16:39,390
And in tandem with this,

241
00:16:39,390 --> 00:16:42,780
you should have a notebook of useful methods that

242
00:16:42,780 --> 00:16:46,535
you keep updating at the end of every competition.

243
00:16:46,535 --> 00:16:48,560
So, you found an approach that was good,

244
00:16:48,560 --> 00:16:51,390
you just add it to that notebook and next

245
00:16:51,390 --> 00:16:55,445
time you encounter the same or similar competition you get

246
00:16:55,445 --> 00:16:58,330
that notebook out and you apply

247
00:16:58,330 --> 00:17:03,150
the same techniques that worked in the past, and this is how you get better.

248
00:17:03,150 --> 00:17:06,675
Actually, if I now start a competition without that notebook,

249
00:17:06,675 --> 00:17:12,830
I think it will take me three or four times longer in order to get to

250
00:17:12,830 --> 00:17:16,000
the same score because a lot of the things that I do

251
00:17:16,000 --> 00:17:19,830
now depend on stuff that I have done in the past.

252
00:17:19,830 --> 00:17:21,440
So, it's definitely helpful,

253
00:17:21,440 --> 00:17:26,120
consider creating this notebook or library of all the approaches or

254
00:17:26,120 --> 00:17:32,445
approaches that have worked in the past in order to have an easier time going on.

255
00:17:32,445 --> 00:17:36,940
And that was what I wanted to share with you and

256
00:17:36,940 --> 00:17:42,000
thank you very much for bearing with me, and see you next time.