[SOUND] Hi and welcome back. In the previous videos we discussed the concept of validation and overfitting, discussed how to choose a validation strategy based on the properties of the data we have, and finally learned to identify the data split made by the organizers.

After all this work has been done, we honestly expect that validation will, in a way, substitute the leaderboard for us. That is, the score we see on validation will be the same as on the private leaderboard, or at least, if we improve our model on validation, there will be improvements on the private leaderboard. This is usually true, but sometimes we encounter problems here. In most cases these problems can be divided into two big groups.

In the first group are the problems we encounter during local validation. Usually they are caused by inconsistency of the data; a widespread example is getting different optimal parameters for different folds. In this case we need to make a more thorough validation. The problems from the second group often reveal themselves only when we send our submissions to the platform and observe that the scores on validation and on the leaderboard don't match. In this case, the problem usually occurs because we can't mimic the exact train/test split in our validation. These are tough problems, and we definitely want to be able to handle them.

So before we start, let me provide an overview of this video. For both the validation and submission stages we will discuss the main problems, their causes, and how to handle them. And then we'll talk a bit about when we can expect a leaderboard shuffle.

Let's start with a discussion of validation stage problems. Usually, they attract our attention during validation. Generally, the main problem is a significant difference in scores and optimal parameters for different train/validation splits. Let's start with an example, so we can easily explain this problem. Consider that we need to predict sales in a shop in February. Say we have target values for the last year, and, usually, we would take the last month for validation. This means January, but clearly January has many more holidays than February, and people tend to buy more, which causes target values to be higher overall.
So the mean squared error of our predictions for January will be greater than for February. Does this mean that the model will perform worse for February? Probably not, at least not in terms of overfitting. As we can see, sometimes this kind of model behavior can be expected. But what if there is no clear reason why scores differ for different folds? Let's identify several common reasons for this and see what we can do about it.

The first hypothesis we should consider is that we have too little data. For example, consider a case when we have a lot of patterns and trends in the data, but we do not have enough samples to generalize these patterns well. In that case, a model will utilize only some general patterns, and for each train/validation split these patterns will partially differ. This, indeed, will lead to a difference in scores of the model. Furthermore, the validation samples will be different each time, only increasing the dispersion of scores across folds.

The second hypothesis is that the data is too diverse and inconsistent. For example, if you have very similar samples with different target values, a model can confuse them. Consider two cases. First, if one of such samples is in the train set while the other is in the validation set, we can get a pretty high error for the second sample. In the second case, if both samples are in the validation set, we will get smaller errors for them. Or let's remember another example of diverse data we have already discussed a bit earlier: the example of predicting sales for January and February. Here we know the nature, or the reason, for the differences in scores. As a quick note, notice that in this example we can reduce this diversity a bit if we validate on the February from the previous year.

So the main reasons for a difference in scores and optimal model parameters for different folds are, first, having too little data, and second, having too diverse and inconsistent data. Now let's outline our actions here. If we are facing this kind of problem, it can be useful to make a more thorough validation. You can increase K in KFold, but usually 5 folds are enough. Make KFold validation several times with different random splits and average the scores to get a more stable estimate of the model's quality.
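To make this concrete, here is a minimal sketch of repeated KFold with score averaging, using scikit-learn on synthetic placeholder data; the particular model, data, and numbers are illustrative assumptions, not something prescribed by this course.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic placeholder data; in a competition this would be your train set.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)

all_scores = []
for seed in [0, 1, 2]:  # several KFold runs with different random splits
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error")
    all_scores.extend(-scores)  # store per-fold MSE

# Averaging over all folds and all seeds gives a more stable quality estimate.
print("mean MSE:", np.mean(all_scores), "+/-", np.std(all_scores))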
The same way, we can choose the best parameters for the model. If there is a chance to overfit, it is useful to use one set of KFold splits to select parameters and another set of KFold splits to check the model's quality.

Examples of competitions which required extensive validation include the Liberty Mutual Group Property Inspection Prediction competition and the Santander Customer Satisfaction competition. In both of them, scores of the competitors were very close to each other, and thus participants tried to squeeze more from the data but not overfit, so thorough validation was crucial.

Now, having discussed validation stage problems, let's move on to submission stage problems. Sometimes you can diagnose these problems in the process of doing careful validation, but still, you often encounter these types of problems only when you submit your solution to the platform. But then again, careful analysis is your friend when it comes down to finding the root of the problem.

Generally speaking, there are two cases of these issues. In the first case, the leaderboard score is consistently higher or lower than the validation score. In the second, the leaderboard score is not correlated with the validation score at all. So in the worst case, we can improve our score on validation while, on the contrary, the score on the leaderboard decreases. As you can imagine, these problems can be much more troublesome.

Now remember that the main rule of making a reliable validation is to mimic the train/test split made by the organizers. I won't lie to you, it can be quite hard to identify and mimic the exact train/test split. Because of that, I highly encourage you to start submitting your solutions right after you enter the competition. It's also good to start exploring other possible roots of this problem.

Let's first sort out causes we could observe during the validation stage. Recall, we already have different model scores on different folds during validation. Here it is useful to see the leaderboard as just another validation fold. Then, if we already have different scores in KFold, getting a not very similar result on the leaderboard is not surprising. Moreover, we can calculate the mean and standard deviation of the validation scores and estimate whether the leaderboard score is expected.
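As a small illustration of that check, here is a hedged sketch with made-up scores; treating anything within a few standard deviations of the fold mean as "expected" is just a rule of thumb, not an official threshold.

import numpy as np

# Hypothetical per-fold validation scores and a public leaderboard score.
fold_scores = np.array([0.812, 0.805, 0.819, 0.801, 0.815])
leaderboard_score = 0.780

mean, std = fold_scores.mean(), fold_scores.std()
# Treat the leaderboard as just another fold: a score within roughly
# mean +/- 3 * std is unsurprising given the spread we already observe.
if abs(leaderboard_score - mean) <= 3 * std:
    print("Leaderboard score is within the expected range.")
else:
    print("Leaderboard score is outside the expected range; investigate.")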
But if this is not the case, then something is definitely wrong. There could be two more reasons for this problem. The first reason: we have too little data in the public leaderboard, which is pretty self-explanatory. Just trust your validation, and everything will be fine. And the second: the train and test data are from different distributions.

Let me explain what I mean when I talk about different distributions. Consider a regression task of predicting people's height from their photos on Instagram. The blue line represents the distribution of heights for men, while the red line represents the distribution of heights for women. As you can see, these distributions are different. Now let's consider that the train data consists only of women, while the test data consists only of men. Then all model predictions will be around the average height for women, and the distribution of these predictions will be very similar to that of the train data. No wonder that our model will have a terrible score on the test data.

Now, because our course is a practical one, let's take a moment and think about what you can do if you encounter this in a competition. Okay, let's start with a general approach to such problems. At the broadest level, we need to find a way to tackle different distributions in train and test. Sometimes this kind of problem can be solved by adjusting your solution during the training procedure. But sometimes this problem can be solved only by adjusting your solution through the leaderboard, that is, through leaderboard probing.

The simplest way to solve this particular situation in a competition is to try to figure out the optimal constant prediction for the train and test data, and shift your predictions by the difference. Right here, we can calculate the average height of women from the train data. Calculating the average height of men is a bit trickier. If the competition's metric is mean squared error, we can send two constant submissions, write down a simple formula, and find out that the average target value for the test is equal to 70 inches. In general, this technique is known as leaderboard probing, and we will discuss it in more detail later.
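To spell out that simple formula: under MSE, a constant prediction c scores mean(y^2) - 2*c*mean(y) + c^2, so two constant submissions are enough to solve for the test mean. The sketch below uses made-up leaderboard scores, chosen only so the answer comes out to 70 inches.

# Scores reported by the leaderboard for two constant submissions c1 and c2
# under MSE: score(c) = mean(y^2) - 2 * c * mean(y) + c^2.
# Subtracting the two equations eliminates mean(y^2):
#   s1 - s2 = -2 * (c1 - c2) * mean(y) + (c1**2 - c2**2),
# so mean(y) = (c1**2 - c2**2 - (s1 - s2)) / (2 * (c1 - c2)).

def estimate_test_mean(c1, s1, c2, s2):
    return (c1 ** 2 - c2 ** 2 - (s1 - s2)) / (2 * (c1 - c2))

# Illustrative values only: submit the constants 0 and 100 inches and
# read their MSE scores off the public leaderboard.
c1, s1 = 0.0, 4925.0    # hypothetical score of the all-zeros submission
c2, s2 = 100.0, 925.0   # hypothetical score of the all-100s submission
print(estimate_test_mean(c1, s1, c2, s2))  # -> 70.0, the test average height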
So now we know the difference between the average target values for the train and the test data, which is equal to 7 inches. And as the third step of adjusting our submission to the leaderboard, we could just try to add 7 to all predictions. But note that from this point on it is not validation anymore, it is pure leaderboard probing. Yes, we probably could discover this during exploratory data analysis and try to make a correction in our validation scheme, but sometimes it is not possible without leaderboard probing, just like in this example. A competition which had something similar is the Quora Question Pairs competition. There, the distributions of the target in train and test were different, so one could get a good improvement in score by adjusting predictions to the leaderboard.

But fortunately, this case is rare enough. More often, we encounter situations which are more like the following case. Consider that now the train set consists not only of women but mostly of women, and the test set, vice versa, consists not only of men but mostly of men. The main strategy to deal with these kinds of situations is simple: again, remember to mimic the train/test split. If the test consists mostly of men, force the validation to have the same distribution (we will sketch this idea in code in a moment). In that case, you ensure that your validation will be fair. This is true both for getting correct scores and for finding optimal parameters. For example, we could have quite different scores and optimal parameters for the women's and men's parts of the data set. Ensuring the same distribution in test and validation helps us get scores and parameters relevant to the test.

I want to mention two examples of this here. First, the Data Science Game Qualification Phase: Music recommendation challenge. And second, the competition with CTR prediction which we discussed earlier in the data splitting topic. Let's start with the second one. Do you remember the problem? We have a task of predicting CTR. The train data, which basically was the history of displayed ads, obviously didn't contain ads which were not shown. On the contrary, the test data consisted of every possible ad. Notice this is the exact case of different distributions in train and test.
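Before we look at how that was handled, here is a minimal, hedged sketch of the general idea of forcing the validation set to follow the test distribution; the data, the "group" column name, and the proportions are all made up for illustration.

import numpy as np
import pandas as pd

# Hypothetical frames: "group" stands for whatever differs between train and
# test (gender in the height example, shown / not-shown in the ads example).
rng = np.random.RandomState(0)
train = pd.DataFrame({"group": rng.choice(["women", "men"], size=1000, p=[0.8, 0.2])})
test = pd.DataFrame({"group": rng.choice(["women", "men"], size=1000, p=[0.2, 0.8])})

test_share = test["group"].value_counts(normalize=True)
val_size = 200

# Sample a validation set whose group proportions follow the test set, so
# validation scores and tuned parameters stay relevant to the test. In a real
# competition you would keep the features and target columns along with "group".
val_parts = [
    train[train["group"] == g].sample(int(round(val_size * share)), random_state=0)
    for g, share in test_share.items()
]
validation = pd.concat(val_parts)
print(validation["group"].value_counts(normalize=True))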
And again, we need to set up our validation to mimic the test here. So we have this huge bias towards shown ads in the train, and to set up a correct validation, we had to complete the validation set with rows of not-shown ads.

Now, let's go back to the first example. In that competition, participants had to predict whether a user would listen to a song recommended by the assistant. So the test contained only recommended songs, but the train, on the contrary, contained both recommended songs and songs users selected themselves. So again, one could adjust the validation by removing the songs users selected themselves. And if we do not account for that fact, then improving our model on the user-selected songs can result in the validation score going up, but it doesn't have to result in the same improvement on the leaderboard.

Okay, let's conclude this overview of handling validation problems at the submission stage. If you have too little data in the public leaderboard, just trust your validation. If that's not the case, make sure that you did not overfit. Then check if you made a correct train/test split, as we discussed in the previous video. And finally, check if you have different distributions in train and test.

Great, let's move on to the next point of this video. For now, I hope you did everything all right. First, you did extensive validation. Second, you chose a correct splitting strategy for the train/validation split. And finally, you ensured the same distributions in validation and test. But sometimes you have to expect a leaderboard shuffle anyway, and not just for you, but for everyone.

First, for those who've never heard of it, a leaderboard shuffle happens when participants' positions on the public and private leaderboards are drastically different. Take a look at this screenshot from the Two Sigma Financial Modeling Challenge competition. The green and the red arrows show how far a team moved. For example, the participant who finished 3rd on the private leaderboard was 392nd on the public leaderboard. Let's discuss three main reasons for that shuffle: randomness, too little data, and different public/private distributions.
So first, randomness. This is the case when all participants have very similar scores. This can be either a very good score or a very poor one, but the main point here is that the main reason for differences in scores is randomness. To understand this a bit more, let's go through two quick examples.

The first one is the Liberty Mutual Group Property Inspection Prediction competition. In that competition, the scores of competitors were very close, and though randomness didn't play a major role there, still many people overfit to the public leaderboard. The second example, which is the opposite of the first, is the Two Sigma Financial Modeling Challenge competition. Because the financial data in that competition was highly unpredictable, randomness played a major role in it. So one could say that the leaderboard shuffle there was among the biggest shuffles on the Kaggle platform.

Okay, that was randomness. The second reason to expect a leaderboard shuffle is too little data overall, and in the private test set especially. An example of this is the Restaurant Revenue Prediction competition. In that competition, the train set consisted of fewer than 200 rows, and the test set consisted of fewer than 400 rows. So as you can see, a shuffle here was more than expected.

The last reason for a leaderboard shuffle could be different distributions between the public and private test sets. This is usually the case with time series prediction, like the Rossmann Store Sales competition. When we have a time-based split, we usually have the first few weeks in the public leaderboard and the next few weeks in the private leaderboard. As people tend to adjust their submissions to the public leaderboard and overfit, we can expect worse results on the private leaderboard. Here again, trust your validation and everything will be fine. Okay, those were the reasons for leaderboard shuffle.

Now let's conclude both this video and the entire validation topic. Let's start with the video. First, if you have a big dispersion of scores at the validation stage, you should do extensive validation. That means averaging scores from different KFold splits, and tuning the model on one split while evaluating its score on another.
Second, if submission scores do not match the local validation score, you should first check if you have too little data in the public leaderboard, second check that you did not overfit, then check if you chose the correct splitting strategy, and finally check if train and test have different distributions. And you can expect a leaderboard shuffle because of three key things: randomness, a small amount of data, and different public/private test distributions.

So that's it. In this topic we defined validation and its connection to overfitting, described common validation strategies, demonstrated major data splitting strategies, and finally analyzed and learned how to tackle the main validation problems. Remember this, and it will absolutely help you out in competitions. Make sure you understand the main idea of validation well: you need to mimic the train/test split. [MUSIC]