Since we already know the main strategies for validation, we can move on to more concrete examples. Let's imagine we're solving a competition with a time series prediction task; namely, we need to predict the number of customers for a shop for each day in the next month. How should we divide the data into train and validation here? Basically, we have two possibilities. Having the data frame, first, we can take random rows for validation, and second, we can make a time-based split: take everything before some date as train and everything after that date as validation. Let's look at these two options.
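To make these two options concrete, here is a minimal pandas sketch. The data frame and column names are hypothetical, just for illustration.

```python
import pandas as pd

# Toy daily data: one row per day (names and values are hypothetical)
df = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=120, freq="D"),
    "num_customers": range(120),
})

# Option 1: random row-wise split
val_random = df.sample(frac=0.2, random_state=42)
train_random = df.drop(val_random.index)

# Option 2: time-based split, everything before a cutoff date is train
cutoff = pd.Timestamp("2017-04-01")
train_time = df[df["date"] < cutoff]
val_time = df[df["date"] >= cutoff]
```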
Now, when you think about the features you need to generate and the model you need to train, how complicated are these two cases? In the first picture, we can just interpolate between the previous and the next value to get our predictions. Very easy. But wait: do we really have future information about the number of customers in the real world? Well, probably not. But does this mean that this validation is useless? Again, it doesn't. What it really means is that if we make the train/validation split different from the train/test split, then we are going to create a useless model. And here we get to the main rule of making a reliable validation: we should, if possible, set up validation to mimic the train/test split. But more on that a little later.

Let's go back to our example. In the second picture, for most test points we have neither the next value nor the previous one. Now, let's imagine we have a pool of different models trained on different features, and we selected the best model for each type of validation. Now, the question: will these models differ? And if they do, how significantly? Well, it is certain that if you want to predict what will happen a few points later, then a model which favors features like the previous and the next target values will perform poorly. That happens because in this case, we just don't have such observations for the test data. But we have to give the model something as the feature values, and those will probably be NaNs or missing values. How much experience does the model have with this type of situation? Not much. The model just won't expect that, and quality will suffer.

Now, let's turn to the second case. Here, we actually need to rely more on the time trend. And so the features, and consequently the model, we really need here are more like: what was the trend over the last couple of months or weeks? So the model selected as the best one for the first type of validation will perform poorly for the second type of validation. Conversely, the best model for the second type of validation was trained to predict many points ahead, and it does not use adjacent target values. So, to conclude this comparison: these models indeed differ significantly, to the point that the most useful features for one model are useless for the other.

But the generated features are not the only problem here. Given that the actual train/test split is time-based, here is the question: if we carefully generate features that draw attention to time-based patterns, will we get a reliable validation with a random split? Let me say this again in other words: if we create features which are useful for a time-based split and useless for a random split, will it be correct to use a random split to select the model? It's a tough question. Let's take a moment and think about it.

Okay, now let's answer it. Consider the case when the target follows a linear trend. In the first picture, we see the exact case of randomly chosen validation. In the second, we see the same time-based split we considered before. First, let's notice that, in general, model predictions will be close to the target's mean value calculated on the train data. So in the first picture, if the validation points are closer to this mean value than the test points are, we'll get a better score on validation than on test. But in the second case, the validation points are roughly as far from the target's mean value as the test points are. And so, in the second case, the validation score will be much more similar to the test score. Great. As we just found out, in the case of incorrect validation, not only the features but the target values themselves can lead to an unrealistic estimate of the score.
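Here is a toy numerical sketch of that linear-trend argument. All data is synthetic, and the "model" simply predicts the mean of the train targets; the only point is to compare how close each validation score is to the test score.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100)
y = t + rng.normal(0, 3, size=100)   # target follows a linear trend
test = y[80:]                        # the "future" test period

# Random split: validation points scattered across the whole train range
val_idx = rng.choice(80, size=16, replace=False)
train_idx = np.setdiff1d(np.arange(80), val_idx)
pred = y[train_idx].mean()           # constant mean-value "model"
print("random-split val MAE:", np.abs(y[val_idx] - pred).mean())
print("test MAE:            ", np.abs(test - pred).mean())

# Time-based split: validation is the last chunk before the test period
pred2 = y[:64].mean()
print("time-split val MAE:  ", np.abs(y[64:80] - pred2).mean())
print("test MAE:            ", np.abs(test - pred2).mean())
```

With numbers like these, the random-split validation error comes out far below the test error, while the time-based validation error is a much better estimate of it.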
Now, that example was quite similar to what you may encounter while solving real competitions. Numerous competitions use a time-based split, namely the Rossmann Store Sales competition, the Grupo Bimbo Inventory Demand competition, and others. So, to quickly summarize the valuable example we've just discussed: different splitting strategies can differ significantly, namely in the generated features, in the way the model will rely on those features, and in some kind of target leak. That means that to be able to find smart ideas for feature generation and to consistently improve our model, we absolutely want to identify the train/test split made by the organizers of the competition and reproduce it.

Let's now categorize most of the splitting strategies used in competitions and discuss examples of them. Most splits can be united into three categories: a random split, a time-based split, and an ID-based split. Let's start with the most basic one, the random split. The most common way of making a train/test split is to split the data randomly by rows. This usually means that the rows are independent of each other. For example, say we have the task of predicting whether a client will pay off a loan. Each row represents a person, and these rows are fairly independent of each other. Now, let's consider that there is some dependency, for example within family members or people who work in the same company. If a husband can pay off a credit, probably his wife can do it too. That means that if, by some chance, a husband is present in the train data and his wife is present in the test data, we can probably exploit this and devise a special feature for that case. Looking for such possibilities and realizing that kind of feature is really interesting. More on this case, and on the others I will mention here, comes in the next lesson of our course.
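As a quick aside, if you instead wanted your own validation to respect such dependencies, a group-aware splitter keeps related rows on the same side of the split. A minimal sketch, assuming a hypothetical family_id column:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: two people per family; "family_id" marks the dependency
df = pd.DataFrame({
    "person_id": range(8),
    "family_id": [0, 0, 1, 1, 2, 2, 3, 3],
    "paid_off_loan": [1, 1, 0, 0, 1, 0, 1, 1],
})

# Every family ends up entirely in train or entirely in validation
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, val_idx = next(splitter.split(df, groups=df["family_id"]))
print(df.iloc[train_idx]["family_id"].unique())
print(df.iloc[val_idx]["family_id"].unique())
```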
So again, that was the random split. The second method is the time-based split. We already discussed an example of this split at the beginning of this video. In that case, we generally take everything before a particular date as the training data, and everything after that date as the test data. This can be a signal to use a special approach to feature generation, especially to make useful features based on the target. For example, if we are to predict the number of customers for a shop for each day in the next week, we can come up with something like the number of customers for the same day in the previous week, or the average number of customers over the past month. As I mentioned before, this split is widespread enough: it was used in the Rossmann Store Sales competition, in the Grupo Bimbo Inventory Demand competition, and in other competitions.

A special case of validation for the time-based split is moving-window validation. In the previous example, we can move the date which divides train and validation, successively using week after week as a validation set, just like in this picture.
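To illustrate those target-based features, here is a small pandas sketch; the column names and the daily frequency are assumptions for the example. For the moving-window validation itself, scikit-learn's TimeSeriesSplit implements a similar idea of successively advancing the train/validation boundary.

```python
import pandas as pd

# Hypothetical daily data: one row per day for a single shop
df = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=90, freq="D"),
    "num_customers": range(90),
}).sort_values("date")

# Number of customers on the same day of the previous week
df["customers_same_day_prev_week"] = df["num_customers"].shift(7)

# Average number of customers over the past month, shifted by one day
# so the current day's target never leaks into its own feature
df["customers_mean_past_month"] = (
    df["num_customers"].shift(1).rolling(window=30).mean()
)
```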
Now, having dealt with the random and the time-based splits, let's discuss the ID-based split. An ID can be a unique identifier of a user, a shop, or any other entity. For example, let's imagine we have to solve the task of music recommendations for completely new users. That means we have different sets of users in train and test. If so, we can probably conclude that features based on a user's history, for example how many songs the user listened to in the last week, will not help for completely new users. As an example of an ID-based split, I want to tell you a bit about the Caterpillar Tube Pricing competition. In that competition, the train/test split was done on some category ID, namely the tube ID.

There is an interesting case when we should employ an ID-based split, but the IDs are hidden from us. Here, I want to mention two examples of competitions with a hidden ID-based split: the Intel & MobileODT Cervical Cancer Screening competition, and The Nature Conservancy Fisheries Monitoring competition. In the first competition, we had to classify patients into three classes, and for each patient we had several photos. Indeed, photos of one patient belong to the same class. Again, the sets of patients in train and test did not overlap, and we should also ensure this in the train/validation split.

As another example, in The Nature Conservancy Fisheries Monitoring competition, there were photos of fish from several different fishing boats. Again, the fishing boats in train and test did not overlap, so one could easily overfit by ignoring this and making a random split. Because the IDs were not given, competitors had to derive them by themselves. In both of these competitions, it could be done by clustering the pictures. The easiest case was when pictures were taken one right after another, so the images were quite similar. You can find more details of such clustering in the kernels of these competitions.

Now, having covered these main standalone methods, we also need to know that they sometimes may be combined. For example, if we have the task of predicting sales in a shop, we can choose a splitting date for each shop independently, instead of using one date for every shop in the data. Or, another example: if we have search queries from multiple users who use several search engines, we can split the data by a combination of user ID and search engine ID. Examples of competitions with combined splits include the Western Australia Rental Prices competition by Deloitte, and the qualification phase of the Data Science Game 2017. In the first competition, train/test was split by a single date, but the public/private split was made by different dates for different geographic areas. In the second competition, participants had to predict whether a user of an online music service would listen to a song. The train/test split was made in the following way: for each user, the last song they listened to was placed in the test set, while all other songs were placed in the train set.
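That last split is straightforward to reproduce for validation. A minimal sketch with toy data and hypothetical column names:

```python
import pandas as pd

# Hypothetical listening log: one row per (user, song) event
events = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2, 3],
    "song_id":   [10, 11, 12, 20, 21, 30],
    "timestamp": pd.to_datetime([
        "2017-01-01", "2017-01-02", "2017-01-03",
        "2017-01-01", "2017-01-05", "2017-01-02",
    ]),
})

# The last event of each user forms the validation ("test") set
is_last = events.groupby("user_id")["timestamp"].transform("max") == events["timestamp"]
val = events[is_last]
train = events[~is_last]
```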
Fine. These were the main splitting strategies employed in competitions. Again, the main idea I want you to take away from this lesson is that your validation should always mimic the train/test split made by the organizers. It could be something non-trivial. For example, in the Home Depot Product Search Relevance competition, participants were asked to estimate search relevancy. In general, the data consisted of search terms and search results for those terms, but the test set contained completely new search terms. So we couldn't use either a random split or a search-term-based split for validation. The first favored more complicated models, which led to overfitting, while the second, conversely, led to underfitting. So, in order to select optimal models, it was crucial to mimic the ratio of new search terms from the train/test split; a sketch of this idea closes the video.

Great. This is it. We've just covered the major data splitting strategies employed in competitions: the random split, the time-based split, the ID-based split, and their combinations. This will help us build a reliable validation, make useful decisions about feature generation, and, in the end, select the models which will perform best on the test data. As the main point of this video, remember the general rule of making a reliable validation: set up your validation to mimic the train/test split of the competition.
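To close, here is a hypothetical sketch of the Home Depot idea mentioned above: building a validation set with a chosen share of rows from completely new search terms. Everything here, from the column names to the 30% share, is made up; in a real competition you would estimate that share from the actual test set.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 500 query rows over 50 distinct search terms
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "search_term": rng.choice([f"term_{i}" for i in range(50)], size=500),
    "relevance": rng.uniform(1, 3, size=500),
})

# Hold out a fraction of terms entirely: their rows become the
# "completely new terms" part of validation
terms = df["search_term"].unique()
rng.shuffle(terms)
unseen_terms = set(terms[: int(len(terms) * 0.3)])  # assumed 30% share

val_new = df[df["search_term"].isin(unseen_terms)]
seen_rows = df[~df["search_term"].isin(unseen_terms)]
val_seen = seen_rows.sample(frac=0.1, random_state=42)  # plus some known terms

validation = pd.concat([val_new, val_seen])
train = df.drop(validation.index)
```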