We can continue our discussion with StackNet. StackNet is a scalable meta-modeling methodology that utilizes stacking to combine multiple models in a neural-network-like architecture of multiple levels. It is scalable because, within the same level, we can run all the models in parallel. It utilizes stacking because it makes use of the technique we mentioned before, where we split the data, make predictions on some holdout data, and then use another model to train on those predictions (there is a short code sketch of this step below). And as we will see later on, this resembles a neural network a lot.

Now let us continue the naive example we gave before with the students and the teacher, in order to understand what, conceptually, it would mean in the real world to add another layer. In that example, we had a teacher who was trying to combine the answers of different students, and she was outputting an estimate of 17 under certain assumptions. We can make this example more interesting by introducing one more meta-learner. Let's call him Mr. RF, who is also a physics teacher. Mr. RF believes that LR should have a bigger contribution to the ensemble, because he has been giving him private lessons and he knows he couldn't be that far off. So he is able to see the data in a slightly different way, capitalize on different parts of these predictions, and make a different estimate. Whereas the teachers could work it out and take an average, we can instead introduce a higher authority, another layer of modeling. Let's call him the headmaster, GBM, whose job is to make better predictions. GBM doesn't need to know the answers the students have given; the only thing he needs to know is the input from the teachers. And in this case, he is more keen to trust his physics teacher, outputting a prediction of 16.2.

Why would this be of any use to people? Isn't that already complicated? Why would we ever want to try something so complicated? Let me give you an example of a competition where my team used four layers of stacking in order to win. We used two different sources of input data and generated multiple models.
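Before going into the competition details, here is the minimal holdout-stacking sketch promised above. It is an illustrative assumption, not the setup from any competition: the toy data, the base models, and the 50/50 split are all choices made just for the example, using scikit-learn-style estimators.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Toy data standing in for any supervised problem.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

# Split the training data: base models learn on one part
# and predict on the held-out part.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.5, random_state=0)

base_models = [RandomForestRegressor(n_estimators=100, random_state=0),
               LinearRegression()]

# The holdout predictions of the base models become the features
# that the next-level (meta) model trains on.
holdout_preds = np.column_stack(
    [m.fit(X_train, y_train).predict(X_holdout) for m in base_models])

# The meta model is trained on predictions, not on the raw features.
meta_model = LinearRegression().fit(holdout_preds, y_holdout)
```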
These were mainly XGBoost models and logistic regressions, and we then fed those into a four-layer architecture in order to get the top score. And although we could have gotten away without using that fourth layer, we still needed up to level three in order to win. So you can understand the usefulness of deploying deep stacking.

Another example is the Homesite competition, organized by Homesite Insurance, where again we created many different views of the data, so we had different transformations, we generated many models, and we fed those models into a three-level architecture. I think we didn't strictly need the third layer there; we could probably have gotten away with only two levels. But again, deep stacking was necessary in order to win. So there is your answer: deep stacking on multiple levels really helps you to win competitions.

In the spirit of fairness and openness, there has been some criticism of large ensembles: that maybe they don't have commercial value, that they are computationally expensive. I have to add three things on that. The first is that what is considered expensive today may not be expensive tomorrow, and we have seen that, for example, with deep learning, where, with the advent of GPUs, models have become a hundred times faster, and deep learning has become very, very popular again. The other thing is that you don't always need to build very, very deep ensembles; small ensembles can still really help. So knowing how to do them can add value to businesses, based on different assumptions: how fast they want the decisions, how much uplift you can get from stacking (which may vary; sometimes it's more, sometimes it's less), and, generally, how much computing power they have. We can make a case that even stacking on multiple layers can be very useful. And the last point is that these are predictive modeling competitions, so it is a bit like the Olympics. It is nice to be able to see the theoretical best you can get, because this is how innovation takes over; this is how we move forward.

We can express StackNet as a neural network.
Normally, in a neural network, we have this architecture of hidden units, where each unit is connected to the input in the form of a linear regression. It actually looks pretty much like a linear regression: you have a set of coefficients and a constant value, which in neural networks is called the bias, and this is how you produce the prediction of one of the hidden units, which are then collected to create the output. The concept of StackNet is actually not that much different. The only thing we want is to not be limited to that linear regression, or to that perceptron; we want to be able to use any machine learning algorithm. Putting that aside, the architecture could be fairly similar.

So how do we train this? In a typical neural network, we use backpropagation. Here, in this context of trying to make the network work with any input model, that is not feasible, because not all models are differentiable. This is why we use stacking. Stacking here is the way to link the output of a node, its prediction, with the target variable; this is also how the link is made from the input features to a node.

However, if you remember, the way stacking works is that you have some training data and you need to divide it into two halves. You use the first part, called train, in order to make predictions on the other part, called valid. Assuming that adding more layers gives us some uplift, if we wanted to do this again, we would have to re-split the valid data into two parts; let's call them mini-train and mini-valid. And you can see the problem here. If we have really big data, this may not be an issue, but in certain situations we don't have that much data. Ideally, we would like to do this without having to constantly re-split our data, and thereby shrink the training data set. This is why we use a K-fold paradigm.

Let's assume we have a training data set with four features, x0, x1, x2, x3, and the y variable, or target.
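To make the walkthrough that follows concrete, here is a hypothetical version of that setup, a sketch assuming numpy and scikit-learn's KFold; the toy data is made up, and the four folds play the role of the four colored parts described next.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)

# Toy training set: four features x0..x3 and a target y.
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=100)

# k = 4 is a hyper-parameter; each fold is one of the four parts.
kf = KFold(n_splits=4, shuffle=True, random_state=0)
folds = list(kf.split(X))   # each entry: (train indices, prediction indices)
```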
If we use k-fold with k = 4 (k is a hyper-parameter; it is up to us what to put here), we would make four different parts out of this data set. Here I have given a different color to each one of these parts. What we would do then, in order to commence the training, is create an empty vector that has the same number of rows as the training data, but for now is empty. Then, for each one of the folds, we take a subset of the training data. In this case, we start with the red, yellow, and green parts. We train a model, then take the blue part and make predictions for it. We take these predictions and put them in the corresponding locations in the prediction vector, which was empty.

Now we repeat the same process, always using this rotation. So we now use the blue, yellow, and green parts to create a model, and we keep the red part for prediction. Again, we take these predictions and put them into the corresponding part of the prediction vector. And we repeat again with the yellow part, and then the green.

Something I need to mention is that the K-fold doesn't need to be sequential, as depicted here; it could have been shuffled. I did it this way in order to illustrate it better. Once we have finished and have generated a prediction for the whole training data, we can use the whole training data to fit one last model and make predictions for the test data. Another way we could have done this is, for each one of the four models that were making predictions for the validation data, to also make predictions for the whole test data at the same time, and after four models just take the average at the end; we would just divide the test predictions by four.
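Continuing the sketch from above (reusing X, y, folds, and rng), the rotation just described, followed by the final refit on the whole training data, might look like this; Ridge is just a stand-in base model.

```python
from sklearn.linear_model import Ridge

# Empty vector with the same number of rows as the training data.
oof_preds = np.zeros(len(X))

# Rotate through the folds: train on three parts, predict the held-out
# part, and store those predictions in the corresponding positions.
for train_idx, pred_idx in folds:
    model = Ridge().fit(X[train_idx], y[train_idx])
    oof_preds[pred_idx] = model.predict(X[pred_idx])

# Once predictions cover the whole training set, refit one last model
# on all the training data and predict the test set.
X_test = rng.normal(size=(25, 4))          # stand-in test data
test_preds = Ridge().fit(X, y).predict(X_test)
```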
These are two different ways to do it; I have found that the way I just explained works better with neural networks, and the method where you use the whole training data to generate predictions for the test works better with tree-based methods.

So, once we have finished the predictions for the test data, we can start again with another model. You generate an empty prediction vector, you stack it next to your previous one, and you repeat the same process. You essentially repeat this until you have finished with all the models of the same layer. Then this becomes your new training data set, and you begin all over again if you have a new layer. This is generally the concept (there is a sketch of this step after this passage).

So we could say that, in order to extend to many layers, we use this K-fold paradigm. However, in neural networks we normally have the notion of epochs; we have iterations which help us to re-calibrate the weights between the nodes. Here we don't have this option, given the way stacking works. However, we can introduce this ability to revisit the initial data through connections. A typical way to connect the nodes is the one we have already explored, where each node is directly connected with the nodes of the previous layer. Another way to do this is to say that a node is not only connected with the nodes of the directly previous layer, but with all nodes from any previous layer. To illustrate this better: if you remember the example with the headmaster, where he was using predictions from the teachers, he could also have been using predictions from the students at the same time. This can actually work quite well. And you can also feed in the initial data again: not just the predictions, you can take your initial x data set and append it to your predictions. This can work really well if you haven't made many models. That way, you get the chance to revisit the initial training data and try to capture more information.
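Here is a sketch of that assembly step, reusing X, y, and folds from above; the particular list of level-one models is an assumption for illustration, and the last line shows the restacking idea of appending the original features.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge

def get_oof(model, X, y, folds):
    """Out-of-fold predictions for one model, as in the loop above."""
    preds = np.zeros(len(X))
    for train_idx, pred_idx in folds:
        preds[pred_idx] = model.fit(X[train_idx], y[train_idx]).predict(X[pred_idx])
    return preds

level1_models = [Ridge(),
                 RandomForestRegressor(n_estimators=50, random_state=0),
                 GradientBoostingRegressor(random_state=0)]

# One prediction column per model; stacked side by side, the columns
# form the training data for the next layer.
X_level2 = np.column_stack([get_oof(m, X, y, folds) for m in level1_models])

# Restacking: append the original features so a later layer can
# revisit the initial data, not just the predictions.
X_level2_restacked = np.hstack([X_level2, X])
```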
And because we already have meta-models present, the new model tries to focus on where it can extract any new information, so in this kind of situation it works quite well. Also, this is very similar to the target encoding, or mean encoding, you have seen before, where you use some part of the data, let's say to encode a categorical column, and, given some cross-validation, you generate estimates of the target variable. Then you insert these into your training data. Okay, you don't stack it, as in you don't create a new column; essentially you replace one column with holdout predictions of your target variable, which is very similar. You have created an estimate of the target variable, and you are essentially inserting it into your training data, which is the same idea.
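As a final illustration, a minimal sketch of that out-of-fold mean encoding, assuming a pandas DataFrame with a hypothetical categorical column; the column names and toy data are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.RandomState(1)
df = pd.DataFrame({"category": rng.choice(["a", "b", "c"], size=100),
                   "target": rng.normal(size=100)})

# Out-of-fold mean encoding: each row's encoding comes from target means
# computed on the *other* folds, like holdout predictions in stacking.
encoded = pd.Series(np.nan, index=df.index)
for train_idx, pred_idx in KFold(n_splits=4, shuffle=True,
                                 random_state=0).split(df):
    fold_means = df.iloc[train_idx].groupby("category")["target"].mean()
    encoded.iloc[pred_idx] = df["category"].iloc[pred_idx].map(fold_means).values

# The encoded column replaces the raw categorical one, inserting
# holdout estimates of the target into the training data.
df["category"] = encoded
```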