[MUSIC]

In this video, we will talk about hyperparameter optimization for some tree-based models.

Nowadays, XGBoost and LightGBM have become the real gold standard. They are excellent implementations of a very versatile model, gradient boosted decision trees. There is also the CatBoost library. It appeared exactly at the time when we were preparing this course, so CatBoost hasn't had time to win people's hearts yet, but it looks very interesting and promising, so check it out. There is also a very nice implementation of the RandomForest and ExtraTrees models in sklearn. These models are powerful and can be used along with gradient boosting. And finally, there is a model called Regularized Greedy Forest. It showed very nice results in several competitions, but its implementation is very slow and hard to use; still, you can try it on small datasets.

Okay, what important parameters do we have in XGBoost and LightGBM? The two libraries have similar parameters, and we'll use the names from XGBoost. On the right half of the slide you will see the loosely corresponding parameter names from LightGBM. To understand the parameters, we'd better understand how XGBoost and LightGBM work, at least at a very high level. These models build decision trees one after another, gradually optimizing a given objective.

First, there are many parameters that control the tree-building process. max_depth is the maximum depth of a tree. Of course, the deeper a tree can be grown, the better it can fit a dataset, so increasing this parameter will lead to faster fitting of the train set. Depending on the task, the optimal depth can vary a lot; sometimes it is 2, sometimes it is 27. If you increase the depth and cannot get the model to overfit, that is, the model keeps getting better and better on the validation set as you increase the depth, it can be a sign that there are a lot of important interactions to extract from the data. In that case, it's better to stop tuning and try to generate some features. I would recommend starting with a max_depth of about 7. Also remember that as you increase the depth, the learning will take a longer time, so do not set the depth to very high values unless you are 100% sure you need it.
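For illustration, here is a minimal sketch of probing max_depth with the XGBoost sklearn wrapper; the dataset, split, and depth values are toy placeholders, not recommendations from the lecture:

```python
# Probe a few depths and compare train vs. validation score to see
# when the model starts to overfit. Toy data for a runnable example.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # assumes xgboost is installed

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

for depth in [2, 4, 7, 10]:            # start around 7, as suggested above
    model = XGBClassifier(max_depth=depth, n_estimators=200, learning_rate=0.1)
    model.fit(X_tr, y_tr)
    print(depth,
          model.score(X_tr, y_tr),     # train accuracy keeps rising with depth
          model.score(X_val, y_val))   # validation accuracy reveals overfitting
```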
In LightGBM, it is possible to control the number of leaves in the tree rather than the maximum depth. This is nice, since the resulting tree can be very deep but still have a small number of leaves and not overfit.

The subsample parameter controls the fraction of objects used when fitting a tree. It's a value between 0 and 1. One might think that it's always better to use all the objects, right? But in practice, it turns out that it's not. If only a fraction of objects is used at every iteration, the model is less prone to overfitting. So, using a fraction of objects, the model will fit the train set more slowly, but at the same time it will probably generalize better than an overfitted model. It works as a kind of regularization.

Similarly, we can consider only a fraction of features for each split; this is controlled by the parameters colsample_bytree and colsample_bylevel. Once again, if the model is overfitting, you can try lowering these parameters.

There are also various regularization parameters: min_child_weight, lambda, alpha and others. The most important one is min_child_weight. If we increase it, the model becomes more conservative. If we set it to 0, which is the minimum value for this parameter, the model is less constrained. In my experience, it's one of the most important parameters to tune in XGBoost and LightGBM. Depending on the task, I find optimal values to be 0, 5, 15, 300, so do not hesitate to try a wide range of values; it depends on the data.

Up to this point we were discussing the hyperparameters that are used to build a tree. Next, there are two very important parameters that are tightly connected: eta and num_round. Eta is essentially a learning rate, like in gradient descent, and num_round is how many learning steps we want to perform or, in other words, how many trees we want to build. With each iteration a new tree is built and added to the model with the learning rate eta. So in general, the higher the learning rate, the faster the model fits the train set, and this can lead to overfitting. And the more steps the model does, the better it fits. But there are several caveats here.
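As an illustration only, a parameter dictionary for the native XGBoost API touching the knobs above might look like this; the values are placeholders to be tuned, and the data is a stand-in so the snippet runs:

```python
import numpy as np
import xgboost as xgb

# Toy data just so the snippet runs; in practice use your own features/labels.
rng = np.random.RandomState(0)
X, y = rng.randn(1000, 20), rng.randint(0, 2, 1000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'binary:logistic',
    'max_depth': 7,            # tree depth, as discussed above
    'subsample': 0.8,          # fraction of objects used to build each tree
    'colsample_bytree': 0.8,   # fraction of features considered per tree
    'colsample_bylevel': 0.8,  # fraction of features considered per level
    'min_child_weight': 5,     # main regularization knob; try 0, 5, 15, 300
    'lambda': 1.0,             # L2 regularization
    'alpha': 0.0,              # L1 regularization
    'eta': 0.1,                # learning rate
}
model = xgb.train(params, dtrain, num_boost_round=300)
```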
It happens that with too high a learning rate the model will not fit at all; it will just not converge. So first, we need to find out if we are using a small enough learning rate. On the other hand, if the learning rate is too small, the model will have learned almost nothing even after a large number of rounds. But at the same time, a small learning rate often leads to better generalization. So the learning rate should be just right, so that the model generalizes and doesn't take forever to train.

The nice thing is that we can freeze eta to be reasonably small, say 0.1 or 0.01, and then find how many rounds we should train the model until it overfits. We usually use early stopping for this: we monitor the validation loss and exit the training when the loss starts to go up.

Now, when we have found the right number of rounds, we can do a trick that usually improves the score. We multiply the number of steps by a factor of alpha and, at the same time, divide eta by the same factor. For example, we double the number of steps and divide eta by 2. In this case, the learning will take twice as long, but the resulting model usually becomes better. It may happen that the other parameters will need to be adjusted too, but usually it's okay to leave them as they are.

Finally, you may want to use the random seed argument; many people recommend fixing the seed beforehand. I think it doesn't make too much sense to fix the seed in XGBoost, as every changed parameter will lead to a completely different model anyway. But I would use this parameter to verify that different random seeds do not change the training results much. Say, in the [INAUDIBLE] competition, one could jump 1,000 places up or down on the leaderboard just by training a model with a different random seed. If the random seed doesn't affect the model too much, good. Otherwise, I suggest you think one more time about whether it's a good idea to participate in that competition, as the results can be quite random. Or at least I suggest you adjust the validation scheme and account for the randomness.

All right, we're finished with gradient boosting.
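A sketch of this two-step procedure with the native XGBoost API, assuming a toy train/validation split and placeholder parameter values:

```python
import numpy as np
import xgboost as xgb

# Toy data so the snippet runs; substitute your own train/validation split.
rng = np.random.RandomState(0)
X, y = rng.randn(2000, 20), rng.randint(0, 2, 2000)
dtrain = xgb.DMatrix(X[:1500], label=y[:1500])
dval = xgb.DMatrix(X[1500:], label=y[1500:])

params = {'objective': 'binary:logistic', 'max_depth': 7, 'eta': 0.1}

# 1) Freeze eta and find the number of rounds with early stopping
#    on the validation loss.
model = xgb.train(params, dtrain, num_boost_round=5000,
                  evals=[(dval, 'val')], early_stopping_rounds=50,
                  verbose_eval=False)
best_rounds = model.best_iteration + 1

# 2) The trick: multiply the rounds by a factor and divide eta by the same factor.
factor = 2
params['eta'] /= factor
final_model = xgb.train(params, dtrain, num_boost_round=best_rounds * factor,
                        verbose_eval=False)
```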
Now let's get to RandomForest and ExtraTrees. In fact, ExtraTrees is just a more randomized version of RandomForest and has the same parameters, so I will say RandomForest, meaning both of the models.

RandomForest and ExtraTrees also build trees, one tree after another, but RandomForest builds each tree to be independent of the others. It means that having a lot of trees doesn't lead to overfitting for RandomForest, as opposed to gradient boosting.

In sklearn, the number of trees for a random forest is controlled by the n_estimators parameter. At the start, we may want to determine what number of trees is sufficient, that is, a number beyond which the result will not change much, the model will just take longer to fit. I usually first set n_estimators to a very small number, say 10, and see how long it takes to fit 10 trees on that data. If it is not too long, I then set n_estimators to a large value, say 300 (it actually depends), and fit the model. Then I plot how the validation error changes depending on the number of trees used. This plot usually looks like this: we have the number of trees on the x-axis and the accuracy score on the y-axis. We see here that about 50 trees already give a reasonable score, so we don't need to use more while tuning the parameters; it's pretty reliable to use 50 trees. Before submitting to the leaderboard, we can set n_estimators to a higher value just to be sure. You can find the code for this plot in the reading materials.
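The reading materials contain the original code; a rough, self-contained sketch of the same idea, assuming toy data and using sklearn's warm_start option to grow the forest incrementally, could look like this:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# warm_start=True lets us add trees to the same forest and score after each step,
# instead of refitting from scratch for every value of n_estimators.
rf = RandomForestClassifier(warm_start=True, n_jobs=-1, random_state=0)
sizes, scores = range(10, 301, 10), []
for n in sizes:
    rf.set_params(n_estimators=n)
    rf.fit(X_tr, y_tr)
    scores.append(rf.score(X_val, y_val))

plt.plot(list(sizes), scores)            # accuracy usually flattens out early
plt.xlabel('number of trees')
plt.ylabel('validation accuracy')
plt.show()
```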
Similarly to XGBoost, there is a max_depth parameter that controls the depth of the trees. But unlike in XGBoost, it can be set to None, which corresponds to unlimited depth. This can be very useful when the features in the dataset have repeated values and important interactions; in other cases, a model with unconstrained depth will overfit immediately. I recommend you start with a depth of about 7 for random forest. Usually the optimal depth for random forests is higher than for gradient boosting, so do not hesitate to try depths of 10, 20, and higher.

max_features is similar to the colsample parameters from XGBoost. The fewer features are used to decide a split, the faster the training, but on the other hand, you don't want to use too few features.

And min_samples_leaf is a regularization parameter similar to min_child_weight from XGBoost and the same as min_data_in_leaf from LightGBM.

For the RandomForest classifier, we can select the criterion used to evaluate a split in the tree with the criterion parameter. It can be either Gini or Entropy. To choose one, we should just try both and pick the best performing one. In my experience Gini is better more often, but sometimes Entropy wins.

We can also fix the random seed using the random_state parameter, if we want. And finally, do not forget to set the n_jobs parameter to the number of cores you have, as by default RandomForest from sklearn uses only one core for some reason.

So in this video, we were talking about various hyperparameters of gradient boosted decision trees and random forest. In the following video, we'll discuss neural networks and linear models.

[MUSIC]
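As a reference for the criterion choice discussed above, a minimal sketch (toy data and placeholder settings) that tries both criteria and prints their cross-validated scores:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Try both split criteria and keep whichever cross-validates better.
for criterion in ['gini', 'entropy']:
    score = cross_val_score(
        RandomForestClassifier(n_estimators=100, criterion=criterion,
                               min_samples_leaf=5, n_jobs=-1, random_state=0),
        X, y, cv=5).mean()
    print(criterion, round(score, 4))
```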