In this video we'll briefly discuss neural network libraries, and then we'll see how to tune hyperparameters for neural networks and linear models.

There are so many frameworks: Keras, TensorFlow, MXNet, PyTorch. The choice is really personal; all frameworks implement more than enough functionality for competition tasks. Keras is for sure the most popular on Kaggle and has a very simple interface; it takes only several dozen lines to train a network using Keras. TensorFlow is extensively used by companies for production, and PyTorch is very popular in the deep learning research community. I personally recommend you try PyTorch and Keras, as they are the most transparent and easy-to-use frameworks.

Now, how do you tune hyperparameters in a network? We'll talk here only about dense neural networks, that is, networks that consist only of fully connected layers. Say we start with a three-layer neural network: what do we expect to happen if we increase the number of neurons per layer? The network can now learn more complex decision boundaries, so it will overfit faster. The same should happen when the number of layers is increased, but due to optimization problems the learning can even stop converging. Anyway, if you think your network is not powerful enough, you can try to add another layer and see what happens.

My recommendation here is to start with something very simple, say 1 or 2 layers and 64 units per layer. Debug the code and make sure the training and validation losses go down. Then try to find a configuration that is able to overfit the training set, just as another sanity check. After that, it is time to tune something in the network.

One of the crucial choices in a neural network is the optimization method. Broadly speaking, we can pick either vanilla stochastic gradient descent with momentum or one of the modern adaptive methods like Adam, Adadelta, Adagrad, and so on. On this slide, the adaptive methods are colored green, as compared to SGD in red. I want to show here that adaptive methods really do allow you to fit the training set faster. But in my experience, they also lead to overfitting.
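As a rough illustration of the "start simple" advice, here is a minimal Keras sketch: a hypothetical two-layer network with 64 units per layer on made-up data. The data shapes, learning rate, and epoch count are assumptions for illustration, not part of the lecture.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical tabular data: 1000 objects, 20 features, binary target.
X_train = np.random.rand(1000, 20).astype("float32")
y_train = np.random.randint(0, 2, size=1000)

# Start simple: 2 fully connected layers with 64 units each.
model = keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# Plain SGD with momentum tends to generalize well; swapping in
# keras.optimizers.Adam(1e-3) usually fits the training set faster
# but, as noted above, may overfit more.
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Sanity checks: training and validation losses should go down, and a slightly
# bigger configuration should be able to overfit the training set.
model.fit(X_train, y_train, validation_split=0.2, epochs=10, batch_size=64)
```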
Plain old stochastic gradient descent converges slower, but the trained network usually generalizes better. Adaptive methods are useful, but mostly in settings other than classification and regression.

Now here is a question for you, about the batch size: what should we expect when increasing the batch size with the other hyperparameters fixed? In fact, it turns out that a huge batch size leads to more overfitting. Say, a batch of 500 objects is large in my experience. I recommend picking a value around 32 or 64. Then, if you see the network is still overfitting, try to decrease the batch size; if it is underfitting, try to increase it. Note that if the number of epochs is fixed, then a network with the batch size reduced by a factor of 2 gets updated twice as many times as the original network. So take this into consideration: maybe you need to reduce the number of epochs, or at least adjust it somehow. The batch size also should not be too small, or the gradient will become too noisy.

Same as in gradient boosting, we need to set a proper learning rate. When the learning rate is too high, the network will not converge, and with too small a learning rate the network will learn forever. The learning rate should be neither too high nor too low, and the optimal learning rate depends on the other parameters. I usually start with a large learning rate, say 0.1, lower it until I find a value with which the network converges, and then tune it further from there.

Interestingly, there is a connection between the batch size and the learning rate. It is theoretically grounded for a specific type of models, but some people use it as a rule of thumb with neural networks. The connection is the following: if you increase the batch size by a factor of alpha, you can also increase the learning rate by the same factor. But remember that the larger the batch size, the more your network is prone to overfitting, so you need good regularization here.

Some time ago, people mostly used L2 and L1 regularization for the weights. Nowadays, most people use dropout regularization. So whenever you see a network overfitting, first try adding a dropout layer.
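Picking up the batch size, learning rate, and dropout points above, here is a small sketch of the linear scaling rule and of a dropout layer placed near the end of the network. The base values and layer sizes are illustrative assumptions, not numbers from the lecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Rule of thumb from the lecture: if the batch size grows by a factor of alpha,
# the learning rate can be scaled up by the same factor.
base_lr, base_batch = 0.01, 32
batch_size = 64                       # alpha = 2
lr = base_lr * (batch_size / base_batch)

model = keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),              # dropout close to the end of the network
    layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=lr, momentum=0.9),
    loss="binary_crossentropy",
)
# model.fit(X_train, y_train, batch_size=batch_size, epochs=10)
```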
You can vary the dropout probability and the place where you insert the dropout layer. Usually people add a dropout layer closer to the end of the network, but it's okay to add some dropout to every layer; that also works. Dropout helps the network find the features that really matter. What never worked for me is to have dropout as the very first layer, immediately after the data layer: this way some information is lost completely at the very beginning of the network, and we observe a performance degradation.

An interesting regularization technique that we used in one competition is what we call static dropconnect. Recall that usually we have an input layer densely connected to, say, 128 units. We will instead use a first hidden layer with a very large number of units, say 4,096. This is a huge network for a usual competition, and it will overfit badly. But now, to regularize it, we randomly drop 99% of the connections between the input layer and the first hidden layer. We call it static dropconnect because in the original dropconnect random connections are dropped at every learning iteration, while here we fix the connectivity pattern of the network for the whole learning process. So you see the point: we increase the number of hidden units, but the number of parameters in the first hidden layer remains small. Notice that the weight matrix of the second layer still becomes huge, but that turns out to be okay in practice. This is a very powerful regularization. Moreover, networks with different connectivity patterns mix much better in an ensemble than networks without static dropconnect.

All right, the last class of models to discuss is linear models. A carefully tuned LightGBM would probably beat support vector machines even on a large, sparse data set, but SVMs require almost no tuning, which is truly beneficial. SVMs for classification and regression are implemented in sklearn as wrappers around algorithms from the libraries libLinear and libSVM. The latest versions of libLinear and libSVM support multicore computation, but unfortunately it is not possible to use the multicore version from sklearn, so we need to compile these libraries manually to use this option. I haven't seen anyone use kernel SVMs lately, so in this video we will talk only about linear SVMs.
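One possible way to implement the static dropconnect trick described above is to fix a random binary mask over the first weight matrix and re-apply it after every update, for example via a Keras weight constraint. This is only a sketch of the idea under that assumption, not the code used in the competition; the input size and the 1% keep rate are illustrative.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class StaticMask(keras.constraints.Constraint):
    """Keeps a fixed random subset of connections; the rest are forced to zero."""
    def __init__(self, mask):
        self.mask = tf.constant(mask, dtype=tf.float32)

    def __call__(self, w):
        # Re-applied after every weight update, so dropped connections never
        # come back: the connectivity pattern stays fixed for the whole training.
        return w * self.mask

n_features, n_hidden, keep_prob = 512, 4096, 0.01
mask = (np.random.rand(n_features, n_hidden) < keep_prob).astype("float32")

model = keras.Sequential([
    layers.Input(shape=(n_features,)),
    # Huge first hidden layer, but only ~1% of its input connections are active,
    # so the number of parameters in this layer stays small.
    layers.Dense(n_hidden, activation="relu",
                 kernel_constraint=StaticMask(mask)),
    layers.Dense(1, activation="sigmoid"),
])
```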
In sklearn we can also find logistic regression and linear regression with various regularization options, and also SGDClassifier and SGDRegressor; we've already mentioned them while discussing metrics. For data sets that do not fit into memory, we can use Vowpal Wabbit. It implements learning of linear models in an online fashion: it reads the data row by row directly from the hard drive and never loads the whole data set into memory, thus allowing us to learn on very large data sets. A method of online learning for linear models called follow the regularized leader, or FTRL for short, was particularly popular some time ago. It is implemented in Vowpal Wabbit, but there are also lots of implementations in pure Python.

The hyperparameters we usually need to tune for linear models are the L2 and L1 regularization weights. Once again, regularization is shown in red, because the higher the regularization weight, the harder it is for the model to learn anything. But note that the parameter C in SVMs is inversely proportional to the regularization weight, so the dynamics are opposite. In fact, we do not need to think about the exact meaning of the parameter when there is only one parameter, right? We just try several values and pick the one that works best. For SVMs, I usually start with a very small C, say 10 to the power of minus 6, and then increase it, multiplying by a factor of 10 each time. I start from small values because the larger the parameter C is, the longer the training takes.

Which type of regularization, L1 or L2, should you choose? Actually, my answer is: try both. To my mind they are quite similar, and one benefit L1 can give us is weight sparsity, so the sparsity pattern can be used for feature selection.

A general piece of advice I want to give here: do not spend too much time tuning hyperparameters, especially when the competition has only just begun. You cannot win a competition by tuning parameters. Appropriate features, hacks, leaks, and insights will give you much more than a carefully tuned model built on default features. I also advise you to be patient; that was my personal mistake several times.
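Vowpal Wabbit itself is driven from the command line; as a rough Python-only stand-in for the same out-of-core, online idea, sklearn's SGDClassifier can be fed the data in chunks with partial_fit. The file name, column names, and chunk size below are placeholders.

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Online-style learning: read the data in chunks instead of loading it all.
clf = SGDClassifier(loss="log_loss", penalty="l2", alpha=1e-4)  # "log" in older sklearn

# "train.csv" and the "target" column are placeholders for your data.
for chunk in pd.read_csv("train.csv", chunksize=10_000):
    X = chunk.drop(columns=["target"]).to_numpy()
    y = chunk["target"].to_numpy()
    clf.partial_fit(X, y, classes=[0, 1])
```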
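And to make the C-tuning routine above concrete, a minimal sketch with sklearn's LinearSVC, starting from a tiny C and multiplying by 10; the data, cross-validation setup, and scoring are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Placeholder training data.
X = np.random.rand(500, 50)
y = np.random.randint(0, 2, size=500)

# Start from a very small C (strong regularization) and grow it 10x at a time;
# larger C means weaker regularization and longer training.
for C in [10 ** k for k in range(-6, 3)]:
    score = cross_val_score(LinearSVC(C=C), X, y, cv=3).mean()
    print(f"C={C:g}  accuracy={score:.3f}")
```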
I hated to spend more than ten minutes on training a model, and I was amazed how much the models could improve if I let them train for a longer time.

And finally, average everything. When submitting, learn five models starting from different random initializations and average their predictions. It helps a lot, actually, and some people average not only over random seeds, but also over other parameters around an optimal value. For example, if the optimal depth for XGBoost is 5, we can average three XGBoosts with depths 4, 5, and 6.

To sum up, in this lecture we discussed the general pipeline for hyperparameter optimization, and we saw in particular which hyperparameters are important for several classes of models: gradient boosting decision trees, random forests and extra trees, neural networks, and linear models. I hope you found something interesting in this lecture, and see you later.
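To close, here is a small sketch of the averaging tip above, using the xgboost package since XGBoost was the example in the lecture. The data, number of trees, and subsampling settings are made-up assumptions; subsampling is enabled only so that the random seed actually changes the model.

```python
import numpy as np
from xgboost import XGBClassifier

# Placeholder data.
X_train = np.random.rand(1000, 20)
y_train = np.random.randint(0, 2, size=1000)
X_test = np.random.rand(200, 20)

# Average models over random seeds and over depths around the optimum (here 5).
preds = []
for depth in (4, 5, 6):
    for seed in range(5):
        model = XGBClassifier(
            max_depth=depth,
            n_estimators=200,
            subsample=0.9,            # randomness, so different seeds differ
            colsample_bytree=0.9,
            random_state=seed,
        )
        model.fit(X_train, y_train)
        preds.append(model.predict_proba(X_test)[:, 1])

final_prediction = np.mean(preds, axis=0)
```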