[MUSIC] In this video we'll briefly discuss neural network libraries, and then we'll see how to tune hyperparameters for neural networks and linear models. There are many frameworks: Keras, TensorFlow, MXNet, PyTorch. The choice is really personal; all frameworks implement more than enough functionality for competition tasks. Keras is for sure the most popular on Kaggle and has a very simple interface: it takes only several dozen lines to train a network using Keras. TensorFlow is extensively used by companies for production, and PyTorch is very popular in the deep learning research community. I personally recommend you try PyTorch and Keras, as they are the most transparent and easy-to-use frameworks.

Now, how do you tune hyperparameters in a network? We will talk here only about dense neural networks, that is, networks that consist only of fully connected layers. Say we start with a three-layer neural network; what do we expect to happen if we increase the number of neurons per layer? The network can now learn more complex decision boundaries, and so it will overfit faster. The same should happen when the number of layers is increased, but due to optimization problems, learning can even stop converging. Anyway, if you think your network is not powerful enough, you can try to add another layer and see what happens. My recommendation here is to start with something very simple, say 1 or 2 layers and 64 units per layer (see the sketch below). Debug the code, make sure the training and validation losses go down. Then try to find a configuration that is able to overfit the training set, just as another sanity check. After that, it is time to tune something in the network.

One of the crucial parts of a neural network is the optimization method. Broadly speaking, we can pick either vanilla stochastic gradient descent with momentum or one of the modern adaptive methods like Adam, Adadelta, Adagrad and so on. On this slide, the adaptive methods are colored in green, as compared to SGD in red. I want to show here that adaptive methods really do allow you to fit the training set faster, but in my experience they also lead to overfitting. Plain old stochastic gradient descent converges slower, but the trained network usually generalizes better. Adaptive methods are still useful, but in settings other than classification and regression.

Now here is a question for you, about the batch size: what should we expect when increasing the batch size with the other hyperparameters fixed? In fact, it turns out that a huge batch size leads to more overfitting. Say, a batch of 500 objects is large in my experience; I recommend picking a value around 32 or 64. Then, if you see the network is still overfitting, try to decrease the batch size; if it is underfitting, try to increase it. Note that if the number of epochs is fixed, then a network with a batch size reduced by a factor of 2 gets updated twice as many times compared to the original network. So take this into consideration: maybe you need to reduce the number of epochs or at least somehow adjust it. The batch size also should not be too small, or the gradients will be too noisy.

Same as in gradient boosting, we need to set a proper learning rate. When the learning rate is too high, the network will not converge, and with too small a learning rate the network will learn forever. The learning rate should be not too high and not too low, so the optimal learning rate depends on the other parameters.
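To make this starting point concrete, here is a minimal Keras sketch of such a dense network, showing where the number of layers and units, the optimizer choice (SGD with momentum versus Adam), the learning rate, and the batch size appear. This is only an illustration under my own assumptions: the synthetic data, loss, and exact numbers are placeholders, not values from the lecture.

```python
import numpy as np
from tensorflow import keras

def build_model(n_features, lr=0.01, use_adam=False):
    # a simple starting point: 2 fully connected layers with 64 units each
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1),  # single output for a regression task
    ])
    # adaptive methods (here Adam) fit the training set faster,
    # while plain SGD with momentum often generalizes better
    opt = (keras.optimizers.Adam(learning_rate=lr) if use_adam
           else keras.optimizers.SGD(learning_rate=lr, momentum=0.9))
    model.compile(optimizer=opt, loss="mse")
    return model

# placeholder data, just to make the sketch runnable
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 20)), rng.normal(size=(1000, 1))
X_valid, y_valid = rng.normal(size=(200, 20)), rng.normal(size=(200, 1))

model = build_model(X_train.shape[1], lr=0.1)  # start with a high learning rate, lower it if training diverges
model.fit(X_train, y_train,
          validation_data=(X_valid, y_valid),
          batch_size=32,   # 32-64 is a reasonable default batch size
          epochs=10)       # remember to adjust the number of epochs when changing the batch size
```

With a setup like this you can check that both the training and validation losses go down, and that the network is able to overfit the training set once you increase its capacity.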
I usually start with a huge learning rate, say 0.1, and try to lower it until I find one with which the network converges, and then I try to tune it further. Interestingly, there is a connection between the batch size and the learning rate. It is theoretically grounded only for a specific type of models, but some people use it as a rule of thumb with neural networks. The connection is the following: if you increase the batch size by a factor of alpha, you can also increase the learning rate by the same factor. But remember that the larger the batch size, the more your network is prone to overfitting, so you need good regularization here.

Some time ago, people mostly used L2 and L1 regularization for weights. Nowadays, most people use dropout regularization. So whenever you see a network overfitting, try adding a dropout layer first. You can vary the dropout probability and the place where you insert the dropout layer. Usually people add the dropout layer closer to the end of the network, but it's okay to add some dropout to every layer; that also works. Dropout helps the network find the features that really matter. What never worked for me is to have dropout as the very first layer, immediately after the data layer: this way some information is lost completely at the very beginning of the network, and we observe performance degradation.

An interesting regularization technique that we used in the [UNKNOWN] competition is static dropconnect, as we call it (a code sketch is given below). Recall that usually we have an input layer densely connected to, say, 128 units. We will instead use a first hidden layer with a very large number of units, say 4,096. This is a huge network for a usual competition, and it will overfit badly. But now, to regularize it, we drop 99% of the connections between the input layer and the first hidden layer at random. We call it static dropconnect because in the original dropconnect we would drop random connections at every learning iteration, while here we fix the connectivity pattern of the network for the whole learning process. So you see the point: we increase the number of hidden units, but the number of effective parameters in the first hidden layer remains small. Notice that the weight matrix of the second layer becomes huge anyway, but it turns out to be okay in practice. This is a very powerful regularization, and moreover, averaging networks with different connectivity patterns works much better than using networks without static dropconnect.

All right, the last class of models to discuss are linear models. True, a carefully tuned LightGBM would probably beat support vector machines even on a large, sparse data set, but SVMs require almost no tuning, which is truly beneficial. SVMs for classification and regression are implemented in Sklearn as wrappers around algorithms from the libLinear and libSVM libraries. The latest versions of libLinear and libSVM support multicore computations, but unfortunately it is not possible to use the multicore version in Sklearn, so we need to compile these libraries manually to use this option. And I haven't heard of anyone using kernel SVMs lately, so in this video we will talk only about linear SVMs. In Sklearn we can also find logistic and linear regression with various regularization options, and also SGDClassifier and SGDRegressor; we've already mentioned them while discussing metrics. For data sets that do not fit in memory, we can use Vowpal Wabbit, which implements learning of linear models in an online fashion.
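Here is one possible sketch of the static dropconnect idea described above: a dense layer whose input-to-hidden connections are masked by a fixed random binary mask. This is my own illustration, assuming a TensorFlow/Keras setup, not the actual competition code; the 4,096 units and 1% of kept connections are just the example numbers from the lecture, and `n_features` is a placeholder.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

class StaticDropConnect(keras.layers.Layer):
    """Dense layer whose input-to-hidden connections are masked once at build
    time; the mask stays fixed for the whole training process."""
    def __init__(self, units, keep_prob=0.01, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.keep_prob = keep_prob

    def build(self, input_shape):
        in_dim = int(input_shape[-1])
        self.kernel = self.add_weight(name="kernel", shape=(in_dim, self.units),
                                      initializer="glorot_uniform")
        self.bias = self.add_weight(name="bias", shape=(self.units,),
                                    initializer="zeros")
        # fixed binary mask: about 99% of the connections are dropped and never restored
        mask = (np.random.rand(in_dim, self.units) < self.keep_prob).astype("float32")
        self.mask = tf.constant(mask)

    def call(self, x):
        # masked weights receive zero gradient, so they never participate in learning
        return tf.nn.relu(tf.matmul(x, self.kernel * self.mask) + self.bias)

# usage: a huge first hidden layer that stays cheap because of the sparse fixed mask
n_features = 20  # placeholder feature count
model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    StaticDropConnect(4096, keep_prob=0.01),
    keras.layers.Dense(1, activation="sigmoid"),
])
```

Because every random mask defines a different connectivity pattern, training several such networks and averaging their predictions is a natural next step.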
Vowpal Wabbit reads the data row by row directly from the hard drive and never loads the whole data set into memory, thus allowing learning on very large data sets. A method of online learning for linear models called follow the regularized leader, or FTRL for short, was particularly popular some time ago. It is implemented in Vowpal Wabbit, but there are also lots of implementations in pure Python.

The hyperparameters we usually need to tune for linear models are the L2 and L1 regularization weights. Once again, regularization is denoted with red color because the higher the regularization weight is, the more the model struggles to learn something. But note that the parameter C in SVM is inversely proportional to the regularization weight, so the dynamics are opposite. In fact, we do not need to think about the exact meaning of the parameter when there is only one of them, right? We just try several values and find the one that works best. For SVM, I usually start with a very small C, say 10 to the power of minus 6, and then increase it, multiplying by a factor of 10 each time (see the sketch below). I start from small values because the larger the parameter C is, the longer the training takes. Which type of regularization, L1 or L2, should you choose? Actually, my answer is: try both. To my mind they are quite similar, and one benefit L1 can give us is weight sparsity, so the sparsity pattern can be used for feature selection.

A general piece of advice I want to give here: do not spend too much time on tuning hyperparameters, especially when the competition has only begun. You cannot win a competition by tuning parameters. Appropriate features, hacks, leaks, and insights will give you much more than a carefully tuned model built on default features. I also advise you to be patient. It was my personal mistake several times: I hated to spend more than ten minutes on training models, and I was amazed how much the models could improve if I let them train for a longer time. And finally, average everything. When submitting, learn five models starting from different random initializations and average their predictions. It actually helps a lot, and some people average not only over the random seed, but also over other parameters around an optimal value. For example, if the optimal depth for XGBoost is 5, we can average 3 XGBoosts with depths 3, 4, and 5. Well, it would be better if we averaged 3 XGBoosts with depths 4, 5, and 6.

Finally, in this lecture we discussed the general pipeline for hyperparameter optimization, and we saw, in particular, which hyperparameters are important for several models: gradient boosting decision trees, random forests and extra trees, neural networks, and linear models. I hope you found something interesting in this lecture, and see you later. [MUSIC]
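To make the C-tuning procedure above concrete, here is a minimal sklearn sketch that walks C over powers of ten, starting from a very small value, and keeps the one with the best validation score. The synthetic data and the accuracy metric are placeholder assumptions, not part of the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# placeholder data, just to make the sketch runnable
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=0)

best_C, best_score = None, 0.0
for C in [10.0 ** p for p in range(-6, 3)]:  # 1e-6, 1e-5, ..., 1e2
    model = LinearSVC(C=C)                   # small C means strong regularization; training is faster
    model.fit(X_train, y_train)
    score = accuracy_score(y_valid, model.predict(X_valid))
    print(f"C={C:g}  validation accuracy={score:.4f}")
    if score > best_score:
        best_C, best_score = C, score
```

The same loop works for LogisticRegression, which also exposes a C parameter, while SGDClassifier uses alpha, which is the regularization weight itself and therefore moves in the opposite direction.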