[MUSIC] In this video, we will talk about hyperparameter optimization for tree-based models. Nowadays, XGBoost and LightGBM have become the gold standard. They are excellent implementations of a very versatile model, gradient boosted decision trees. There is also the CatBoost library; it appeared exactly at the time when we were preparing this course, so CatBoost hasn't had time to win people's hearts yet. But it looks very interesting and promising, so check it out. There are very nice implementations of the RandomForest and ExtraTrees models in sklearn. These models are powerful and can be used along with gradient boosting. And finally, there is a model called Regularized Greedy Forest. It showed very nice results in several competitions, but its implementation is very slow and hard to use, so you may want to try it only on small datasets.

Okay, what important parameters do we have in XGBoost and LightGBM? The two libraries have similar parameters, and we'll use the names from XGBoost. On the right half of the slide, you will see loosely corresponding parameter names from LightGBM. To understand the parameters, we should understand how XGBoost and LightGBM work, at least at a very high level. These models build decision trees one after another, gradually optimizing a given objective.

First, there are many parameters that control the tree building process. max_depth is the maximum depth of a tree. Of course, the deeper a tree can be grown, the better it can fit a dataset, so increasing this parameter will lead to faster fitting to the train set. Depending on the task, the optimal depth can vary a lot; sometimes it is 2, sometimes it is 27. If you increase the depth and cannot get the model to overfit, that is, the model keeps getting better on the validation set as you increase the depth, it can be a sign that there are a lot of important interactions to extract from the data. So it's better to stop tuning and try to generate some features. I would recommend starting with a max_depth of about 7. Also remember that as you increase the depth, the learning will take a longer time, so do not set the depth to very high values unless you are 100% sure you need it. In LightGBM, it is possible to control the number of leaves in the tree rather than the maximum depth. This is nice, since the resulting tree can be very deep but have a small number of leaves and not overfit.

The subsample parameter controls the fraction of objects to use when fitting a tree. It's a value between 0 and 1. One might think that it's always better to use all the objects, right? But in practice, it turns out that it's not. If only a fraction of objects is used at every iteration, the model is less prone to overfitting. So using a fraction of objects, the model will fit the train set more slowly, but at the same time it will probably generalize better than an overfitted model. So it works as a kind of regularization. Similarly, we can consider only a fraction of features for each split; this is controlled by the parameters colsample_bytree and colsample_bylevel. Once again, if the model is overfitting, you can try to lower these parameters.

There are also various regularization parameters: min_child_weight, lambda, alpha, and others. The most important one is min_child_weight. If we increase it, the model will become more conservative. If we set it to 0, which is the minimum value for this parameter, the model will be less constrained.
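To make these names concrete, here is a minimal sketch of how the tree-building and regularization parameters above might be passed to the xgboost Python package. The specific values are only illustrative starting points, the objective is just an example, and dtrain and dvalid are assumed to be DMatrix objects you have already prepared.

    import xgboost as xgb

    # dtrain and dvalid are assumed to be xgboost.DMatrix objects built from your data.
    params = {
        'objective': 'binary:logistic',  # example objective, depends on your task
        'max_depth': 7,                  # maximum tree depth; start around 7 and tune
        'subsample': 0.8,                # fraction of objects used to build each tree
        'colsample_bytree': 0.8,         # fraction of features considered per tree
        'colsample_bylevel': 0.8,        # fraction of features considered per level
        'min_child_weight': 5,           # main regularization knob; try a wide range
        'lambda': 1.0,                   # L2 regularization
        'alpha': 0.0,                    # L1 regularization
    }
    model = xgb.train(params, dtrain, num_boost_round=100,
                      evals=[(dvalid, 'valid')])

In LightGBM, the loosely corresponding names would be num_leaves, bagging_fraction, feature_fraction and min_data_in_leaf.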
In my experience, min_child_weight is one of the most important parameters to tune in XGBoost and LightGBM. Depending on the task, I find optimal values to be 0, 5, 15, or even 300, so do not hesitate to try a wide range of values; it depends on the data.

Up to this point we were discussing hyperparameters that are used to build a tree. Next, there are two very important parameters that are tightly connected: eta and num_rounds. Eta is essentially a learning rate, like in gradient descent, and num_rounds is how many learning steps we want to perform, or in other words, how many trees we want to build. With each iteration a new tree is built and added to the model with a learning rate eta. In general, the higher the learning rate, the faster the model fits the train set, and it can lead to overfitting. And the more steps the model does, the better it fits. But there are several caveats here. It happens that with too high a learning rate the model will not fit at all, it will just not converge. So first, we need to find out if we are using a small enough learning rate. On the other hand, if the learning rate is too small, the model will learn nothing even after a large number of rounds. But at the same time, a small learning rate often leads to better generalization. So the learning rate should be just right, so that the model generalizes and doesn't take forever to train.

The nice thing is that we can freeze eta to be reasonably small, say 0.1 or 0.01, and then find how many rounds we should train the model until it overfits. We usually use early stopping for this: we monitor the validation loss and stop the training when the loss starts to go up, as sketched in the code below. Once we have found the right number of rounds, we can do a trick that usually improves the score: we multiply the number of steps by a factor of alpha and at the same time divide eta by the same factor alpha. For example, we double the number of steps and divide eta by 2. In this case, the learning will take twice as long, but the resulting model usually becomes better. It may happen that the other parameters will need to be adjusted too, but usually it's okay to leave them as is.

Finally, you may want to use the random seed argument; many people recommend fixing the seed beforehand. I think it doesn't make too much sense to fix the seed in XGBoost, as any changed parameter will lead to a completely different model anyway. But I would use this parameter to verify that different random seeds do not change the training results much. Say, in the [INAUDIBLE] competition, one could jump 1,000 places up or down on the leaderboard just by training a model with a different random seed. If the random seed doesn't affect the model too much, good. Otherwise, I suggest you think one more time about whether it's a good idea to participate in that competition, as the results can be quite random. Or at least I suggest you adjust the validation scheme and account for the randomness.

All right, we're finished with gradient boosting. Now let's get to RandomForest and ExtraTrees. In fact, ExtraTrees is just a more randomized version of RandomForest and has the same parameters, so I will say RandomForest meaning both of the models. Like gradient boosting, RandomForest builds trees one after another, but RandomForest builds each tree to be independent of the others. It means that having a lot of trees doesn't lead to overfitting for RandomForest, as opposed to gradient boosting. In sklearn, the number of trees for random forest is controlled by the n_estimators parameter.
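Coming back to the eta and num_rounds discussion from the gradient boosting part above, here is a rough sketch of that workflow with the xgboost package: fix a small eta, find the number of rounds with early stopping, then scale the rounds up and eta down by the same factor. The values and the objective are placeholders, and dtrain and dvalid are assumed to be prepared DMatrix objects.

    import xgboost as xgb

    # dtrain and dvalid are assumed to be prepared xgboost.DMatrix objects.
    params = {'objective': 'binary:logistic', 'max_depth': 7, 'eta': 0.1, 'seed': 42}

    # Step 1: with a reasonably small eta, let early stopping find the number of rounds.
    model = xgb.train(params, dtrain, num_boost_round=5000,
                      evals=[(dvalid, 'valid')],
                      early_stopping_rounds=50)
    best_rounds = model.best_iteration

    # Step 2: the usual trick -- multiply the rounds by a factor and divide eta by it.
    factor = 2
    params['eta'] /= factor
    final_model = xgb.train(params, dtrain, num_boost_round=best_rounds * factor,
                            evals=[(dvalid, 'valid')])

Here early_stopping_rounds stops training once the validation loss has not improved for 50 consecutive rounds, which is one common way to detect the onset of overfitting.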
For random forest, at the start we may want to determine what number of trees is sufficient. That is, if we use more than that, the result will not change much, but the model will take longer to fit. I usually first set n_estimators to a very small number, say 10, and see how long it takes to fit 10 trees on that data. If it is not too long, I then set n_estimators to a large value, say 300 (but it actually depends), and fit the model. Then I plot how the validation error changes depending on the number of trees used. This plot usually looks like this: we have the number of trees on the x-axis and the accuracy score on the y-axis. We see here that about 50 trees already give a reasonable score, and we don't need to use more while tuning the parameters. It's pretty reliable to use 50 trees. Before submitting to the leaderboard, we can set n_estimators to a higher value just to be sure. You can find the code for this plot in the reading materials; a rough sketch of similar code is also shown below.

Similarly to XGBoost, there is a parameter max_depth that controls the depth of the trees. But differently from XGBoost, it can be set to None, which corresponds to unlimited depth. This can be very useful when the features in the dataset have repeated values and important interactions. In other cases, a model with unconstrained depth will overfit immediately. I recommend you start with a depth of about 7 for random forest. Usually the optimal depth for random forests is higher than for gradient boosting, so do not hesitate to try a depth of 10, 20, or higher.

max_features is similar to the colsample parameters from XGBoost. The fewer features are used to decide a split, the faster the training, but on the other hand, you don't want to use too few features. And min_samples_leaf is a regularization parameter similar to min_child_weight from XGBoost and the same as min_data_in_leaf from LightGBM.

For the Random Forest classifier, we can select a criterion to evaluate a split in the tree with the criterion parameter. It can be either Gini or Entropy. To choose one, we should just try both and pick the best performing one. In my experience, Gini is better more often, but sometimes Entropy wins. We can also fix the random seed using the random_state parameter if we want. And finally, do not forget to set the n_jobs parameter to the number of cores you have, as by default RandomForest from sklearn uses only one core for some reason.

So in this video, we were talking about various hyperparameters of gradient boosted decision trees and random forest. In the following video, we'll discuss neural networks and linear models. [MUSIC]
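The exact code for the validation-score-versus-number-of-trees plot is in the reading materials; as a rough sketch of how such a plot could be produced, one option is sklearn's warm_start flag, which keeps already-built trees and only adds new ones on each call to fit. X_train, y_train, X_valid and y_valid are assumed to be prepared beforehand.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    import matplotlib.pyplot as plt

    # X_train, y_train, X_valid, y_valid are assumed to be prepared beforehand.
    rf = RandomForestClassifier(n_estimators=10, warm_start=True,
                                n_jobs=-1, random_state=42)

    tree_counts = list(range(10, 301, 10))
    scores = []
    for n in tree_counts:
        rf.set_params(n_estimators=n)
        rf.fit(X_train, y_train)   # only the newly requested trees are built
        scores.append(accuracy_score(y_valid, rf.predict(X_valid)))

    plt.plot(tree_counts, scores)
    plt.xlabel('number of trees')
    plt.ylabel('validation accuracy')
    plt.show()

The curve typically flattens out after some point, and that point is the number of trees that is enough to use while tuning the other parameters.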