[MUSIC] In this and the next video, we will discuss ways to optimize classification metrics. In this video we will cover logloss and accuracy, and in the next one, AUC and quadratic weighted kappa.

Let's start with logloss. Logloss for classification is like MSE for regression: it is implemented everywhere, and all we need to do is find out which arguments should be passed to a library to make it use logloss for training (the first snippet below shows how this is done for a few common libraries). There are a huge number of libraries to try, like XGBoost, LightGBM, logistic regression and the [INAUDIBLE] classifier from sklearn, and Vowpal Wabbit. All neural nets, by default, optimize logloss for classification. Random forest classifier predictions turn out to be quite bad in terms of logloss, but there is a way to make them better: we can calibrate the predictions to better fit logloss.

We've mentioned several times that logloss requires the model to output posterior probabilities, but what does that mean? It means that if we take all the points that have a score of, for example, 0.8, then there will be exactly four times more positive objects than negative ones: 80% of those points will be from class 1, and 20% from class 0. If the classifier doesn't directly optimize logloss, its predictions should be calibrated.

Take a look at this plot. The blue line shows the predictions for the validation set, sorted by value, and the red line shows the corresponding target values smoothed with a rolling window. We clearly see that our predictions are rather conservative: they are much greater than the target mean on the left side, and much lower than they should be on the right side. So this classifier is not calibrated. The green curve shows the predictions after calibration: if we plot the sorted predictions of the calibrated classifier, the curve is very similar to the target rolling mean, and in fact the calibrated predictions have a lower logloss.

There are several ways to calibrate predictions. First, we can use so-called Platt scaling: basically, we just fit a logistic regression to our predictions. I will not go into the details of how to do that, but it is very similar to how we stack models, and we will discuss stacking in detail in a different video. Second, we can fit an isotonic regression to our predictions; again, this is done very similarly to stacking, just with another model. Both of these are sketched in the second snippet below. And finally, we can use stacking itself. The idea is that we can fit any classifier; it doesn't need to optimize logloss, it just needs to be good, for example, in terms of AUC. Then we fit another model on top that takes the predictions of our model and calibrates them properly. That model on top uses logloss as its optimization loss, so its predictions will be calibrated, and overall we will be optimizing logloss indirectly.

Logloss was the only metric here that is easy to optimize directly. With accuracy, there is no easy recipe for direct optimization. In general, the recipe is the following: if it is a binary classification task, fit a model with any metric and tune the binarization threshold. For multi-class tasks, fit any model and tune its parameters comparing the models by their accuracy score, not by the metric the models were actually optimizing; that is, during early stopping and cross-validation you look at the accuracy score.
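A minimal sketch, not from the video, of how the logloss objective is selected in XGBoost, LightGBM and sklearn's logistic regression. The synthetic dataset and the names X_train, X_valid and so on are only there to make the sketch runnable.

    import xgboost as xgb
    import lightgbm as lgb
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # A small synthetic binary classification dataset, just for illustration.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

    # XGBoost: the 'binary:logistic' objective trains with logloss and
    # outputs probabilities.
    xgb_model = xgb.XGBClassifier(objective='binary:logistic')
    xgb_model.fit(X_train, y_train)

    # LightGBM: the 'binary' objective also optimizes logloss; validation
    # logloss is monitored during training.
    lgb_model = lgb.LGBMClassifier(objective='binary')
    lgb_model.fit(X_train, y_train,
                  eval_set=[(X_valid, y_valid)],
                  eval_metric='binary_logloss')

    # Logistic regression in sklearn minimizes logloss by definition.
    logreg = LogisticRegression(max_iter=1000)
    logreg.fit(X_train, y_train)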
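And a minimal sketch, again not from the video, of Platt scaling and isotonic regression on top of a classifier that does not optimize logloss. It continues from the data split in the previous snippet and splits the validation part into a calibration half and an evaluation half.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.isotonic import IsotonicRegression
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split

    # A random forest can rank well but is usually poorly calibrated for logloss.
    rf = RandomForestClassifier(n_estimators=300, random_state=42)
    rf.fit(X_train, y_train)

    # Fit the calibrators on one half of the held-out data and evaluate on
    # the other half, so the comparison is not overly optimistic.
    X_calib, X_eval, y_calib, y_eval = train_test_split(X_valid, y_valid, random_state=42)
    raw_calib = rf.predict_proba(X_calib)[:, 1]
    raw_eval = rf.predict_proba(X_eval)[:, 1]

    # Platt scaling: a one-dimensional logistic regression on the raw scores.
    platt = LogisticRegression()
    platt.fit(raw_calib.reshape(-1, 1), y_calib)

    # Isotonic regression: a monotonic mapping from raw scores to targets.
    iso = IsotonicRegression(out_of_bounds='clip')
    iso.fit(raw_calib, y_calib)

    print('raw     ', log_loss(y_eval, raw_eval))
    print('platt   ', log_loss(y_eval, platt.predict_proba(raw_eval.reshape(-1, 1))[:, 1]))
    print('isotonic', log_loss(y_eval, iso.predict(raw_eval)))

Note that sklearn also packages both methods in CalibratedClassifierCV, with method='sigmoid' for Platt scaling and method='isotonic' for isotonic regression.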
To get an intuition for why accuracy is hard to optimize directly, let's look at this plot. On the vertical axis we show the loss, and the horizontal axis shows the signed distance to the decision boundary, for example to the hyperplane of a linear model. The distance is considered positive if the class is predicted correctly, and negative if the object lies on the wrong side of the decision boundary. The blue line shows the zero-one loss, the loss that corresponds to the accuracy score: we pay 1 if the object is misclassified, that is, if it has a negative distance, and we pay nothing otherwise. The problem is that this loss has a gradient that is zero almost everywhere with respect to the predictions, and most learning algorithms require a nonzero gradient to fit; otherwise it is not clear how the predictions should be changed so that the loss decreases. So people came up with proxy losses that are upper bounds on the zero-one loss: if you perfectly fit a proxy loss, the accuracy will be perfect too, but unlike the zero-one loss, they are differentiable. For example, you see here the logistic loss, the red curve, which is used in logistic regression, and the hinge loss, which is used in SVM; both are written out in a small sketch at the end of this video.

Now recall that to obtain hard labels for a test object, we usually take the argmax of our soft predictions, picking the class with the maximum score. If our task is binary and the soft predictions sum up to 1, argmax is equivalent to a threshold function: we output 1 when the prediction for class 1 is higher than 0.5, and 0 when it is lower. We've already seen an example where the threshold 0.5 is not optimal, so what can we do? We can tune the threshold we apply, and we can do it with a simple grid search implemented with a for loop, also sketched at the end of this video. This means that we can basically fit any sufficiently powerful model; it will not matter much which loss exactly the model optimizes, say hinge loss or logloss. All we want from our model's predictions is the existence of a good threshold that separates the classes. Also, if our classifier is ideally calibrated, then it really returns posterior probabilities, and for such a classifier the threshold 0.5 would be optimal; but such classifiers are rarely the case, so threshold tuning often helps.

So in this video we discussed logloss and accuracy; in the next video we will discuss AUC and quadratic weighted kappa. [MUSIC]
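To make the plot discussion above concrete, here is a small sketch, not from the video, of the three losses written as functions of the signed margin m, which is positive when the object is classified correctly and negative otherwise.

    import numpy as np

    def zero_one_loss(m):
        # Pay 1 for a misclassified object (negative margin), 0 otherwise;
        # the gradient is zero almost everywhere.
        return (np.asarray(m) < 0).astype(float)

    def logistic_loss(m):
        # Smooth convex surrogate used by logistic regression; scaled by
        # 1 / log(2) it upper-bounds the zero-one loss.
        return np.log1p(np.exp(-np.asarray(m)))

    def hinge_loss(m):
        # Piecewise-linear upper bound on the zero-one loss, used by SVM.
        return np.maximum(0.0, 1.0 - np.asarray(m))

    margins = np.linspace(-3, 3, 7)
    print(zero_one_loss(margins))
    print(logistic_loss(margins))
    print(hinge_loss(margins))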
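And here is a small sketch, not from the video, of the threshold grid search described above. The predictions valid_probs and labels y_valid are randomly generated placeholders just to make the sketch runnable; in practice they come from your model and your validation split.

    import numpy as np
    from sklearn.metrics import accuracy_score

    # Placeholder validation labels and soft predictions for class 1.
    rng = np.random.default_rng(42)
    y_valid = rng.integers(0, 2, size=1000)
    valid_probs = np.clip(y_valid * 0.3 + rng.normal(0.4, 0.2, size=1000), 0, 1)

    # Simple grid search over the binarization threshold, keeping the one
    # with the best validation accuracy.
    best_threshold, best_accuracy = 0.5, 0.0
    for threshold in np.linspace(0.01, 0.99, 99):
        accuracy = accuracy_score(y_valid, (valid_probs > threshold).astype(int))
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy

    print(best_threshold, best_accuracy)

    # best_threshold is then applied to the test predictions instead of 0.5.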