[MUSIC] In the previous videos, we discussed metrics for regression problems. Here, we'll review classification metrics. We will first talk about accuracy and logarithmic loss, then get to the area under a receiver operating characteristic (ROC) curve, and finally Cohen's Kappa, specifically quadratic weighted Kappa.

Let's start by fixing the notation. N will be the number of objects in our dataset and L the number of classes. As before, y will stand for the target and y hat for the predictions. If you see an expression in square brackets, that is an indicator function: it equals one if the expression is true and zero if it's false. Throughout the video, we'll use two more terms: hard labels, or hard predictions, and soft labels, or soft predictions. Usually models output some kind of scores, for example probabilities for an object to belong to each class. The scores can be written as a vector of size L, and I will refer to this vector as soft predictions. Now, in classification we are usually asked to predict a label for the object, that is, to make a hard prediction. To do it, we usually find the maximum value among the soft predictions and set the class that corresponds to this maximum score as our predicted label. So a hard label is a function of the soft labels: it's usually argmax for multiclass tasks, but for binary classification it can be thought of as a thresholding function. We output label 1 when the soft score for class 1 is higher than the threshold, and we output class 0 otherwise.

Let's start our journey with the accuracy score. Accuracy is the most straightforward measure of classifier quality. It's a value between 0 and 1, the higher the better, and it is equal to the fraction of correctly classified objects. To compute accuracy, we need hard predictions: we need to assign each object a specific label. Now, what is the best constant to predict in the case of accuracy? Actually, there is only a small number of constants to try, since we can only assign one class label to all the objects at once. So what class should we assign? Obviously, the most frequent one; then the number of correctly guessed objects will be the highest. But exactly because of that reason, there is a caveat in interpreting the values of the accuracy score. Take a look at this example. Say we have 10 cats and 90 dogs in our train set. If we always predicted dog for every object, then the accuracy would already be 0.9. Imagine you tell someone that your classifier is correct 9 times out of 10. The person would probably think you have a nice model, but in fact your model just predicts the dog class no matter what the input is. So the problem is that the baseline accuracy can be very high for a dataset, even 99%, and that makes the results hard to interpret.

Although the accuracy score is very clean and intuitive, it turns out to be quite hard to optimize. Accuracy also doesn't care how confident the classifier is in its predictions, or what the soft predictions are; it cares only about the argmax of the soft predictions. Thus, people sometimes prefer to use different metrics that are, first, easier to optimize and, second, work with soft predictions instead of hard ones. One such metric is logarithmic loss. It tries to make the classifier output posterior probabilities for each object to belong to each class. Log loss is usually written a little bit differently for binary and multiclass tasks. For binary tasks, it is assumed that y hat is a number in the [0, 1] range, and it is the probability of an object belonging to class one.
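To make the accuracy discussion concrete, here is a minimal sketch in Python (assuming numpy and scikit-learn are available; the 10-cats/90-dogs split mirrors the toy example above, and the soft scores are just random placeholders, not a real model's output):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy cats-vs-dogs setup from the example: 10 cats (class 0) and 90 dogs (class 1).
y_true = np.array([0] * 10 + [1] * 90)

# Best constant prediction: always the most frequent class (dog).
y_const = np.ones_like(y_true)
print(accuracy_score(y_true, y_const))  # 0.9 -- the misleading baseline accuracy

# Hard labels from soft predictions: a 0.5 threshold for a binary score
# (for multiclass, argmax over the score vector plays the same role).
soft = np.random.rand(100)              # placeholder soft scores for class 1
hard = (soft > 0.5).astype(int)
print(accuracy_score(y_true, hard))
```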
So 1 minus y hat is the probability for this object to be of class 0. For multiclass tasks, log loss is written in this form. Here y hat i is a vector of size L, and its sum is exactly 1; its elements are the probabilities of belonging to each of the classes. Try to write this formula down for L equals 2, and you will see it is exactly the binary loss from above. And finally, it should be mentioned that, to avoid taking the logarithm of zero in practice, predictions are clipped to be not from 0 to 1, but from some small positive number to 1 minus some small positive number.

Okay, now let us analyze it a little bit. Assume the target for an object is 0, and here on the plot we see how the error will change if we change our prediction from 0 to 1. For comparison, we plot the absolute error in another color. Log loss heavily penalizes completely wrong answers and prefers to make a lot of small mistakes rather than one severe mistake. Now, what is the best constant for logarithmic loss? It turns out that you need to set the predictions to the frequencies of each class in the dataset. In our case, the frequency of the cat class is 0.1 and it is 0.9 for the dog class, so the best constant is the vector of those two values. How do we know that is so? To prove it, we should take the derivative with respect to a constant alpha, set it to 0, and find alpha from this equation.

Okay, we've discussed accuracy and log loss; now let's move on. Take a look at the example. We show the ground truth target value with color, and the position of the point shows the classifier score. Recall that to compute the accuracy score for a binary task, we usually take soft predictions from our model and apply a threshold. We set the prediction to green if the score is higher than 0.5 and to red if it's lower. For this example the accuracy is 6 out of 7, as we misclassified one red object. But look: if the threshold were 0.7, then all the objects would be classified correctly. So this is kind of a motivation for our next metric, Area Under Curve. We don't fix a threshold for it; instead, this metric kind of tries all possible thresholds and aggregates those scores. So this metric doesn't really care about the absolute values of the predictions; it depends only on the order of the objects. Actually, there are several ways AUC, or this area under the curve, can be explained. The first one explains under what curve we should compute the area, and the second explains AUC as the probability of object pairs being correctly ordered by our model. We will see both explanations in a moment.

So let's start with the first one. We need to calculate an area under a curve. What curve? Let's construct it right now. Once again, say we have six objects, and their true label is shown with a color, while the position of the dot shows the classifier's prediction. For now we will use the word positive as a synonym for "belongs to the red class", so the positive side is on the left. What we will do now is go from left to right, jumping from one object to another, and for each one we will calculate how many red and green dots there are to the left of the object we stand on. The red dots to the left we'll call true positives, and the green ones false positives. So we will kind of compute how many true positives and false positives we see to the left of the object we stand on. Actually it's very simple: we start from the bottom left corner and go up every time we see a red point, and right when we see a green one. Let's see. So we stand on the leftmost point first.
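Here is a small sketch of the clipping and of the best-constant claim (Python with numpy and scikit-learn assumed; the 10/90 class split and the clipping constant are illustrative choices, not part of the lecture slides):

```python
import numpy as np
from sklearn.metrics import log_loss

# Same toy target: 10 cats (class 0) and 90 dogs (class 1).
y_true = np.array([0] * 10 + [1] * 90)

def logloss_for_constant(p_dog, eps=1e-15):
    """Binary log loss when we predict the same probability p_dog for every object."""
    p = np.clip(p_dog, eps, 1 - eps)  # clipping avoids log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# The class frequency (0.9 for dog) gives the smallest constant log loss.
for p in [0.5, 0.7, 0.9, 0.99]:
    print(p, logloss_for_constant(p))

# sklearn agrees with the manual computation for the constant 0.9.
print(log_loss(y_true, np.full(100, 0.9)))
```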
And it is red, or positive, so we increase the number of true positives and move up. Next, we jump to the green point. It is a false positive, so we go right. Then two times up for two red points, and finally two times right for the last green point. We finish in the top right corner, and it always works like that: we start from the bottom left and end up in the top right corner when we jump onto the rightmost point. By the way, the curve we've just built is called the Receiver Operating Characteristic curve, or ROC curve. And now we are ready to calculate the area under this curve. The area is seven, and we need to normalize it by the total area of the square, which is nine. So AUC is 7/9, cool. Now, what will AUC be for a dataset that can be separated with a threshold, like in our initial example? Actually, AUC will be 1, the maximum value of AUC. So it works: it doesn't need a threshold to be specified and it doesn't depend on absolute values. Recall that we never used absolute values while constructing the curve. Now, in practice, if you build such a curve for a huge dataset and a real classifier, you would observe a picture like this. Here, curves for different classifiers are shown in different colors. The curves usually lie above the dashed line, which shows how the curve would look if we made predictions at random. So it kind of shows us a baseline, and note that the area under the dashed line is 0.5.

All right, we've seen that we can build a curve and compute the area under it. There is another, totally different, explanation of AUC. Consider all pairs of objects such that one object is from the red class and the other one is from the green class. AUC is the probability that the score for the green one will be higher than the score for the red one. In other words, AUC is the fraction of correctly ordered pairs. You see, in our example we have two incorrectly ordered pairs and nine pairs in total. That leaves 7 correctly ordered pairs, and thus AUC is 7/9, exactly as we got before while computing the area under the curve.

All right, we've discussed how to compute AUC. Now let's think about what the best constant prediction for it is. In fact, AUC doesn't depend on the exact values of the predictions, so all constants lead to the same score, and this score will be around 0.5, the baseline. This is actually something that people love about AUC: it is clear what the baseline is. Of course there are flaws in AUC, every metric has some, but still AUC is the metric I usually use when no one sets another one up for me.
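Both views of AUC can be checked with a few lines of Python (numpy and scikit-learn assumed; the six scores below are made-up values chosen so that exactly two of the nine red-green pairs are misordered, reproducing the 7/9 from the example; note the code uses the usual convention that a higher score means "more positive"):

```python
import numpy as np
from itertools import product
from sklearn.metrics import roc_auc_score

# Six objects: 1 marks the positive (red) class, 0 the negative (green) class.
y_true  = np.array([1, 0, 1, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.3])  # illustrative scores

# Pairwise view: fraction of (positive, negative) pairs ordered correctly,
# counting ties as half a correct pair.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
              for p, n in product(pos, neg))
print(correct / (len(pos) * len(neg)))  # 7/9 ~ 0.778

# Area-under-the-ROC-curve view gives the same number.
print(roc_auc_score(y_true, y_score))
```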
Well, to be precise, we permute and calculate accuracies many times, and take as the baseline the average of those computed accuracies. In practice, of course, we do not need to do any permutations; this baseline score can be computed analytically. We need, first, to multiply the empirical frequencies of our predictions and of the ground truth labels for each class, and then sum them up. For example, if we assign 20 cat labels and 80 dog labels at random, then the baseline accuracy will be 0.2 * 0.1 + 0.8 * 0.9 = 0.74. You can find more examples of this, actually. Here I wanted to explain a nice way of thinking about this baseline. Recall that error is equal to 1 minus accuracy, so we can rewrite the formula as 1 minus the model's error divided by the baseline error. It will still be Cohen's Kappa, but now it will be easier to derive weighted Cohen's Kappa.

To explain weighted Kappa, we first need to take a step aside and introduce weighted error. Say now we have cats, dogs, and tigers to classify. We are more or less okay if we predict dog instead of cat, but it's undesirable to predict cat or dog if it's really a tiger. So we're going to form a weight matrix where each cell contains the weight for the mistake we might make. In our case, we set the error weight to be ten times larger if we predict cat or dog but the ground truth label is tiger. So with an error weight matrix, we can express our preferences about the errors the classifier makes. Now, to calculate the weighted error, we need another matrix: the confusion matrix for the classifier's predictions. This matrix shows how our classifier distributes its predictions over the objects. For example, the first column indicates that four cats out of ten were recognized correctly, two were classified as dogs, and four as tigers. So to get a weighted error score, we need to multiply these two matrices element-wise and sum the result. This formula needs a proper normalization to make sure the quantity is between 0 and 1, but it doesn't matter for our purposes, as the normalization constant will cancel anyway. And finally, weighted Kappa is calculated as 1 minus the weighted error divided by the weighted baseline error.

In many cases, the weight matrices are defined in a very simple way, for example for classification problems with ordered labels. Say you need to assign each object a value from 1 to 3; it can be, for instance, a rating of how severe a disease is. And it is not regression, since you are not allowed to output values somewhere between the ratings, and the ground truth values also look more like labels than numeric values to predict. So such problems are usually treated as classification problems, but a weight matrix is introduced to account for the order of the labels. For example, the weights can be linear: if we predict two instead of one, we pay one; if we predict three instead of one, we pay two. Or the weights can be quadratic: if we predict two instead of one, we still pay one, but if we predict three instead of one, we now pay four. Depending on what weight matrix is used, we get either linear weighted Kappa or quadratic weighted Kappa. The quadratic weighted Kappa has been used in several competitions on Kaggle. It is usually explained as an inter-rater agreement coefficient: how much the predictions of the model agree with ground-truth raters, which is quite intuitive for medical applications, where it measures how much the model agrees with professional doctors.

Finally, in this video, we've discussed classification metrics.
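As a quick sketch (Python with numpy and scikit-learn assumed; the ratings and the noise model below are made up purely for illustration), both plain and weighted Kappa are available directly in scikit-learn, and the shuffled-predictions baseline can be checked empirically:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical ordered labels (ratings 1..3), as in the disease-severity example.
y_true = rng.integers(1, 4, size=1000)
y_pred = np.clip(y_true + rng.integers(-1, 2, size=1000), 1, 3)  # noisy predictions

print(cohen_kappa_score(y_true, y_pred))                       # plain Cohen's Kappa
print(cohen_kappa_score(y_true, y_pred, weights='linear'))     # linear weighted Kappa
print(cohen_kappa_score(y_true, y_pred, weights='quadratic'))  # quadratic weighted Kappa

# Sanity check of the shuffled-predictions baseline: permuting the predictions
# many times drives the (unweighted) Kappa towards zero on average.
shuffled = [cohen_kappa_score(y_true, rng.permutation(y_pred)) for _ in range(200)]
print(np.mean(shuffled))  # close to 0
```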
Accuracy is an essential metric for classification, but a simple model that always predicts the same value can have a very high accuracy, which makes this metric hard to interpret. The score also depends on the threshold we choose to convert soft predictions to hard labels. Log loss is another metric; as opposed to accuracy, it depends on soft predictions rather than on hard labels, and it forces the model to predict the probabilities of an object belonging to each class. AUC, the area under the receiver operating characteristic curve, doesn't depend on the absolute values predicted by the classifier but only considers the ordering of the objects. It also implicitly tries all the thresholds for converting soft predictions to hard labels, and thus removes the dependence of the score on the threshold. Finally, Cohen's Kappa fixes the baseline for the accuracy score at zero; in spirit it is very similar to how R-squared rescales MSE to make it easier to interpret. If instead of accuracy we used weighted accuracy, we would get weighted Kappa. Weighted Kappa with quadratic weights is called quadratic weighted Kappa and is commonly used on Kaggle. [MUSIC]