1 00:00:00,066 --> 00:00:03,966 [MUSIC] 2 00:00:03,966 --> 00:00:08,900 In the previous videos, we discussed metrics for regression problems. 3 00:00:08,900 --> 00:00:11,310 And here, we'll review classification metrics. 4 00:00:12,932 --> 00:00:17,310 We will first talk about accuracy, logarithmic loss, and 5 00:00:17,310 --> 00:00:23,200 then get to area under a receiver operating curve, and Cohen's Kappa. 6 00:00:23,200 --> 00:00:25,390 And specifically, quadratic weighted Kappa. 7 00:00:26,420 --> 00:00:28,110 Let's start by fixing the notation. 8 00:00:29,350 --> 00:00:34,600 N will be the number of objects in our dataset, L, the number of classes. 9 00:00:35,720 --> 00:00:41,458 As before, y will stand for the target, and y hat, for predictions. 10 00:00:41,458 --> 00:00:46,430 If you see an expression in square brackets, that is an indicator function. 11 00:00:46,430 --> 00:00:51,510 It yields one if the expression is true and zero if it's false. 12 00:00:51,510 --> 00:00:56,426 Throughout the video, we'll use two more terms: hard labels, or 13 00:00:56,426 --> 00:01:00,985 hard predictions, and soft labels, or soft predictions. 14 00:01:00,985 --> 00:01:04,970 Usually models output some kind of scores. 15 00:01:04,970 --> 00:01:09,470 For example, probabilities for an object to belong to each class. 16 00:01:10,730 --> 00:01:14,950 The scores can be written as a vector of size L, and 17 00:01:14,950 --> 00:01:18,840 I will refer to this vector as soft predictions. 18 00:01:18,840 --> 00:01:22,630 Now, in classification we are usually asked to predict a label for 19 00:01:22,630 --> 00:01:24,700 the object, to make a hard prediction. 20 00:01:26,110 --> 00:01:30,656 To do it, we usually find the maximum value in the soft predictions, and 21 00:01:30,656 --> 00:01:35,445 set the class that corresponds to this maximum score as our predicted label. 22 00:01:35,445 --> 00:01:38,613 So a hard label is a function of soft labels; 23 00:01:38,613 --> 00:01:42,485 it's usually argmax for multiclass tasks, but for 24 00:01:42,485 --> 00:01:47,860 binary classification it can be thought of as a thresholding function. 25 00:01:49,170 --> 00:01:52,310 So we output label 1 when the soft score for 26 00:01:52,310 --> 00:01:56,740 class 1 is higher than the threshold, and we output class 0 otherwise. 27 00:01:57,820 --> 00:02:00,500 Let's start our journey with the accuracy score. 28 00:02:01,510 --> 00:02:04,920 Accuracy is the most straightforward measure of classifier quality. 29 00:02:06,150 --> 00:02:08,380 It's a value between 0 and 1. 30 00:02:08,380 --> 00:02:10,014 The higher, the better. 31 00:02:10,014 --> 00:02:15,679 And it is equal to the fraction of correctly classified objects. 32 00:02:15,679 --> 00:02:18,908 To compute accuracy, we need hard predictions. 33 00:02:18,908 --> 00:02:22,975 We need to assign each object a specific label. 34 00:02:22,975 --> 00:02:28,319 Now, what is the best constant to predict in the case of accuracy? 35 00:02:28,319 --> 00:02:31,339 Actually, there are only a small number of constants to try. 36 00:02:31,339 --> 00:02:35,660 We can only assign one class label to all the objects at once. 37 00:02:35,660 --> 00:02:39,440 So what class should we assign? 38 00:02:39,440 --> 00:02:40,970 Obviously, the most frequent one. 39 00:02:42,060 --> 00:02:45,330 Then the number of correctly guessed objects will be the highest.
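To make the hard versus soft distinction concrete, here is a minimal NumPy sketch; the scores and labels below are made up for illustration:

```python
import numpy as np

# Soft predictions: one row per object, one probability per class (L = 3).
soft = np.array([[0.2, 0.5, 0.3],
                 [0.7, 0.2, 0.1],
                 [0.1, 0.1, 0.8]])

# Hard predictions for a multiclass task: argmax over the class scores.
hard = soft.argmax(axis=1)                   # -> [1, 0, 2]

# For a binary task, a hard prediction is just a thresholding function.
p_class1 = np.array([0.9, 0.4, 0.6])
hard_binary = (p_class1 > 0.5).astype(int)   # -> [1, 0, 1]

# Accuracy: the fraction of correctly classified objects.
y_true = np.array([1, 0, 0])
accuracy = (hard == y_true).mean()           # 2 of 3 correct -> 0.666...

# Best constant prediction for accuracy: the most frequent class.
best_constant = np.bincount(y_true).argmax()
```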
40 00:02:46,830 --> 00:02:49,000 But exactly because of that reason, 41 00:02:49,000 --> 00:02:53,360 there is a caveat in interpreting the values of the accuracy score. 42 00:02:54,840 --> 00:02:57,330 Take a look at this example. 43 00:02:57,330 --> 00:03:02,048 Say we have 10 cats and 90 dogs in our train set. 44 00:03:02,048 --> 00:03:06,081 If we always predicted dog for every object, 45 00:03:06,081 --> 00:03:10,020 then the accuracy would already be 0.9. 46 00:03:10,020 --> 00:03:15,429 And imagine you tell someone that your classifier is correct 9 times out of 10. 47 00:03:16,460 --> 00:03:19,523 The person would probably think you have a nice model. 48 00:03:19,523 --> 00:03:25,598 But in fact, your model just predicts the dog class no matter what the input is. 49 00:03:25,598 --> 00:03:31,374 So the problem is that the baseline accuracy can be very high for 50 00:03:31,374 --> 00:03:38,004 a dataset, even 99%, and that makes it hard to interpret the results. 51 00:03:38,004 --> 00:03:40,865 Although the accuracy score is very clean and intuitive, 52 00:03:40,865 --> 00:03:43,050 it turns out to be quite hard to optimize. 53 00:03:44,230 --> 00:03:49,867 Accuracy also doesn't care how confident the classifier is in the predictions, 54 00:03:49,867 --> 00:03:52,080 and what the soft predictions are. 55 00:03:53,090 --> 00:03:56,000 It cares only about the argmax of the soft predictions. 56 00:03:57,330 --> 00:04:01,768 And thus, people sometimes prefer to use different metrics that are, first, 57 00:04:01,768 --> 00:04:03,005 easier to optimize. 58 00:04:03,005 --> 00:04:08,180 And second, these metrics work with soft predictions, not hard ones. 59 00:04:09,340 --> 00:04:11,960 One such metric is logarithmic loss. 60 00:04:11,960 --> 00:04:17,775 It tries to make the classifier output posterior probabilities for 61 00:04:17,775 --> 00:04:22,385 the objects to be of a certain kind, of a certain class. 62 00:04:22,385 --> 00:04:26,265 Log loss is usually written a little bit differently for binary and 63 00:04:26,265 --> 00:04:27,395 multiclass tasks. 64 00:04:28,500 --> 00:04:35,102 For binary, it is assumed that y hat is a number from the [0, 1] range, 65 00:04:35,102 --> 00:04:41,112 and it is the probability of an object to belong to class 1. 66 00:04:41,112 --> 00:04:47,400 So 1 minus y hat is the probability for this object to be of class 0. 67 00:04:47,400 --> 00:04:51,460 For multiclass tasks, log loss is written in this form. 68 00:04:52,630 --> 00:05:00,183 Here y hat i is a vector of size L, and its sum is exactly 1. 69 00:05:00,183 --> 00:05:05,210 The elements are the probabilities to belong to each of the classes. 70 00:05:06,400 --> 00:05:10,138 Try to write this formula down for L equals 2, and 71 00:05:10,138 --> 00:05:13,887 you will see it is exactly the binary loss from above. 72 00:05:13,887 --> 00:05:19,404 And finally, it should be mentioned that to avoid infinite penalties in practice, 73 00:05:19,404 --> 00:05:23,796 predictions are clipped to be not from 0 to 1, but 74 00:05:23,796 --> 00:05:30,053 from some small positive number to 1 minus some small positive number. 75 00:05:30,053 --> 00:05:33,230 Okay, now let us analyze it a little bit. 76 00:05:33,230 --> 00:05:39,080 Assume the target for an object is 0, and here on the plot, 77 00:05:39,080 --> 00:05:43,213 we see how the error will change if we change our prediction from 0 to 1. 78 00:05:43,213 --> 00:05:48,640 For comparison, we'll plot the absolute error with another color.
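As a sketch of the log loss formulas just described, written out in Python; the clipping constant eps is an assumption, real libraries pick their own small value:

```python
import numpy as np

def binary_logloss(y_true, y_pred, eps=1e-15):
    # Clip predictions away from 0 and 1 so log(0) never blows up.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def multiclass_logloss(y_true_onehot, y_pred, eps=1e-15):
    # y_pred has shape (N, L); each row sums to 1.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(np.sum(y_true_onehot * np.log(y_pred), axis=1))

# For L = 2 the multiclass form reduces to the binary one:
y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])
onehot = np.stack([1 - y, y], axis=1)
probs = np.stack([1 - p, p], axis=1)
assert np.isclose(binary_logloss(y, p), multiclass_logloss(onehot, probs))
```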
79 00:05:48,640 --> 00:05:53,269 Log loss heavily penalizes completely wrong answers and 80 00:05:53,269 --> 00:05:58,676 prefers a lot of small mistakes to one severe mistake. 81 00:05:58,676 --> 00:06:03,190 Now, what is the best constant for logarithmic loss? 82 00:06:03,190 --> 00:06:07,560 It turns out that you need to set predictions to the frequencies of each 83 00:06:07,560 --> 00:06:08,977 class in the dataset. 84 00:06:08,977 --> 00:06:12,789 In our case, the frequency of 85 00:06:12,789 --> 00:06:19,200 the cat class is 0.1, and it is 0.9 for the dog class. 86 00:06:19,200 --> 00:06:22,250 Then the best constant is the vector of those two values. 87 00:06:23,420 --> 00:06:26,070 Well, how do we know that this is so? 88 00:06:27,160 --> 00:06:31,790 To prove it, we should take the derivative with respect to the constant alpha, 89 00:06:31,790 --> 00:06:35,240 set it to 0, and find alpha from this equation. 90 00:06:36,360 --> 00:06:41,480 Okay, we've discussed accuracy and log loss, now let's move on. 91 00:06:42,570 --> 00:06:44,580 Take a look at the example. 92 00:06:44,580 --> 00:06:47,880 We show the ground truth target value with color, and 93 00:06:47,880 --> 00:06:50,490 the position of the point shows the classifier's score. 94 00:06:51,490 --> 00:06:55,670 Recall that to compute the accuracy score for a binary task, 95 00:06:55,670 --> 00:06:59,529 we usually take soft predictions from our model and apply a threshold. 96 00:07:00,530 --> 00:07:05,631 We set the prediction to be green if the score is higher than 0.5, and 97 00:07:05,631 --> 00:07:07,059 red if it's lower. 98 00:07:07,059 --> 00:07:14,486 For this example the accuracy is 6 out of 7, as we misclassified one red object. 99 00:07:14,486 --> 00:07:18,115 But look, if the threshold were 0.7, 100 00:07:18,115 --> 00:07:22,934 then all the objects would be classified correctly. 101 00:07:22,934 --> 00:07:27,620 So this is kind of the motivation for our next metric, Area Under Curve. 102 00:07:29,120 --> 00:07:31,710 We don't need to fix the threshold for it; 103 00:07:32,820 --> 00:07:37,127 this metric kind of tries all possible ones and aggregates those scores. 104 00:07:38,190 --> 00:07:43,139 So this metric doesn't really care about the absolute values of the predictions. 105 00:07:43,139 --> 00:07:46,602 It depends only on the order of the objects. 106 00:07:46,602 --> 00:07:53,767 Actually, there are several ways AUC, or this area under curve, can be explained. 107 00:07:53,767 --> 00:07:57,630 The first one explains under what curve we should compute the area. 108 00:07:57,630 --> 00:08:01,520 And the second explains AUC as the probability of 109 00:08:01,520 --> 00:08:04,530 object pairs to be correctly ordered by our model. 110 00:08:04,530 --> 00:08:08,897 We will see both explanations in a moment. 111 00:08:08,897 --> 00:08:11,031 So let's start with the first one. 112 00:08:11,031 --> 00:08:15,211 So we need to calculate an area under a curve. 113 00:08:15,211 --> 00:08:17,151 What curve? 114 00:08:17,151 --> 00:08:19,471 Let's construct it right now. 115 00:08:19,471 --> 00:08:25,201 Once again, say we have six objects, and their true label is shown with a color. 116 00:08:25,201 --> 00:08:29,951 And the position of the dot shows the classifier's prediction. 117 00:08:29,951 --> 00:08:36,801 And for now we will use the word positive as a synonym for belongs to the red class. 118 00:08:36,801 --> 00:08:40,021 So the positive side is on the left.
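Before we construct that curve, here is a quick numeric check of the best-constant claim for log loss; a sketch reusing the 0.9 dog frequency from the example:

```python
import numpy as np

def logloss_of_constant(alpha, freq_pos):
    # Expected log loss if we always predict probability alpha for the
    # positive class on a dataset where its true frequency is freq_pos.
    return -(freq_pos * np.log(alpha) + (1 - freq_pos) * np.log(1 - alpha))

freq_dog = 0.9                            # 90 dogs out of 100, as in the example
alphas = np.linspace(0.01, 0.99, 99)      # candidate constant predictions
losses = logloss_of_constant(alphas, freq_dog)
print(alphas[losses.argmin()])            # ~0.9: the class frequency is optimal
```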
119 00:08:40,021 --> 00:08:46,744 What we will do now is go from left to right, jumping from one object to another. 120 00:08:46,744 --> 00:08:51,320 And for each one we will calculate how many red and 121 00:08:51,320 --> 00:08:57,961 green dots there are to the left of the object we stand on. 122 00:08:57,961 --> 00:09:02,421 The red dots, we'll have a name for them: true positives. 123 00:09:02,421 --> 00:09:06,851 And for the green ones we'll have the name false positives. 124 00:09:06,851 --> 00:09:12,072 So we will kind of compute how many true positives and 125 00:09:12,072 --> 00:09:18,015 false positives we see to the left of the object we stand on. 126 00:09:18,015 --> 00:09:22,855 Actually it's very simple: we start from the bottom left corner and 127 00:09:22,855 --> 00:09:25,331 go up every time we see a red point. 128 00:09:25,331 --> 00:09:28,331 And right when we see a green one. 129 00:09:28,331 --> 00:09:30,101 Let's see. 130 00:09:30,101 --> 00:09:32,441 So we stand on the leftmost point first. 131 00:09:32,441 --> 00:09:35,020 And it is red, or positive. 132 00:09:35,020 --> 00:09:38,873 So we increase the number of true positives and move up. 133 00:09:38,873 --> 00:09:41,641 Next, we jump onto the green point. 134 00:09:41,641 --> 00:09:45,906 It is a false positive, and so we go right. 135 00:09:45,906 --> 00:09:49,691 Then two times up for two red points. 136 00:09:49,691 --> 00:09:53,561 And finally two times right for the last two green points. 137 00:09:53,561 --> 00:09:55,925 We finished in the top right corner. 138 00:09:55,925 --> 00:09:59,930 And it always works like that. 139 00:09:59,930 --> 00:10:01,680 We start from the bottom left and 140 00:10:01,680 --> 00:10:06,600 end up in the top right corner when we jump onto the rightmost point. 141 00:10:06,600 --> 00:10:12,290 By the way, the curve we've just built is called the Receiver Operating Characteristic curve, or 142 00:10:12,290 --> 00:10:12,860 ROC curve. 143 00:10:14,180 --> 00:10:17,290 And now we are ready to calculate the area under this curve. 144 00:10:18,460 --> 00:10:25,550 The area is 7, and we need to normalize it by the total area of the square. 145 00:10:25,550 --> 00:10:30,102 So AUC is 7/9, cool. 146 00:10:30,102 --> 00:10:34,929 Now, what will AUC be for a dataset that can be separated 147 00:10:34,929 --> 00:10:39,150 with a threshold, like in our initial example? 148 00:10:39,150 --> 00:10:43,778 Actually, AUC will be 1, the maximum value of AUC. 149 00:10:43,778 --> 00:10:46,526 So it works. 150 00:10:46,526 --> 00:10:49,270 It doesn't need a threshold to be specified and 151 00:10:49,270 --> 00:10:51,610 it doesn't depend on absolute values. 152 00:10:51,610 --> 00:10:56,328 Recall that we never used absolute values while constructing the curve. 153 00:10:56,328 --> 00:10:59,680 Now, in practice, if you build such a curve for 154 00:10:59,680 --> 00:11:05,397 a huge dataset and a real classifier, you would observe a picture like this. 155 00:11:05,397 --> 00:11:10,510 Here curves for different classifiers are shown with different colors. 156 00:11:10,510 --> 00:11:15,368 The curves usually lie above the dashed line, which shows how 157 00:11:15,368 --> 00:11:20,240 the curve would look if we made predictions at random. 158 00:11:20,240 --> 00:11:24,500 So it kind of shows us a baseline. 159 00:11:24,500 --> 00:11:29,818 And note that the area under the dashed line is 0.5. 160 00:11:29,818 --> 00:11:35,420 All right, we've seen that we can build a curve and compute the area under it.
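A minimal sketch of the same curve construction in code; the walking order below mirrors the red and green dots of the example, with red as the positive class:

```python
# Toy example: 1 marks the red (positive) class, 0 the green class.
# Walking the objects from the positive side gives the order used above:
# red, green, red, red, green, green.
labels_in_walk_order = [1, 0, 1, 1, 0, 0]

tp = fp = area = 0
for label in labels_in_walk_order:
    if label == 1:
        tp += 1         # red point: move up
    else:
        fp += 1         # green point: move right,
        area += tp      # sweeping out a column of height tp
auc = area / (tp * fp)  # normalize by the total area of the rectangle
print(auc)              # 7/9 = 0.777...
```

For real data, scikit-learn's roc_auc_score(y_true, y_score) computes the same quantity directly from raw scores.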
161 00:11:35,420 --> 00:11:40,832 There is another, totally different explanation of AUC. 162 00:11:40,832 --> 00:11:43,324 Consider all pairs of objects, 163 00:11:43,324 --> 00:11:48,595 such that one object is from the red class and the other one is from the green. 164 00:11:48,595 --> 00:11:51,527 AUC is the probability that the score for 165 00:11:51,527 --> 00:11:56,491 the green one will be higher than the score for the red one. 166 00:11:56,491 --> 00:12:01,556 In other words, AUC is the fraction of correctly ordered pairs. 167 00:12:01,556 --> 00:12:06,750 You see, in our example we have two incorrectly ordered pairs and 168 00:12:06,750 --> 00:12:08,520 nine pairs in total. 169 00:12:08,520 --> 00:12:15,214 So there are 7 correctly ordered pairs, and thus AUC is 7/9. 170 00:12:15,214 --> 00:12:20,758 Exactly what we got before while computing the area under the curve. 171 00:12:20,758 --> 00:12:24,770 All right, we've discussed how to compute AUC. 172 00:12:24,770 --> 00:12:28,780 Now let's think about what the best constant prediction for it is. 173 00:12:28,780 --> 00:12:32,580 In fact, AUC doesn't depend on the exact values of the predictions. 174 00:12:32,580 --> 00:12:36,382 So all constants will lead to the same score, and 175 00:12:36,382 --> 00:12:40,490 this score will be around 0.5, the baseline. 176 00:12:40,490 --> 00:12:44,957 This is actually something that people love about AUC. 177 00:12:44,957 --> 00:12:48,110 It is clear what the baseline is. 178 00:12:48,110 --> 00:12:53,240 Of course there are flaws in AUC, every metric has some. 179 00:12:53,240 --> 00:12:59,180 But still, AUC is the metric I usually use when no one sets a different one for me. 180 00:12:59,180 --> 00:13:05,914 All right, finally let's get to the last metric to discuss, 181 00:13:05,914 --> 00:13:10,331 Cohen's Kappa and its derivatives. 182 00:13:10,331 --> 00:13:15,128 Recall that if we always predict the label of the most frequent class, 183 00:13:15,128 --> 00:13:20,350 we can already get a pretty high accuracy score, and that can be misleading. 184 00:13:20,350 --> 00:13:24,960 Actually, in our example all the models we fit 185 00:13:24,960 --> 00:13:29,800 will have a score somewhere between 0.9 and 1. 186 00:13:29,800 --> 00:13:36,682 So we can introduce a new metric such that for an accuracy of 1 it would give us 1, 187 00:13:36,682 --> 00:13:40,950 and for the baseline accuracy it would output 0. 188 00:13:40,950 --> 00:13:45,290 And of course, baselines are going to be different for 189 00:13:45,290 --> 00:13:49,543 every dataset, not necessarily 0.9. 190 00:13:49,543 --> 00:13:53,560 It is also very similar to what R-squared does with MSE. 191 00:13:53,560 --> 00:13:58,480 Informally speaking, it kind of normalizes it. 192 00:13:58,480 --> 00:14:01,690 So we do the same here. 193 00:14:01,690 --> 00:14:05,660 And this is actually already almost Cohen's Kappa. 194 00:14:05,660 --> 00:14:09,572 In Cohen's Kappa we take another value as the baseline. 195 00:14:09,572 --> 00:14:14,428 We take the hard predictions for the dataset and shuffle them, 196 00:14:14,428 --> 00:14:16,291 randomly permuting them. 197 00:14:16,291 --> 00:14:21,050 And then we calculate the accuracy for these shuffled predictions. 198 00:14:21,050 --> 00:14:24,778 And that will be our baseline. 199 00:14:24,778 --> 00:14:29,845 Well, to be precise, we permute and calculate accuracies many times and 200 00:14:29,845 --> 00:14:34,750 take, as the baseline, the average of those computed accuracies.
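A sketch of that permutation baseline, reusing the 10-cats/90-dogs example; the model that predicts 20 cats and 80 dogs is an assumption, chosen to match the numbers that come up next:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = np.array([0] * 10 + [1] * 90)   # 10 cats (0), 90 dogs (1)
y_pred = np.array([0] * 20 + [1] * 80)   # a model predicting 20 cats, 80 dogs
rng.shuffle(y_pred)                      # in some arbitrary order

accuracy = (y_pred == y_true).mean()

# Baseline: average accuracy of shuffled predictions over many permutations.
baseline = np.mean([(rng.permutation(y_pred) == y_true).mean()
                    for _ in range(10_000)])
print(baseline)                          # close to 0.74, computed analytically below

kappa = 1 - (1 - accuracy) / (1 - baseline)
```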
201 00:14:34,750 --> 00:14:39,210 In practice, of course, we do not need to do any permutations. 202 00:14:39,210 --> 00:14:43,411 This baseline score can be computed analytically. 203 00:14:43,411 --> 00:14:48,522 We need, first, to multiply the empirical frequencies of our predictions and 204 00:14:48,522 --> 00:14:52,196 the ground truth labels for each class, and then sum them up. 205 00:14:52,196 --> 00:14:56,859 For example, if we assign 20 cat labels and 206 00:14:56,859 --> 00:15:02,505 80 dog labels at random, then the baseline accuracy 207 00:15:02,505 --> 00:15:08,411 will be 0.2*0.1 + 0.8*0.9 = 0.74. 208 00:15:08,411 --> 00:15:12,185 You can find more examples in the reading materials, actually. 209 00:15:12,185 --> 00:15:18,584 Here I wanted to explain a nice way of thinking about the denominator as a baseline. 210 00:15:18,584 --> 00:15:23,356 We can also recall that error is equal to 1 minus accuracy. 211 00:15:23,356 --> 00:15:31,206 We could rewrite the formula as 1 minus the model's error divided by the baseline error. 212 00:15:31,206 --> 00:15:33,018 It will still be Cohen's Kappa, 213 00:15:33,018 --> 00:15:36,520 but now it would be easier to derive weighted Cohen's Kappa. 214 00:15:36,520 --> 00:15:41,274 To explain weighted Kappa, we first need to take a step aside and 215 00:15:41,274 --> 00:15:43,396 introduce weighted error. 216 00:15:43,396 --> 00:15:46,740 Say now we have cats, dogs, and tigers to classify. 217 00:15:46,740 --> 00:15:52,704 And we are more or less okay if we predict dog instead of cat. 218 00:15:52,704 --> 00:15:57,760 But it's undesirable to predict cat or dog if it's really a tiger. 219 00:15:57,760 --> 00:16:01,736 So we're going to form a weight matrix where 220 00:16:01,736 --> 00:16:06,150 each cell contains the weight for the mistake we might make. 221 00:16:07,170 --> 00:16:11,610 In our case, we set the error weight to be ten times larger if we predict cat or 222 00:16:11,610 --> 00:16:14,430 dog but the ground truth label is tiger. 223 00:16:15,830 --> 00:16:18,340 So with an error weight matrix, 224 00:16:18,340 --> 00:16:22,260 we can express our preferences about the errors that the classifier might make. 225 00:16:23,390 --> 00:16:25,080 Now, to calculate the weighted 226 00:16:25,080 --> 00:16:29,500 error we need another matrix, the confusion matrix, for the classifier's predictions. 227 00:16:31,010 --> 00:16:35,270 This matrix shows how our classifier distributes the predictions over 228 00:16:35,270 --> 00:16:36,720 the objects. 229 00:16:36,720 --> 00:16:41,880 For example, the first column indicates that four cats out of ten were recognized 230 00:16:41,880 --> 00:16:47,980 correctly, two were classified as dogs, and four as tigers. 231 00:16:47,980 --> 00:16:51,350 So to get a weighted error score, 232 00:16:51,350 --> 00:16:56,520 we need to multiply these two matrices element-wise and sum the results. 233 00:16:57,850 --> 00:17:01,940 This formula needs a proper normalization 234 00:17:01,940 --> 00:17:06,850 to make sure the quantity is between 0 and 1, but it doesn't matter for 235 00:17:06,850 --> 00:17:10,450 our purposes, as the normalization constant will cancel anyway. 236 00:17:12,120 --> 00:17:15,868 And finally, weighted kappa is calculated as 1 minus 237 00:17:15,868 --> 00:17:19,730 weighted error divided by weighted baseline error. 238 00:17:21,010 --> 00:17:26,070 In many cases, the weight matrices are defined in a very simple way. 239 00:17:26,070 --> 00:17:29,470 For example, for classification problems with ordered labels.
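Before turning to those ordered labels, here is a sketch of the weighted error and weighted kappa computation we just walked through; only the first column of the confusion matrix comes from the example, the rest is made up for illustration:

```python
import numpy as np

# Error weight matrix: rows = predicted class, columns = true class
# (cat, dog, tiger). Predicting cat or dog for a real tiger costs 10x.
W = np.array([[ 0,  1, 10],    # predicted cat
              [ 1,  0, 10],    # predicted dog
              [ 1,  1,  0]])   # predicted tiger

# Confusion matrix, same orientation. First column, from the example:
# of 10 true cats, 4 predicted as cat, 2 as dog, 4 as tiger.
C = np.array([[ 4,  2,  1],
              [ 2,  6,  2],
              [ 4,  2,  7]])

# Weighted error: element-wise product, summed.
# (The normalization constant cancels in the kappa ratio.)
weighted_error = (W * C).sum()

# Baseline: the expected confusion matrix of shuffled predictions,
# i.e. the outer product of predicted counts and true counts.
pred_counts = C.sum(axis=1)    # how often we predict each class
true_counts = C.sum(axis=0)    # how many objects of each class there are
C_baseline = np.outer(pred_counts, true_counts) / C.sum()

weighted_baseline_error = (W * C_baseline).sum()
weighted_kappa = 1 - weighted_error / weighted_baseline_error
```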
240 00:17:30,700 --> 00:17:35,200 Say you need to assign each object a value from 1 to 3. 241 00:17:35,200 --> 00:17:39,200 It can be, for instance, a rating of how severe the disease is. 242 00:17:39,200 --> 00:17:44,950 And it is not regression, since you are not allowed to output values somewhere 243 00:17:44,950 --> 00:17:49,690 between the ratings, and the ground truth values also look more like labels, 244 00:17:49,690 --> 00:17:51,730 not like numeric values to predict. 245 00:17:52,940 --> 00:17:57,240 So such problems are usually treated as classification problems, but 246 00:17:57,240 --> 00:18:02,150 a weight matrix is introduced to account for the order of the labels. 247 00:18:03,270 --> 00:18:08,720 For example, the weights can be linear: if we predict two instead of one, we pay one. 248 00:18:09,820 --> 00:18:13,971 If we predict three instead of one, we pay two. 249 00:18:13,971 --> 00:18:19,506 Or the weights can be quadratic: if we predict two instead of one, 250 00:18:19,506 --> 00:18:25,150 we still pay one, but if we predict three instead of one, we now pay four. 251 00:18:26,190 --> 00:18:30,510 Depending on what weight matrix is used, 252 00:18:30,510 --> 00:18:34,460 we get either linear weighted kappa or quadratic weighted kappa. 253 00:18:35,590 --> 00:18:40,281 Quadratic weighted kappa has been used in several competitions on Kaggle. 254 00:18:40,281 --> 00:18:44,761 It is usually explained as an inter-rater agreement coefficient: 255 00:18:44,761 --> 00:18:49,649 how much the predictions of the model agree with ground-truth raters. 256 00:18:49,649 --> 00:18:53,024 Which is quite intuitive for medical applications: 257 00:18:53,024 --> 00:18:56,410 how much the model agrees with professional doctors. 258 00:18:57,410 --> 00:19:01,590 Finally, in this video, we've discussed classification metrics. 259 00:19:02,620 --> 00:19:07,580 Accuracy is an essential metric for classification. 260 00:19:07,580 --> 00:19:12,485 But a simple model that always predicts the same value can have 261 00:19:12,485 --> 00:19:16,991 a very high accuracy, which makes this metric hard to interpret. 262 00:19:16,991 --> 00:19:21,473 The score also depends on the threshold we choose to convert soft 263 00:19:21,473 --> 00:19:23,554 predictions to hard labels. 264 00:19:23,554 --> 00:19:25,600 Log loss is another metric; 265 00:19:25,600 --> 00:19:31,493 as opposed to accuracy, it depends on soft predictions rather than on hard labels. 266 00:19:31,493 --> 00:19:36,408 And it forces the model to predict the probabilities of an object belonging 267 00:19:36,408 --> 00:19:37,500 to each class. 268 00:19:37,500 --> 00:19:43,158 AUC, area under the receiver operating curve, doesn't depend on the absolute values 269 00:19:43,158 --> 00:19:48,430 predicted by the classifier, but only considers the ordering of the objects. 270 00:19:49,750 --> 00:19:54,530 It also implicitly tries all the thresholds to convert soft predictions to 271 00:19:54,530 --> 00:20:00,210 hard labels, and thus removes the dependence of the score on the threshold. 272 00:20:01,960 --> 00:20:08,262 Finally, Cohen's Kappa fixes the baseline for the accuracy score to be zero. 273 00:20:08,262 --> 00:20:11,777 In spirit, it is very similar to how R-squared 274 00:20:11,777 --> 00:20:15,390 rescales the MSE value to make it easier to interpret. 275 00:20:17,480 --> 00:20:24,910 If instead of accuracy we used weighted accuracy, we would get weighted kappa.
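As a closing sketch, the linear and quadratic weight matrices for the 1-to-3 ratings above can be built like this; the labels are hypothetical, and scikit-learn's cohen_kappa_score is one real implementation of these weightings:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

ratings = np.arange(1, 4)                                  # ordered labels 1, 2, 3
W_linear = np.abs(ratings[:, None] - ratings[None, :])     # pay |i - j|
W_quadratic = (ratings[:, None] - ratings[None, :]) ** 2   # pay (i - j)^2
# e.g. predicting 3 instead of 1 costs 2 linearly, but 4 quadratically

# Hypothetical ground truth and predictions, just to show the call:
y_true = [1, 2, 3, 3, 2, 1]
y_pred = [1, 2, 2, 3, 3, 1]
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
```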
276 00:20:24,910 --> 00:20:29,746 Weighted kappa with quadratic weights is called quadratic weighted kappa and is 277 00:20:29,746 --> 00:20:31,366 commonly used on Kaggle. 278 00:20:31,366 --> 00:20:41,366 [MUSIC]