So in the previous video, we discussed LogLoss and Accuracy. In this video we'll discuss Area Under Curve (AUC) and (quadratic weighted) Kappa.

Let's start with AUC. Although the loss function of AUC has zero gradients almost everywhere, exactly like accuracy loss, there exists an algorithm to optimize AUC with gradient-based methods, and some models implement this algorithm, so we can use it by setting the right parameters. I will give you an idea of this method without much detail, as there is more than one way to implement it.

Recall that originally a classification task is usually solved at the level of individual objects: we want to assign 0 to the red objects and 1 to the green ones. But we do it independently for each object, so our loss is pointwise: we compute it for each object individually, then sum or average the losses over all objects to get the total loss.

Now, recall that AUC is the probability that a pair of objects is ordered in the right way. So ideally, we want the predictions ŷ for the green objects to be larger than those for the red ones. Therefore, instead of working with single objects, we should work with pairs of objects, and instead of a pointwise loss, we should use a pairwise loss. A pairwise loss takes the predictions and labels for a pair of objects and computes their loss. Ideally, the loss would be zero when the ordering is correct and greater than zero when the ordering is incorrect, but in practice different loss functions can be used. For example, we can use logloss. We may think of the target for this pairwise loss as always being one: the prob of the difference (the green prediction minus the red one) should be one. That is why there is only one term in the logloss objective instead of two. The prob function in the formula is needed to make sure that the difference between the predictions stays in the [0, 1] range, and I use it here just for the sake of simplicity.

Basically, XGBoost and LightGBM have the pairwise loss we've discussed implemented. It is also straightforward to implement in any neural net library, and for sure you can find implementations on GitHub.
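To make the idea concrete, here is a minimal NumPy sketch of the pairwise logloss described above. It assumes the prob function is the sigmoid, and the function name and the vectorized all-pairs form are illustrative choices, not the exact implementation used by any particular library.

```python
import numpy as np

def pairwise_logloss(y_true, y_pred):
    # Predictions for "green" (positive) and "red" (negative) objects.
    pos = y_pred[y_true == 1]
    neg = y_pred[y_true == 0]
    # Difference for every (green, red) pair, shape (n_pos, n_neg).
    diff = pos[:, None] - neg[None, :]
    # prob() squashes each difference into (0, 1); here it is the sigmoid.
    prob = 1.0 / (1.0 + np.exp(-diff))
    # The pairwise target is always 1, so only one logloss term remains.
    return -np.mean(np.log(prob + 1e-15))
```

In the gradient boosting libraries, this idea is exposed through ranking objectives, for example XGBoost's rank:pairwise objective (LightGBM offers similar ranking objectives), which score objects by comparing pairs rather than treating each object independently.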
I should say that in practice, most people still use logloss as the optimization loss without any further post-processing. I have personally observed XGBoost trained with logloss to give an AUC score comparable to the one trained with the pairwise loss.

All right. Now let's move to the last topic to discuss: the Quadratic Weighted Kappa metric. There are two methods. One is very common and very easy; the second is not that common and will require you to implement a custom loss function for either XGBoost or a neural net. But we've already implemented it for XGBoost, so you will be able to find the implementation in the reading materials.

Let's start with the simple one. Recall that we're solving an ordered classification problem, and our labels can be thought of as integer ratings, say from one to five. The task is classification, as we cannot output, for example, 4.5 as an answer. But anyway, we can treat it as a regression problem and then somehow post-process the predictions, converting them to integer ratings. And actually, quadratic weights make Kappa somewhat similar to regression with MSE loss, if we allow our predictions to take values between the labels, that is, if we relax the predictions. But in fact, it is different from MSE: relaxed Kappa is one minus MSE divided by something that depends on the predictions. And it looks like everyone's logic is: well, there is MSE in the numerator of that fraction, we can optimize it, so let's not care about the denominator. Of course, this is not the correct way to do it, but it turns out to be useful in practice.

But anyway, MSE gives us float values instead of integers, so now we need to somehow convert them into integers. The straightforward way would be to round all the predictions. But we can think about rounding as applying thresholds: if the value is greater than 3.5 and less than 4.5, then output 4. But then we can ask ourselves: why do we use exactly those thresholds? Let's tune them. And again, it's straightforward: it can easily be done with grid search. So to summarize, we fit MSE loss to our data and then find appropriate thresholds.
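As a rough illustration of that second step, here is a sketch of threshold tuning, assuming ratings from 1 to 5 and using scikit-learn's cohen_kappa_score with quadratic weights. The coordinate-wise search over each cut point and the 0.01 step size are arbitrary choices for the sketch, not a prescribed recipe.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def apply_thresholds(pred, thresholds):
    # Map continuous predictions to integer ratings 1..5 via 4 cut points.
    return np.digitize(pred, thresholds) + 1

def tune_thresholds(y_true, pred):
    # Start from plain rounding and tune one cut point at a time.
    thresholds = [1.5, 2.5, 3.5, 4.5]
    for i in range(4):
        # Keep the i-th cut point between ratings i+1 and i+2,
        # so the thresholds stay ordered.
        candidates = np.arange(i + 1.01, i + 2.0, 0.01)
        scores = [
            cohen_kappa_score(
                y_true,
                apply_thresholds(pred, thresholds[:i] + [c] + thresholds[i + 1:]),
                weights="quadratic",
            )
            for c in candidates
        ]
        thresholds[i] = float(candidates[int(np.argmax(scores))])
    return thresholds
```

Here pred would come from any regressor fit with squared error, for example an XGBoost regressor trained with its default squared-error objective on a validation split; the tuned thresholds are then applied to the test predictions.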
Finally, there is a paper which suggests a way to relax the classification problem to regression while also dealing with that hard-to-handle part in the denominator that we had. I will not get into the details here, but it is a clearly written and easy-to-understand paper, so I really encourage you to read it. What's more, you can find the loss implementation in the reading materials and just use it if you don't want to read the paper.

And with that, we've finished this lesson. We've discussed that the evaluation, or target, metric is how all submissions are scored. We've discussed the difference between the target metric and the optimization loss: the optimization loss is what our model actually optimizes, and it is not always the same as the target metric we want to optimize. Sometimes we can only set our model to optimize something completely different from the target metric, but then we usually try to post-process the predictions to make them better fit the target metric. We've discussed the intuition behind different metrics for regression and classification tasks, and we saw how to efficiently optimize different metrics. I hope you've enjoyed this lesson, and see you later.