[MUSIC]

Hi. In this lesson we will talk about a major part of any competition: the metrics that are used to evaluate a solution. In this video we will discuss why there are so many metrics and why it is necessary to know which metric is used in a competition. In the following videos we will study the difference between a loss and a metric, and we will overview and demonstrate optimization techniques for the most important and common metrics. In this course we focus on regression and classification, so we only discuss metrics for these tasks. For better understanding, we will also build a simple baseline for each metric, that is, the best constant to predict under that particular metric (a small sketch of this idea follows below).

So, metrics are an essential part of any competition: they are used to evaluate our submissions. Okay, but why do we have a different evaluation metric in each competition? That is because there are plenty of ways to measure the quality of an algorithm, and each company decides for itself what is the most appropriate way for its particular problem. For example, say an online shop is trying to maximize the effectiveness of its website. The thing is, you first need to formalize what "effectiveness" means; you need to define a metric for how effectiveness is measured. It could be the number of times the website was visited, or the number of times something was ordered through the website. So the company usually decides for itself which quantity is most important to it, and then tries to optimize that quantity.

In a competition, the metric is fixed for us, and the models and competitors are ranked using it. To get a higher leaderboard score, you need to get a better metric score. That is basically the only thing in a competition we need to care about: how to get a better score. So it is very important to understand how the metric works and how to optimize it efficiently. I want to stress that it is really important to optimize exactly the metric we are given in the competition, and not any other metric.
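As a quick aside on the "best constant" baselines mentioned above: for some common metrics the optimal constant has a simple closed form, namely the target mean for MSE and the target median for MAE. A minimal sketch with toy data, using MSE and MAE purely as examples:

```python
import numpy as np

# Toy target values; in a competition this would be the train target column.
y = np.array([1.0, 2.0, 2.0, 5.0, 9.0])

# Best constant prediction: the mean of the target for MSE,
# the median of the target for MAE.
best_const_mse = y.mean()      # 3.8
best_const_mae = np.median(y)  # 2.0

def mse(y_true, pred):
    return np.mean((y_true - pred) ** 2)

def mae(y_true, pred):
    return np.mean(np.abs(y_true - pred))

print(mse(y, best_const_mse), mse(y, best_const_mae))  # the mean wins under MSE
print(mae(y, best_const_mae), mae(y, best_const_mse))  # the median wins under MAE
```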
Consider an example: the blue and red markers represent objects of class zero and class one respectively, and say we decided to use a linear classifier and came up with two metrics to optimize, M1 and M2. The question is, how different would the resulting classifiers be? Actually, by a lot. The two lines here, the solid one and the dashed one, show the best linear decision boundaries for the two cases. For the dashed boundary, the M1 score is the highest among all possible hyperplanes, but its M2 score is low. And we have the opposite situation for the solid boundary: its M2 score is the highest, whereas its M1 score is low. Now, if we know that in this particular competition the ranking is based on the M1 score, then we need to optimize the M1 score, and so we should submit the predictions of the model with the dashed boundary. Once again: if your model is scored with some metric, you get the best results by optimizing exactly that metric.

Now, the biggest problem is that some metrics cannot be optimized efficiently; that is, there is no simple enough way to find, say, the optimal hyperplane directly. That is why sometimes we need to train our model to optimize something different from the competition metric, but in that case we will need to apply various heuristics to improve the competition metric score afterwards.

And there is another case where we need to be smart about the metric: when the train and test sets are different. In the lesson about leaks we will discuss leaderboard probing; that is, we can check, for example, whether the mean target value on the public part of the test set is the same as on the train set. If it is not, we would need to adapt our predictions to suit the test set better (a small sketch of such an adjustment follows below). This is basically a specific metric optimization technique that we apply because train and test are different. There can also be more severe cases, where an improved metric on the validation set does not translate into an improved metric on the test set. In these situations it is a good idea to stop and think: maybe there is a different way to approach the problem. In particular, time series can be very challenging to forecast. Even if you set up the validation just right, splitting by time with rolling windows, the distribution in the future can still be much different from what we had in the train set. Or sometimes there is simply not enough training data, so a model cannot capture the patterns.
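Here is a minimal sketch of the adjustment mentioned above, under the assumption that an estimate of the test-set target mean is already available (for example, from leaderboard probing); the numbers and variable names are purely illustrative:

```python
import numpy as np

# Toy predictions of our model on the public part of the test set.
test_preds = np.array([0.20, 1.40, 3.10, 0.70])

train_target_mean = 0.45  # mean target over the train set
probed_test_mean = 0.60   # hypothetical estimate obtained by leaderboard probing

# If the two means differ, shift every prediction by the difference ...
shifted_preds = test_preds + (probed_test_mean - train_target_mean)

# ... or, for a strictly positive target, rescale the predictions instead.
rescaled_preds = test_preds * probed_test_mean / train_target_mean
```

Which correction, if any, is appropriate depends on the metric and on how the target is distributed; the point is only that the predictions are adapted toward what is known about the test set.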
In one of the competitions I took part in, I had to use a trick to boost my score after the modeling was done, and the trick was a consequence of the particular metric used in that competition. The metric was quite unusual, actually, but it is intuitive: if the trend is guessed correctly, then the absolute difference between the prediction and the target is taken as the error. If, for instance, the model predicts the value in the prediction horizon to be higher than the last value from the train set, but in reality it is lower, then the trend is predicted incorrectly, and the error is set to the absolute difference squared. So if we predict a value to be above the dashed line but it turns out to be below, or vice versa, the trend is considered to be predicted incorrectly. This metric therefore cares much more about predicting the correct trend than about the actual values you predict, and that is something it was possible to exploit.

There were several time series to forecast, the horizon to predict was long, and the model's predictions were unreliable. Moreover, it was not possible to optimize this metric exactly. So I realized that it would be much better to set all the predictions to either the last known value plus a very tiny constant or the last known value minus a very tiny constant, using the same value for all the points in the time interval we had to predict for each time series. The sign depends on an estimate of what is more likely: the values in the horizon being lower than the last known value, or higher. This trick actually took me to first place in that competition. So finding a good way to optimize a metric can give you an advantage over the other participants, especially if the metric is peculiar. Maybe I should formulate it like this: we should not forget to do a kind of exploratory metric analysis along with exploratory data analysis, at least when the metric is an unusual one.
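To make this story more concrete, here is a minimal sketch of a metric with the behaviour described above and of the "last value plus or minus a tiny constant" baseline. The exact metric from that competition is not given here, so the function below is an illustrative reconstruction rather than the official definition:

```python
import numpy as np

def trend_metric(last_train_value, y_true, y_pred):
    # Per-point error: absolute difference when the trend (above or below the
    # last known train value) is guessed correctly, squared absolute difference
    # when it is not.
    diff = np.abs(y_true - y_pred)
    correct_trend = np.sign(y_true - last_train_value) == np.sign(y_pred - last_train_value)
    return float(np.mean(np.where(correct_trend, diff, diff ** 2)))

# Toy example: one series whose future values happen to lie above the last value.
last_value = 10.0
y_true = np.array([12.0, 13.5, 11.8])

eps = 1e-3  # "a very tiny constant"
pred_up = np.full_like(y_true, last_value + eps)    # bet that the series goes up
pred_down = np.full_like(y_true, last_value - eps)  # bet that the series goes down

print(trend_metric(last_value, y_true, pred_up))    # trend correct: plain absolute errors
print(trend_metric(last_value, y_true, pred_down))  # trend wrong: the errors get squared
```

In this sketch the errors are squared whenever the trend is wrong, so guessing the direction dominates the score, which is what made such a constant baseline competitive in the story above.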
So, in this video we have understood that each business has its own way to measure the effectiveness of an algorithm, based on its needs, and that is why there are so many different metrics. And we saw two motivating examples of why we should care about the metric: basically, because it is how competitors are compared to each other. In the following videos we will talk about concrete metrics. We will first discuss the high-level intuition for each metric and then talk about optimization techniques.

[MUSIC]