So far we've discussed different metrics, their definitions, and the intuition behind them. We've studied the difference between the optimization loss and the target metric. In this video, we'll see how we can efficiently optimize the metrics used for regression problems.

As we've discussed, we can always use early stopping, so I won't mention it for every metric, but keep it in mind.

Let's start with mean squared error. It's the most commonly used metric for regression tasks, so we should expect it to be easy to work with. In fact, almost every modelling software implements MSE as a loss function, so all you need to do to optimize it is to turn it on in your favorite library. Here are some of the libraries that support mean squared error optimization. Both XGBoost and LightGBM will do it easily. RandomForestRegressor from sklearn can also split based on MSE, thus optimizing it. A lot of linear models are implemented in scikit-learn, and most of them are designed to optimize MSE: for example, ordinary least squares, ridge regression, and so on. There is also the SGDRegressor class in sklearn. It also implements a linear model, but differently from the other linear models in sklearn, it is trained with stochastic gradient descent, which makes it very versatile. And of course, MSE loss is built in there as well. Vowpal Wabbit, the library for online learning of linear models, also accepts MSE as a loss function. And every neural net package, like PyTorch, Keras, or TensorFlow, has an MSE loss implemented. You just need to find an example on GitHub or wherever and see what name the MSE loss has in that particular library. For example, it is sometimes called L2 loss, since it is based on the L2 distance. Basically, for all the metrics we consider in this lesson, you may find plenty of names, since they were used and discovered independently in different communities.

Now, what about mean absolute error? MAE is popular too, so it is easy to find a model that will optimize it. Unfortunately, XGBoost cannot optimize MAE, because MAE has zero as its second derivative, but LightGBM can, so you still can use gradient boosted decision trees for this metric. The MAE criterion is also implemented for RandomForestRegressor in sklearn.
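To make the "just turn it on" point concrete, here is a minimal sketch, written for this transcript rather than taken from the lecture, of selecting the MSE objective in the libraries mentioned above. The exact parameter names are assumptions that can differ between library versions.

```python
# Minimal sketch (not from the lecture): switching on the MSE objective in a few
# common libraries. Parameter names may vary between library versions.
import numpy as np
import xgboost as xgb
import lightgbm as lgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, SGDRegressor

X = np.random.rand(200, 5)
y = np.random.rand(200)

models = [
    xgb.XGBRegressor(objective="reg:squarederror"),    # MSE objective in XGBoost
    lgb.LGBMRegressor(objective="regression"),         # "regression" is LightGBM's L2 (MSE) objective
    RandomForestRegressor(criterion="squared_error"),  # per-split MSE criterion ("mse" in older sklearn)
    Ridge(),                                           # linear model minimizing squared error (+ L2 penalty)
    SGDRegressor(loss="squared_error"),                # linear model trained with SGD ("squared_loss" in older sklearn)
]
for model in models:
    model.fit(X, y)

# In neural net libraries the same loss exists too, e.g. torch.nn.MSELoss() in PyTorch
# or loss="mse" in Keras.
```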
Note that with the MAE criterion, the running time will be quite high compared to the MSE criterion. Unfortunately, the linear models from sklearn, including SGDRegressor, cannot optimize MAE natively. But there is a loss called Huber loss that is implemented in some of the models; basically, it is very similar to MAE, especially when the errors are large. We will discuss it in a moment.

In Vowpal Wabbit, MAE loss is implemented, but under a different name: it is called quantile loss. In fact, MAE is just a special case of quantile loss. I will not go into the details here, but just recall that MAE is connected to the median of the target, and the median is a particular quantile.

What about neural networks? As we've discussed, MAE is not differentiable only when the predictions are equal to the target, and that is a rare case. That is why we may use any neural net package to optimize MAE. It may be that you will not find MAE implemented in a particular neural net library, but it is very easy to implement yourself: in fact, all the model needs is the gradient of the loss function with respect to the predictions, and in this case that is just a step function. Other names you may encounter for MAE are L1 and L1 loss, and sometimes people refer to this special case of quantile regression as median regression.

There are a lot of ways to make MAE smooth; you can actually make up your own smooth function whose plot looks like MAE. The most famous one is Huber loss. It is basically a mix between MSE and MAE: MSE is computed when the error is small, so we can safely approach zero error, and MAE is computed for large errors, giving robustness to outliers.
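Here is a minimal NumPy sketch of the Huber loss just described, written for this transcript rather than taken from the lecture slides; the threshold parameter delta is an illustrative choice.

```python
# Minimal sketch of Huber loss: quadratic (MSE-like) near zero error,
# linear (MAE-like) for large errors. The threshold `delta` is an illustrative choice.
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    is_small = np.abs(error) <= delta
    squared = 0.5 * error ** 2                        # used for small errors
    linear = delta * (np.abs(error) - 0.5 * delta)    # used for large errors, robust like MAE
    return np.where(is_small, squared, linear).mean()

# Ready-made versions exist as well, e.g. sklearn.linear_model.HuberRegressor,
# SGDRegressor(loss="huber"), and the Huber / smooth-L1 losses in neural net packages.
```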
So, to this end, we have discussed the libraries that can optimize mean squared error and mean absolute error. Now, let's get to the less common relative metrics: MSPE and MAPE. It's much harder to find a model that can optimize them out of the box. Of course, we can always either implement a custom loss for XGBoost or a neural net, which is really easy to do there, or we can optimize a different metric and use early stopping. But there are several specific approaches that I want to mention.

The first approach is based on the fact that MSPE is a weighted version of MSE and MAPE is a weighted version of MAE. On the right side, we see the expressions for MSPE and MAPE; the sum in the denominator just ensures that the weights sum up to 1, but that is not required. Intuitively, the sample weights indicate how important each object is for us while training the model: the smaller the target, the more important the object.

So, how do we use this knowledge? In fact, many libraries accept sample weights. Say we want to optimize MSPE. If we set the sample weights to the ones from the previous slide, we can use MSE loss with them, and the model will actually optimize the desired MSPE loss. Although the most important libraries, like XGBoost, LightGBM, and most neural net packages, support sample weighting, not every library implements it.

But there is another method which works whenever a library can optimize MSE or MAE; nothing else is needed. All we need to do is to create a new training set by sampling it from the original set that we have, and fit a model with, for example, the MSE criterion if we want to optimize MSPE. It is important to set the probability for each object to be sampled to the weights we've calculated. The size of the new data set is up to you; you can sample, for example, twice as many objects as there were in the original train set. Note that we do not need to do anything with the test set, it stays as is. I would also advise you to resample the train set several times, each time fitting a model, and then average the models' predictions. This will make the score much better and more stable.

There is another way we can optimize MSPE. This approach was widely used during the Rossmann competition on Kaggle. It can be proved that if the errors are small, we can optimize the predictions in logarithmic scale, which is similar to what we will do on the next slide, actually. We will not go into details, but you can find a link to an explanation in the reading materials.
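As an illustration of the two tricks just described (my own sketch, not the lecture's code): the MSPE weights are proportional to 1/y_i^2, so we can either pass them as sample weights or use them as sampling probabilities. The RandomForestRegressor here is just a stand-in for any model that optimizes MSE and accepts sample weights.

```python
# Sketch of the two MSPE tricks above (assumes strictly positive targets).
# For MAPE the weights would be proportional to 1/|y_i| instead of 1/y_i**2.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(500, 5)
y = np.random.rand(500) + 1.0

w = 1.0 / y ** 2
w = w / w.sum()                                   # normalization is optional

# Option 1: sample weights + plain MSE loss.
model = RandomForestRegressor().fit(X, y, sample_weight=w)

# Option 2: resample the train set with probabilities equal to the weights,
# then fit with plain MSE. The test set stays as is. Repeat and average for stability.
idx = np.random.choice(len(X), size=2 * len(X), replace=True, p=w)
model_resampled = RandomForestRegressor().fit(X[idx], y[idx])
```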
And finally, let's get to the last regression metric we have to discuss: root mean squared logarithmic error. It turns out to be quite easy to optimize because of its connection with MSE loss. All we need to do is first to apply a transform to our target values, in this case the logarithm of the target plus one. Let's denote the transformed target by the variable z. Then we fit a model with MSE loss to the transformed target. To get a prediction for a test object, we first obtain the prediction z hat in the logarithmic scale, just by calling model.predict or something like that. Next, we do an inverse transform from the logarithmic scale back to the original one by exponentiating z hat and subtracting one, and this is how we obtain the predictions y hat for the test set.

In this video, we ran through the regression metrics and the tools to optimize them. MSE and MAE are very common and implemented in many packages. MSPE and MAPE can be optimized either by resampling the data set or by setting proper sample weights. RMSLE is optimized by optimizing MSE in log space. In the next video, we will see optimization techniques for classification metrics.
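To wrap up, here is a minimal sketch of the RMSLE recipe described in this video (my own code, with Ridge standing in for any model that optimizes MSE).

```python
# RMSLE recipe: fit MSE on log-transformed targets, invert the transform at prediction time.
import numpy as np
from sklearn.linear_model import Ridge

X_train = np.random.rand(300, 5)
y_train = np.random.rand(300) * 100     # non-negative targets, as RMSLE assumes
X_test = np.random.rand(50, 5)

z_train = np.log1p(y_train)             # z = log(y + 1)
model = Ridge().fit(X_train, z_train)   # any MSE-optimizing model works here

z_hat = model.predict(X_test)           # predictions in the logarithmic scale
y_hat = np.expm1(z_hat)                 # back to the original scale: exp(z_hat) - 1
```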