[SOUND] So far we've discussed different metrics, their definitions, and the intuition behind them. We've studied the difference between the optimization loss and the target metric. In this video, we'll see how we can efficiently optimize the metrics used for regression problems. As we've discussed, we can always use early stopping, so I won't mention it for every metric, but keep it in mind.

Let's start with mean squared error. It's the most commonly used metric for regression tasks, so we should expect it to be easy to work with. In fact, almost every modelling software implements MSE as a loss function, so all you need to do to optimize it is to turn it on in your favorite library. Here are some of the libraries that support mean squared error optimization. Both XGBoost and LightGBM will do it easily. RandomForestRegressor from sklearn can also split based on MSE, thus optimizing it in each node individually. A lot of linear models are implemented in scikit-learn, and most of them are designed to optimize MSE, for example ordinary least squares, Ridge regression, Lasso regression, and so on. There is also the SGDRegressor class in sklearn. It also implements a linear model, but differently from the other linear models in sklearn, it uses stochastic gradient descent for training and is thus very versatile; and of course, MSE is built in. The library for online learning of linear models also accepts MSE as a loss function. And every neural net package like PyTorch, Keras, or TensorFlow has MSE loss implemented. You just need to find an example on GitHub or wherever, and see what name MSE loss has in that particular library. For example, it is sometimes called L2 loss, as it corresponds to the L2 distance. Basically, for all the metrics we consider in this lesson you may find plenty of names, since they were used and discovered independently in different communities.

Now, what about mean absolute error? MAE is popular too, so it is easy to find a model that will optimize it. Unfortunately, XGBoost cannot optimize MAE, because MAE has zero as a second derivative, while LightGBM can. So you still can use gradient boosted decision trees for this metric. The MAE criterion is implemented for RandomForestRegressor from sklearn, but note that the running time will be quite high compared with the MSE criterion. Unfortunately, the linear models from sklearn, including SGDRegressor, cannot optimize MAE natively. But there is a loss called Huber loss which is implemented in some of the models. Basically, it is very similar to MAE, especially when the errors are large; we will discuss it on the next slide. In [INAUDIBLE], MAE loss is implemented, but under a different name: it is called quantile loss. In fact, MAE is just a special case of quantile loss. I will not go into the details here, but just recall that MAE is somehow connected to median values, and the median is a particular quantile. What about neural networks? As we've discussed, MAE is not differentiable only when the predictions are equal to the target, and that is a rare case. That is why we may use any neural network to optimize MAE. It may be that you will not find MAE implemented in a neural net library, but it is very easy to implement it. In fact, all the model needs is the gradient of the loss function with respect to the predictions, and in this case it is just a step function: the sign of the difference between prediction and target. Other names you may encounter for MAE are L1, that is, L1 loss, and sometimes people refer to this special case of quantile regression as median regression. There are a lot of ways to make MAE smooth.
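To make the "just turn it on in your favorite library" point concrete, here is a minimal sketch; it assumes recent versions of scikit-learn and LightGBM, and the exact loss and criterion names vary between library versions:

```python
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb

# MSE is usually the default squared loss.
sgd_mse = SGDRegressor(loss="squared_error")                # linear model trained with SGD
rf_mse = RandomForestRegressor(criterion="squared_error")   # MSE-based splits
gbm_mse = lgb.LGBMRegressor(objective="regression")         # L2 objective

# MAE: LightGBM supports it directly; the random forest split criterion
# also supports it, but is noticeably slower than the MSE criterion.
gbm_mae = lgb.LGBMRegressor(objective="regression_l1")
rf_mae = RandomForestRegressor(criterion="absolute_error")

# Any of these models is then trained the usual way:
# model.fit(X_train, y_train)
```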
You can actually make up your own smooth function that looks like the MAE error. The most famous one is Huber loss. It is basically a mix between MSE and MAE: MSE is computed when the error is small, so we can safely approach zero error, and MAE is computed for large errors, giving robustness.

So, at this point we have discussed the libraries that can optimize mean squared error and mean absolute error. Now, let's get to the less common relative metrics, MSPE and MAPE. It's much harder to find a model which can optimize them out of the box. Of course, we can always either implement a custom loss for XGBoost or a neural net, which is really easy to do there, or we can optimize a different metric and do early stopping. But there are several specific approaches that I want to mention. This approach is based on the fact that MSPE is a weighted version of MSE and MAPE is a weighted version of MAE. On the slide, we see the expressions for the weights for MSPE and MAPE. The sum in the denominator just ensures that the weights sum up to 1, but that is not required. Intuitively, the sample weights indicate how important each object is for us while training the model: the smaller the target, the more important the object. So, how do we use this knowledge? In fact, many libraries accept sample weights. Say we want to optimize MSPE. If we set the sample weights to the ones from the previous slide, we can use MSE loss with them, and the model will actually optimize the desired MSPE loss. Although the most important libraries like XGBoost, LightGBM, and most neural net packages support sample weighting, not every library implements it. But there is another method which works whenever a library can optimize MSE or MAE; nothing else is needed. All we need to do is to create a new training set by sampling it from the original set that we have, and fit a model with, for example, the MSE criterion if we want to optimize MSPE. It is important to set the probability for each object to be sampled to the weights we've calculated. The size of the new data set is up to you; you can sample, for example, twice as many objects as there were in the original train set. And note that we do not need to do anything with the test set; it stays as is. I would also advise you to resample the train set several times, each time fitting a model, and then average the models' predictions; this will make the score much better and more stable. There is another way we can optimize MSPE; this approach was widely used during the Rossmann competition on Kaggle. It can be proved that, if the errors are small, we can optimize the predictions in logarithmic scale, which is actually similar to what we will do on the next slide. We will not go into the details, but you can find a link to the explanation in the reading materials.
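As an illustration of both tricks, here is a minimal sketch on synthetic data; the weights follow the MSPE expression above (for MAPE you would use 1/y instead of 1/y²), the linear model and the doubled sample size are just placeholder choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
y_train = np.exp(X_train @ rng.normal(size=5) + 1.0)  # positive targets

# MSPE weights are proportional to 1 / y_i^2 (for MAPE, use 1 / y_i instead).
w = 1.0 / y_train ** 2
w = w / w.sum()  # normalizing so the weights sum to 1 (optional)

# Approach 1: pass the sample weights and optimize plain MSE.
weighted_model = LinearRegression().fit(X_train, y_train, sample_weight=w)

# Approach 2: resample the train set with probabilities equal to the weights,
# here to twice its original size, and fit an ordinary MSE model on the result.
# The test set is left untouched.
idx = rng.choice(len(y_train), size=2 * len(y_train), replace=True, p=w)
resampled_model = LinearRegression().fit(X_train[idx], y_train[idx])
```

Repeating the resampling several times and averaging the resulting models' predictions, as suggested above, usually gives a better and more stable score.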
And finally, let's get to the last regression metric we have to discuss: root mean squared logarithmic error. It turns out to be quite easy to optimize because of its connection with the MSE loss. All we need to do is first apply a transform to our target values, in this case the logarithm of the target plus one; let's denote the transformed target with a variable z for now. Then we need to fit a model with MSE loss to the transformed target. To get a prediction for a test object, we first obtain the prediction z hat in the logarithmic scale, just by calling model.predict or something like that. And next, we do an inverse transform from the logarithmic scale back to the original one by exponentiating z hat and subtracting one, and this is how we obtain the predictions y hat for the test set (see the short sketch after the summary). In this video, we ran through regression metrics and the tools to optimize them. MSE and MAE are very common and implemented in many packages. MSPE and MAPE can be optimized either by resampling the data set or by setting proper sample weights. RMSLE is optimized by optimizing MSE in log space. In the next video, we will see optimization techniques for classification metrics. [MUSIC]
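As referenced above, here is a minimal sketch of the RMSLE log-transform trick, reusing the same synthetic data setup as in the previous sketch; any model that optimizes MSE would work in place of the linear regression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
y_train = np.exp(X_train @ rng.normal(size=5) + 1.0)  # positive targets
X_test = rng.normal(size=(100, 5))

# Train in the logarithmic scale: z = log(y + 1).
z_train = np.log1p(y_train)
model = LinearRegression().fit(X_train, z_train)  # plain MSE loss on z

# Predict z_hat in log scale, then invert the transform: y_hat = exp(z_hat) - 1.
z_hat = model.predict(X_test)
y_hat = np.expm1(z_hat)
```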