In this video, we will discuss what a loss is and what a metric is, and what the difference between them is. Then we will overview the general approaches to metric optimization.

Let's start with a comparison between two notions: loss and metric. The metric, or target metric, is the function we want to use to evaluate the quality of our model. For example, for a classification task, we may want to maximize the accuracy of our predictions, that is, how frequently the model outputs the correct label. But the problem is that no one really knows how to optimize accuracy efficiently. Instead, people come up with proxy loss functions. These are functions that are easy to optimize for a given model. For example, logarithmic loss is widely used as an optimization loss, while the accuracy score is how the solution is eventually evaluated.

So, once again, the loss function is what our model actually optimizes during training, and the target metric is how we want the solution to be evaluated. It is a kind of expectation-versus-reality thing.

Sometimes we are lucky and the model can optimize our target metric directly. For example, most libraries can optimize the mean squared error metric out of the box, so the loss function is the same as the target metric. And sometimes we want to optimize metrics that are really hard or even impossible to optimize directly. In this case, we usually set the model to optimize a loss that is different from the target metric, but after the model is trained, we use hacks and heuristics to negate the discrepancy and adjust the model to better fit the target metric. We will see examples of both cases in the following videos.

The last thing to mention is that loss, metric, cost, objective, and other such notions are used more or less as synonyms. It is completely okay to say "target loss" or "optimization metric", but we will fix the wording for clarity from now on.

So far, we have understood why it is important to optimize the metric given in a competition, and we have discussed the difference between the optimization loss and the target metric. Now, let's overview the approaches to target metric optimization in general.
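As a concrete illustration of this loss-versus-metric split, here is a minimal sketch; scikit-learn and the synthetic dataset are illustrative assumptions, not part of the lecture. The model internally minimizes the logarithmic loss, while accuracy is reported as the target metric.

```python
# Minimal sketch: the optimization loss is log loss, the target metric is accuracy.
# scikit-learn and the synthetic dataset are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# LogisticRegression minimizes the logarithmic loss -- the proxy loss function.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The solution is evaluated with the target metric (accuracy),
# even though the model never optimized it directly.
print("optimization loss (log loss):", log_loss(y_val, model.predict_proba(X_val)))
print("target metric (accuracy):", accuracy_score(y_val, model.predict(X_val)))
```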
The approaches can be broadly divided into several categories, depending on the metric we need to optimize.

Some metrics can be optimized directly. That is, we just need to find a model that can optimize this metric and run it. In fact, all we need to do is set the model's loss function to this metric. The most common metrics, like MSE and logloss, are implemented as loss functions in almost every library.

For some of the metrics that cannot be optimized directly, we can pre-process the train set and use a model with a loss function that is easy to optimize. For example, while the MSPE metric cannot be optimized directly with XGBoost, we will see later that we can resample the train set and optimize the MSE loss instead, which XGBoost can optimize. Sometimes we will optimize an incorrect metric, but then post-process the predictions to fit the competition metric better.

For some models and frameworks, it is possible to define a custom loss function, and sometimes it is possible to implement a loss function that serves as a nice proxy for the desired metric. For example, it can be done for quadratic weighted kappa, as we will see later. It is actually quite easy to define a custom loss function for XGBoost. We only need to implement a single function that takes the predictions and the target values and computes the first- and second-order derivatives of the loss function with respect to the model's predictions. For example, here you see one for the logloss. Of course, the loss function should be smooth enough and have well-behaved derivatives; otherwise, XGBoost will go crazy.
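Since the slide with that logloss objective is not reproduced in the transcript, here is a sketch of what such a function typically looks like for XGBoost's Python API; the tiny synthetic dataset and parameter values are illustrative assumptions. It takes the raw predictions and the training data and returns the first- and second-order derivatives of the logloss.

```python
# Sketch of a custom logloss objective for XGBoost (xgboost Python package assumed).
import numpy as np
import xgboost as xgb

def logloss_objective(preds, dtrain):
    """Return gradient and hessian of log loss w.r.t. the raw (margin) predictions."""
    labels = dtrain.get_label()
    probs = 1.0 / (1.0 + np.exp(-preds))  # sigmoid turns margins into probabilities
    grad = probs - labels                 # first-order derivative of log loss
    hess = probs * (1.0 - probs)          # second-order derivative of log loss
    return grad, hess

# Illustrative usage on a tiny synthetic dataset.
X = np.random.rand(200, 5)
y = (X[:, 0] > 0.5).astype(float)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 3}, dtrain, num_boost_round=50, obj=logloss_objective)
```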
In this course, we consider only a small set of metrics, but in fact there are plenty of them, and for some it is really hard to come up with a neat optimization procedure or write a custom loss function. Thankfully, there is a method that always works. It is called early stopping, and it is very simple. You set the model to optimize any loss function it can optimize, and you monitor the desired metric on a validation set. You then stop the training when the model starts to overfit according to the desired metric, and not according to the loss the model is truly optimizing. That is the important part.

Of course, some metrics cannot even be evaluated easily. For example, if the metric is based on human assessors' opinions, you cannot evaluate it on every iteration. For such metrics we cannot use early stopping, but we will never encounter such metrics in a competition.

So, in this video, we have discussed the discrepancy between our target metric and the loss function that our model optimizes. We have reviewed several approaches to target metric optimization and, in particular, discussed early stopping. In the following videos, we will go through the regression and classification metrics and see the hacks we can use to optimize them.
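As a closing illustration of the early stopping approach described above, here is a minimal sketch; xgboost, scikit-learn, the synthetic data, and the parameter values are all illustrative assumptions. The booster optimizes log loss, while training stops based on the metric we actually care about (here AUC) monitored on a validation set.

```python
# Sketch of early stopping: optimize one loss, stop on the desired validation metric.
# xgboost, scikit-learn, and the synthetic data/parameters are illustrative assumptions.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(1000, 10)
y = (X[:, 0] + 0.1 * rng.randn(1000) > 0.5).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

params = {
    "objective": "binary:logistic",  # the loss the model actually optimizes
    "eval_metric": "auc",            # the desired metric we monitor instead
    "max_depth": 3,
}

booster = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dval, "validation")],
    early_stopping_rounds=20,  # stop once validation AUC stops improving
)
print("best iteration by validation AUC:", booster.best_iteration)
```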