In this and the next video, we will discuss ways to optimize classification metrics. In this video, we will cover logloss and accuracy, and in the next one, AUC and quadratic weighted kappa.

Let's start with logloss. Logloss for classification is like MSE for regression: it is implemented everywhere, and all we need to do is find out which arguments should be passed to a library to make it use logloss for training. There is a huge number of libraries to try: XGBoost, LightGBM, LogisticRegression and SGDClassifier from sklearn, Vowpal Wabbit. All neural nets, by default, optimize logloss for classification.

Random forest classifier predictions, however, turn out to be quite bad in terms of logloss. But there is a way to make them better: we can calibrate the predictions to better fit logloss.

We've mentioned several times that logloss requires the model to output posterior probabilities, but what does that mean? It means that if we take all the points that have a score of, for example, 0.8, then there will be exactly four times more positive objects than negative ones. That is, 80% of those points will be from class 1, and 20% from class 0. If a classifier doesn't directly optimize logloss, its predictions should be calibrated.

Take a look at this plot. The blue line shows the predictions for the validation set, sorted by value, and the red line shows the corresponding target values, smoothed with a rolling window. We clearly see that our predictions are rather conservative: they are much greater than the target mean on the left side, and much lower than they should be on the right side. So this classifier is not calibrated. The green curve shows the predictions after calibration: if we plot the sorted predictions of the calibrated classifier, the curve is very similar to the target rolling mean, and in fact the calibrated predictions will have a lower logloss.

Now, there are several ways to calibrate predictions. First, we can use so-called Platt scaling: basically, we just fit a logistic regression to our predictions. I will not go into the details of how to do that, but it is very similar to how we stack models, and we will discuss stacking in detail in a different video.
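As a concrete illustration, here is a minimal sketch of Platt scaling with scikit-learn; the synthetic dataset, model choice, and split sizes are my own assumptions, not the lecture's setup:

```python
# A minimal sketch of Platt scaling: calibrate a random forest's scores
# by fitting a logistic regression on held-out predictions.
# The dataset and hyperparameters here are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
# As in stacking, fit the calibrator on data the forest has not seen,
# and evaluate on yet another part.
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Platt scaling: a logistic regression fit on the raw scores.
raw_cal = rf.predict_proba(X_cal)[:, 1]
platt = LogisticRegression().fit(raw_cal.reshape(-1, 1), y_cal)

raw_test = rf.predict_proba(X_test)[:, 1]
cal_test = platt.predict_proba(raw_test.reshape(-1, 1))[:, 1]
print("logloss before calibration:", log_loss(y_test, raw_test))
print("logloss after calibration: ", log_loss(y_test, cal_test))
```

Note that scikit-learn also packages this recipe as CalibratedClassifierCV with method='sigmoid' (and method='isotonic' for the isotonic variant discussed next), and sklearn.calibration.calibration_curve can draw the kind of reliability plot described above.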
Second, we can fit isotonic regression to our predictions, and again, this is done very similarly to stacking; it is just another model.

And finally, we can use stacking itself. The idea is that we can fit any classifier: it doesn't need to optimize logloss, it just needs to be good, for example, in terms of AUC. Then we fit another model on top that takes the predictions of our model and calibrates them properly. That model on top will use logloss as its optimization loss, so the whole pipeline will be optimizing logloss, if only indirectly, and its predictions will be calibrated.

Logloss is the only metric here that is easy to optimize directly. For accuracy, there is no easy recipe for direct optimization. In general, the recipe is the following: if it is a binary classification task, fit any model with any metric and tune the binarization threshold; for multi-class tasks, fit any model with any metric and tune its parameters by comparing models on their accuracy score, not on the metric they were actually optimizing. So this is a kind of early stopping and cross-validation where you look at the accuracy score.

Just to get an intuition for why accuracy is hard to optimize, let's look at this plot. The vertical axis shows the loss, and the horizontal axis shows the signed distance to the decision boundary, for example, to a hyperplane for a linear model. The distance is considered positive if the class is predicted correctly, and negative if the object is located on the wrong side of the decision boundary.

The blue line shows the zero-one loss; this is the loss that corresponds to the accuracy score: we pay 1 if an object is misclassified, that is, if it has negative distance, and we pay nothing otherwise. The problem is that this loss has zero gradient almost everywhere with respect to the predictions, and most learning algorithms require a nonzero gradient to fit; otherwise it is not clear how the predictions should change to decrease the loss.

So people came up with proxy losses that are upper bounds for the zero-one loss: if you fit a proxy loss perfectly, the accuracy will be perfect too, but unlike the zero-one loss, these proxies are differentiable.
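To make the upper-bound claim concrete, here is a small numeric sketch (not the lecture's plot) that evaluates the zero-one loss and the two proxies named next, the logistic loss and the hinge loss, on a grid of signed distances; the base-2 scaling of the logistic loss is my assumption, chosen so the bound holds exactly:

```python
# Losses as functions of the signed distance m to the decision boundary.
import numpy as np

m = np.linspace(-3, 3, 601)

zero_one = (m < 0).astype(float)       # pay 1 on the wrong side, 0 otherwise
logistic = np.log2(1.0 + np.exp(-m))   # smooth proxy, used in logistic regression
hinge = np.maximum(0.0, 1.0 - m)       # piecewise-linear proxy, used in SVMs

# Both proxies bound the zero-one loss from above and have a nonzero
# gradient over a wide range of m, which is what gradient-based fitting needs.
assert np.all(logistic >= zero_one)
assert np.all(hinge >= zero_one)
```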
For example, you can see here the logistic loss, the red curve, which is used in logistic regression, and the hinge loss, which is used in SVMs.

Now recall that to obtain hard labels for a test object, we usually take the argmax of our soft predictions, picking the class with the maximum score. If our task is binary and the soft predictions sum to 1, argmax is equivalent to a threshold function: output 1 when the prediction for class 1 is higher than 0.5, and output 0 when it is lower. We've already seen an example where the threshold 0.5 is not optimal, so what can we do? We can tune the threshold we apply, with a simple grid search implemented with a for loop; a short sketch appears at the end of this transcript.

This means we can basically fit any sufficiently powerful model; it will not matter much which loss exactly, say hinge loss or logloss, the model optimizes. All we want from our model's predictions is the existence of a good threshold that separates the classes.

Also, if our classifier were ideally calibrated, it would really be returning posterior probabilities, and for such a classifier the threshold 0.5 would be optimal. But such classifiers are rarely the case in practice, so threshold tuning often helps.

So, in this video we discussed logloss and accuracy; in the next video, we will discuss AUC and quadratic weighted kappa.
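As promised above, here is a minimal sketch of the threshold search; `probs` and `y_val` are hypothetical names for the validation-set soft predictions for class 1 and the true labels:

```python
# A simple grid search over the binarization threshold, tuned for accuracy.
import numpy as np
from sklearn.metrics import accuracy_score

def tune_threshold(probs, y_val):
    best_t, best_acc = 0.5, accuracy_score(y_val, (probs > 0.5).astype(int))
    for t in np.linspace(0.01, 0.99, 99):
        acc = accuracy_score(y_val, (probs > t).astype(int))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Usage (illustrative): pick the threshold on validation data only,
# then apply that fixed threshold to the test predictions.
# t, _ = tune_threshold(model.predict_proba(X_val)[:, 1], y_val)
# test_labels = (model.predict_proba(X_test)[:, 1] > t).astype(int)
```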