In this video, we will review the most common evaluation metrics and establish an intuition about them. Although in a competition the metric is fixed for us, it is still useful to understand in what cases one metric could be preferred to another. In this course, we concentrate on regression and classification, so we will only discuss the related metrics. For a better understanding, for each metric we will also build the simplest baseline we could imagine, the constant model. That is, if we are only allowed to predict the same value for every object, what value is optimal to predict according to the chosen metric?

Let's start with the regression task and its related metrics. In the following videos, we'll talk about metrics for classification. First, let us clarify the notation we're going to use throughout the lesson: N is the number of samples in our training data set, y is the target, and y-hat is our model's predictions. y-hat and y with index i are the prediction and the target value, respectively, for the i-th object.

The first metric we will discuss is Mean Square Error. It is surely the most common metric for regression problems. In data science, people use it when they don't have any specific preferences for the solution to their problem, or when they don't know any other metric. MSE measures the average squared error of our predictions: for each object, we calculate the squared difference between the prediction and the target, and then average those values over all objects.
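In symbols, MSE = (1/N) · Σᵢ (yᵢ − ŷᵢ)². Here is a minimal NumPy sketch of this definition. The toy targets below are an assumption: they are chosen only to match the mean of 11 and median of 8 quoted in the video, since the actual data table is not reproduced in the transcript.

```python
import numpy as np

def mse(y, y_hat):
    """Mean Square Error: average of squared differences."""
    return np.mean((y - y_hat) ** 2)

# Hypothetical toy targets with mean 11 and median 8, as in the video.
y = np.array([2.0, 5.0, 8.0, 13.0, 27.0])

print(mse(y, y))                      # 0.0: perfect predictions score zero
print(mse(y, np.full_like(y, 11.0)))  # 77.2: the constant-mean baseline
```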
Let's introduce a simple data set now. Say we have five objects, each object has some features X, and the target is shown in the column Y. Let's ask ourselves a question: how will the error change if we fix all the predictions but one to be perfect, and vary the value of the remaining one? To answer this question, take a look at this plot. On the horizontal axis, we first put points at the positions of the target values. The points are colored according to the corresponding rows in our data table. And on the Y-axis, we show the mean square error.

So, let's now assume that our predictions for the first four objects are perfect, and let's draw a curve showing how the metric value changes as we change the prediction for the last object. For the MSE metric, it looks like this: if we predict 25, the error is zero, and if we predict anything else, it is greater than zero. The error curve looks like a parabola. Let's now draw analogous curves for the other objects. Right now it's hard to make any conclusions, but we will build the same kind of plot for every metric and note the differences between them.

Now, let's build the simplest baseline model. We will not use the features X at all, and we will always predict a constant value Alpha. But what is the optimal constant? What constant minimizes the mean square error for our data set? The easiest way to find it is to set the derivative of the total error with respect to that constant to zero and solve the resulting equation. What we will find is that the best constant is the mean value of the target column. If you don't know how to derive it, take a look at the reading materials; there is a fine explanation and links to related books.

But let us check it constructively. Once again, on the horizontal axis, let's mark our target values with dots, and draw how the error changes as we change the value of the constant Alpha. We can do it with a simple grid search over a given range, changing Alpha iteratively and recomputing the error. The green square shows the minimum value of our metric. The constant we found is 10.99, which is quite close to the true mean of the target, 11. In fact, the value we got deviates from the true mean only because a grid search gives an approximate answer. Also note that the red curve on the second plot is essentially the average of the curves from the first plot.
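Here is a minimal sketch of that grid search, reusing the assumed toy targets from above; it recovers a constant close to the mean, exactly as described:

```python
import numpy as np

y = np.array([2.0, 5.0, 8.0, 13.0, 27.0])  # assumed toy targets (mean = 11)

# Evaluate the constant prediction Alpha on a fine grid and keep the best one.
alphas = np.linspace(y.min(), y.max(), 1000)
errors = [np.mean((y - a) ** 2) for a in alphas]
best_alpha = alphas[int(np.argmin(errors))]

print(best_alpha)  # ~11.0, approximate because the grid is finite
print(y.mean())    # 11.0, the exact minimizer of MSE
```

The small gap between the two printed values is exactly the grid-resolution effect mentioned in the video.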
We have finished with the MSE metric itself, but there are two more frequently used metrics related to it, RMSE and R-squared, and we will briefly study them now.

RMSE, Root Mean Square Error, is a very similar metric to MSE. It is calculated in two steps: first we calculate the regular mean square error, and then we take the square root of it. The square root is introduced to make the scale of the errors the same as the scale of the targets. For MSE the error is squared, so taking the root of it makes the total error a little easier to comprehend, because it is now on a linear scale.

Now, it is very important to understand in what sense RMSE is similar to MSE, and what the difference is. First, they are similar in terms of their minimizers: every minimizer of MSE is a minimizer of RMSE, and vice versa. More generally, if we have two sets of predictions, A and B, and the MSE of A is greater than the MSE of B, then we can be sure that the RMSE of A is greater than the RMSE of B, and it also works in the opposite direction. This is true only because the square root function is monotonically increasing. What does it mean for us? It means that if our target metric is RMSE, we can still compare our models using MSE, since MSE orders the models in the same way as RMSE, and we can optimize MSE instead of RMSE. In fact, MSE is a little easier to work with, so everybody uses MSE instead of RMSE.

But there is a small difference between the two for gradient-based models. Take a look at the gradient of RMSE with respect to the i-th prediction: it is equal to the gradient of MSE multiplied by some value, and that value does not depend on the index i. It means that traveling along the MSE gradient is equivalent to traveling along the RMSE gradient, but with a different step size, and that step size depends on the MSE score itself, so it is dynamic. Even though RMSE and MSE are really similar in terms of model scoring, they are not immediately interchangeable for gradient-based methods; we will probably need to adjust some parameters like the learning rate.
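For reference, this relationship can be written out explicitly with the chain rule (a standard derivation; the video states the result without showing it):

```latex
\frac{\partial\,\mathrm{RMSE}}{\partial \hat{y}_i}
  \;=\; \frac{\partial \sqrt{\mathrm{MSE}}}{\partial \hat{y}_i}
  \;=\; \frac{1}{2\sqrt{\mathrm{MSE}}}\,\frac{\partial\,\mathrm{MSE}}{\partial \hat{y}_i}
```

The factor 1/(2·√MSE) is the same for every index i, which is exactly why the two gradients point in the same direction but imply a different, dynamically changing step size.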
Now, what if I told you that the MSE of my model's predictions is 32? Should I improve my model, or is it good enough? And what if my MSE were 0.4? Actually, it's hard to tell whether our model is good or not by looking at the absolute values of MSE or RMSE. It really depends on the properties of the data set and its target vector: how much variation there is in the target. We would probably want to measure how much better our model is than the constant baseline, and the desired metric should give us zero if we are no better than the baseline, and one if the predictions are perfect. For that purpose, the R-squared metric is usually used. Take a look: when the MSE of our predictions is zero, R-squared is one, and when our MSE is equal to the MSE of the constant model, R-squared is zero, because the values in the numerator and denominator are the same. All reasonable models will score between zero and one. The most important thing for us is that to optimize R-squared, we can optimize MSE. It is absolutely equivalent, since R-squared is basically the MSE score divided by one constant and subtracted from another constant, and these constants don't matter for optimization.

Let's move on and discuss another metric called Mean Absolute Error, or MAE for short. The error is calculated as an average of the absolute differences between the target values and the predictions. What is important about this metric is that it does not penalize huge errors as badly as MSE does; thus it is not as sensitive to outliers as mean square error. It also has somewhat different applications than MSE. MAE is widely used in finance, where a $10 error is usually exactly two times worse than a $5 error. The MSE metric, on the other hand, considers a $10 error to be four times worse than a $5 error. MAE is easier to justify, and if you used RMSE, it could become really hard to explain to your boss how you evaluated your model.

What constant is optimal for MAE? It's quite easy to show that it is the median of the target values; in this case, it is eight. See the reading materials for a proof. Just to verify that everything is correct, we can again grid search for the optimal value with a simple loop, and in fact the value we find is 7.98, which indicates we were right.
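Here is a minimal sketch of both ideas, again with the assumed toy targets from above: R-squared computed against the constant-mean baseline, and a grid search confirming that the median minimizes MAE:

```python
import numpy as np

y = np.array([2.0, 5.0, 8.0, 13.0, 27.0])  # assumed toy targets (median = 8)

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def r_squared(y, y_hat):
    """1 minus the model's MSE divided by the MSE of the constant-mean baseline."""
    return 1.0 - mse(y, y_hat) / mse(y, np.full_like(y, y.mean()))

def mae(y, y_hat):
    """Mean Absolute Error: average of absolute differences."""
    return np.mean(np.abs(y - y_hat))

print(r_squared(y, y))  # 1.0: perfect predictions

# Grid search for the constant that minimizes MAE, as in the video.
alphas = np.linspace(y.min(), y.max(), 1000)
best_alpha = alphas[int(np.argmin([mae(y, a) for a in alphas]))]

print(best_alpha)    # ~8.0, approximate grid answer
print(np.median(y))  # 8.0, the exact minimizer of MAE
```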
Here we see that MAE is more robust than MSE; that is, it is not as influenced by outliers. In fact, recall that the optimal constant for MSE was about 11, while for MAE it is eight, and eight looks like a much better prediction for the points on the left side, if we assume that the point with target 27 is an outlier and we should not care about the prediction for it.

Another important thing about MAE is its gradient with respect to the predictions. The gradient is a step function: it takes the value -1 when y-hat is smaller than the target, and +1 when it is larger. The gradient is not defined when the prediction is perfect, because when y-hat is equal to y, we cannot evaluate the gradient. So, formally, MAE is not differentiable. But in practice, how often do your predictions match the target perfectly? Even if they do, we can write a simple if condition that returns zero in that case and the true gradient otherwise. Also note that the second derivative is zero everywhere and not defined at the point where the prediction equals the target.

I want to end the discussion with one last note. It has nothing to do with competitions, but every data scientist should understand this. We said that MAE is more robust than MSE, that is, less sensitive to outliers, but that doesn't mean it is always better to use MAE. It is basically a question: are there real outliers in the data set, or are there just, let's say, unexpectedly high values that we should treat like the others? Outliers are usually mistakes, measurement errors, and so on, but at the same time, similar-looking objects can be perfectly natural. So, if you think these unusual objects are normal in the sense that they are just rare, you should not use a metric that will ignore them, and it is better to use MSE. Otherwise, if you think they are really outliers, like mistakes, you should use MAE.
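To make that if-condition concrete, here is a minimal sketch of the MAE subgradient (an assumed implementation; the video describes the idea but shows no code):

```python
import numpy as np

def mae_subgradient(y, y_hat):
    """Per-object subgradient of |y - y_hat| with respect to y_hat.

    np.sign already encodes the if-condition from the video: it returns
    -1 where y_hat < y, +1 where y_hat > y, and 0 at the formally
    non-differentiable points where y_hat == y.
    (For the full MAE, divide by len(y) to account for the averaging.)
    """
    return np.sign(y_hat - y)

print(mae_subgradient(np.array([1.0, 2.0, 3.0]),
                      np.array([0.5, 2.0, 9.0])))  # [-1.  0.  1.]
```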
So, in this video we have discussed several important metrics. We first discussed mean square error and realized that the best constant for it is the mean target value. Root Mean Square Error, RMSE, and R-squared are very similar to MSE from an optimization perspective. We then discussed Mean Absolute Error and when people prefer MAE to MSE. In the next video, we will continue to study regression metrics, and then we will get to classification ones.