Hi, everyone. In this section, we'll cover a very powerful technique: mean encoding. It actually has a number of names. Some call it likelihood encoding, some target encoding, but in this course we'll stick with plain mean encoding.

The general idea of this technique is to add new variables based on some feature. In the simplest case, we encode each level of a categorical variable with the corresponding target mean.

Let's take a look at the following example. Here we have some binary classification task in which we have a categorical variable, some city, and of course we want to encode it numerically. The most obvious way, and what people usually use, is label encoding; that's what we have in the second column. Mean encoding is done differently: we encode every city with the corresponding mean target. For example, for Moscow we have five rows with three 0s and two 1s, so we encode it with 2 divided by 5, or 0.4. Similarly, we deal with the rest of the cities; pretty straightforward.

What I've described here is a very high-level idea. There are a huge number of pitfalls one has to overcome in an actual competition. We won't go deep into the details for now; just keep it in mind.

First, let me explain why it even works. Imagine that our dataset is much bigger and contains hundreds of different cities. Let's try to compare, very abstractly, mean encoding with label encoding. We plot feature histograms for class 0 and class 1. In the case of label encoding, we'll always get a totally random picture, because there's no logical order. But when we use the mean target to encode the feature, the classes look way more separable; the plot looks kind of sorted.

It turns out that this sorting quality of mean encoding is quite helpful. Remember what the most popular and effective way to solve a machine learning problem is: gradient boosting over trees, for example XGBoost or LightGBM. One of its few downsides is an inability to handle high-cardinality categorical variables. Trees have limited depth; with mean encoding we can compensate for it and reach a better loss with shorter trees.
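Before moving on, here is a minimal sketch in pandas that makes the city example concrete. Only the Moscow counts (three 0s and two 1s) come from the example above; the other cities and their target values are made up for illustration.

import pandas as pd

# Toy data: Moscow appears in five rows with three 0s and two 1s,
# as in the example; the other cities are made-up placeholders.
df = pd.DataFrame({
    'city':   ['Moscow', 'Moscow', 'Moscow', 'Moscow', 'Moscow', 'Tver', 'Tver', 'Klin'],
    'target': [0, 0, 0, 1, 1, 1, 1, 0],
})

# Label encoding: an arbitrary integer per level, unrelated to the target.
df['city_label'] = df['city'].astype('category').cat.codes

# Mean encoding: each level is replaced by the mean target over that level.
city_means = df.groupby('city')['target'].mean()
df['city_mean_enc'] = df['city'].map(city_means)

print(city_means)  # Moscow -> 2 / 5 = 0.4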
The cross-validation loss might even look like this. In general, the more complicated and nonlinear the feature-target dependency is, the more effective mean encoding becomes.

Further in this section, you will learn how to construct mean encodings; there are actually a lot of ways. Also keep in mind that we use a classification task only as an example; we can use mean encoding on other tasks as well, and the main idea remains the same.

Despite the simplicity of the idea, you need to be very careful with validation. It has to be impeccable; it's probably the most important part. Understanding correct, leak-less validation is also a basis for stacking. Last, but not least, are extensions. There are countless possibilities to derive new features from the target variable, and sometimes they produce a significant improvement for your models.

Let's start with some characteristics of data sets that indicate the usefulness of mean encoding. The presence of categorical variables with a lot of levels is already a good indicator, but we need to go a little deeper. Let's take a look at these learning logs from the Springleaf competition. I ran three models with different depths: 7, 9, and 11. Train logs are on the top plot, validation logs on the bottom one. As you can see, with increasing tree depth our training score becomes better and better, nearly perfect, and that's expected. But we don't actually overfit, and that's unusual: our validation score also increases. It's a sign that the trees need a huge number of splits to extract information from some variables, and we can check this in the model dump. It turns out that some features have a tremendous number of split points, like 1,200 or 1,600, and that's a lot. Our model tries to treat all those categories differently, and they are also very important for predicting the target. We can help our model via mean encodings.

There are a number of ways to calculate the encodings. The first one is the one we've been discussing so far: simply taking the mean of the target variable. Another popular option is weight of evidence, which takes the natural logarithm of the ratio of ones to zeros. Or you can simply count the number of ones, or take the difference between the number of ones and the number of zeros. All of these are viable options.
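A rough sketch of these four options for a binary target is given below. The helper name and the exact weight-of-evidence formulation (a plain log-odds ratio with a small constant to avoid division by zero) are assumptions for illustration, not taken from the slide.

import numpy as np
import pandas as pd

def target_statistics(df, col, target='target', eps=1e-6):
    # Per-level statistics for a binary target; any of them can be used
    # to encode the levels of `col`.
    grouped = df.groupby(col)[target]
    n_ones = grouped.sum()                 # count of ones per level
    n_zeros = grouped.count() - n_ones     # count of zeros per level
    return pd.DataFrame({
        'mean_target':        grouped.mean(),
        'weight_of_evidence': np.log((n_ones + eps) / (n_zeros + eps)),
        'count_of_ones':      n_ones,
        'ones_minus_zeros':   n_ones - n_zeros,
    })

Each column of the resulting frame is indexed by the levels of the categorical feature, so it can be mapped back onto the data with map, just like the plain mean.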
Now, let's actually construct the features. We will do it on the Springleaf data set. Suppose we've already separated the data into train and validation, the X_tr and X_val data frames. The code snippet on the slide shows how to construct a mean encoding for an arbitrary column and map it into new data frames, train_new and val_new (a sketch of such a snippet is given at the end of this section). We simply do a groupby on that column and take the mean of the target. The resulting means are indexed by the levels of that column, so they can be mapped onto the train and validation data sets with the map operator. After we've repeated this process for every column, we can fit a model on this new data.

But something is definitely not right: after several iterations, the training AUC is nearly 1, while on validation the score saturates around 0.55, which is practically noise. It's a clear sign of terrible overfitting. I'll explain what happened in a few moments. Right now, I want to point out that at least we validated correctly: we separated train and validation, and used the train data to estimate the mean encodings. If, for instance, we had estimated the mean encodings before the train-validation split, we would not have noticed such overfitting.

Now, let's figure out the reason for the overfitting. For rare categories, it's pretty common to get results like in the example: target 0 in train and target 1 in validation. Mean encoding then turns into a perfect feature for such categories on the train set. That's why we immediately get very good scores on train and fail hard on validation.

So far, we've grasped the concept of mean encodings and walked through some trivial examples. Obviously, we cannot use mean encodings like this in practice; we need to deal with the overfitting first, we need some kind of regularization. I will tell you about different regularization methods in the next video.
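For reference, here is a minimal sketch consistent with the snippet described above. The slide's exact code is not reproduced in the transcript, so the function name, the list of columns, and the global-mean fallback for levels unseen in train are assumptions.

import pandas as pd

def add_mean_encodings(X_tr, X_val, cols, target='target'):
    # Mean-encode every column in `cols`, estimating the means on X_tr only.
    train_new, val_new = X_tr.copy(), X_val.copy()
    global_mean = X_tr[target].mean()
    for col in cols:
        # Series of per-level target means, indexed by the levels of `col`.
        means = X_tr.groupby(col)[target].mean()
        train_new[col + '_mean_target'] = X_tr[col].map(means)
        # Levels unseen in train get the global mean as a simple fallback
        # (an assumption, not taken from the slide).
        val_new[col + '_mean_target'] = X_val[col].map(means).fillna(global_mean)
    return train_new, val_new

As the lecture points out, encodings built this way still overfit badly for rare categories; the regularization techniques covered in the next video address exactly that.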