In the previous video, we realized that mean encodings cannot be used as is and require some kind of regularization on the training part of the data. Now we'll cover four different methods of regularization, namely: doing a cross-validation loop to construct the mean encodings; smoothing based on the size of the category; adding random noise; and finally, calculating an expanding mean over some permutation of the data. We will go through all of these methods one by one.

Let's start with CV loop regularization. It's a very intuitive and robust method. For a given data point, we don't want to use the target variable of that data point, so we separate the data into K non-intersecting subsets, or in other words, folds. To get the mean encoding value for some subset, we don't use data points from that subset and estimate the encoding only on the rest of the subsets. We iteratively walk through all the data subsets. Usually, four or five folds are enough to get decent results; you don't need to tune this number.

It may seem that we have completely avoided leakage from the target variable. Unfortunately, that's not true. It will become apparent if we apply the leave-one-out scheme to separate the data. I'll return to this a little later, but first let's learn how to apply this method in practice.

Suppose that our training data is in a df_tr data frame. We will add the mean encoded features into another data frame, train_new. In the outer loop, we iterate through a stratified K-fold iterator in order to separate the training data into chunks: X_tr is used to estimate the encoding, and X_val is used to apply the estimated encoding. After that, we iterate through all the columns and map the estimated encodings onto the X_val data frame. At the end of the outer loop, we fill the train_new data frame with the result. Finally, some rare categories may be present only in a single fold, so we don't have the data to estimate the target mean for them. That's why we end up with some NaNs. We can fill them with the global mean. As you can see, the whole process is very simple.
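Here is a minimal sketch of that CV loop in Python, following the walkthrough above. The toy data, the five-fold setup, the random seed, and the _mean_target column suffix are illustrative assumptions, not taken from the video; in practice df_tr is your real training data and cols your categorical feature names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy stand-in for the df_tr data frame from the walkthrough.
df_tr = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Tver', 'Moscow', 'Tver', 'Omsk'] * 5,
    'target': [0, 1, 1, 0, 0, 1] * 5,
})
cols = ['city']  # categorical columns to mean-encode

train_new = df_tr.copy()
global_mean = df_tr['target'].mean()
for col in cols:
    train_new[col + '_mean_target'] = np.nan  # filled fold by fold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
for tr_idx, val_idx in skf.split(df_tr, df_tr['target']):
    X_tr, X_val = df_tr.iloc[tr_idx], df_tr.iloc[val_idx]
    for col in cols:
        # Estimate per-category target means on X_tr only,
        # then map them onto the held-out chunk X_val.
        means = X_val[col].map(X_tr.groupby(col)['target'].mean())
        train_new.loc[means.index, col + '_mean_target'] = means

# Categories absent from the estimation folds leave NaNs; fill with the global mean.
new_cols = [c + '_mean_target' for c in cols]
train_new[new_cols] = train_new[new_cols].fillna(global_mean)
```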
Now, let's return to the question of whether we leak information about the target variable or not. Consider the following example. Here we want to encode Moscow via the leave-one-out scheme. For the first row, we get 0.5, because there are two 1s and two 0s in the rest of the rows. Similarly, for the second row we get 0.25, and so on. But look closely at the resulting feature: it perfectly splits the data. Rows with a feature value greater than or equal to 0.5 have target 0, and the rest of the rows have target 1. We didn't explicitly use the target variable, yet our encoding is biased. Furthermore, this effect remains valid even for the KFold scheme, just milder.

So is this type of regularization useless? Definitely not. In practice, if you have enough data and use four or five folds, the encodings will work fine with this regularization strategy. Just be careful and use correct validation.

Another regularization method is smoothing. It's based on the following idea: if a category is big and has a lot of data points, then we can trust the estimated encoding, but if a category is rare, it's the opposite. The formula on the slide builds on this idea; a typical form is (category_mean * nrows + global_mean * alpha) / (nrows + alpha). It has a hyperparameter alpha that controls the amount of regularization. When alpha is zero, we have no regularization, and when alpha approaches infinity, everything turns into the global mean. In some sense, alpha is equal to the category size we can trust. It's also possible to use some other formula; basically, anything that punishes the encodings of rare categories can be considered smoothing. Smoothing obviously won't work on its own, but we can combine it with, for example, CV loop regularization.

Another way to regularize encodings is to add some noise. Without regularization, mean encodings have better quality for the training data than for the test data, and by adding noise, we simply degrade the quality of the encoding on the training data. This method is pretty unstable, and it's hard to make it work. The main problem is the amount of noise we need to add: too much noise will turn the feature into garbage, while too little noise means worse regularization. This method is usually used together with leave-one-out regularization, and you need to diligently fine-tune it. So it's probably not the best option if you don't have a lot of time.

The last regularization method I'm going to cover is based on the expanding mean. The idea is very simple: we fix some sorting order of our data and use only the rows from zero to n minus one to calculate the encoding for row n. You can check a simple implementation in the code snippet below: cumsum stores the cumulative sum of the target variable up to the given row, and cumcnt stores the cumulative count.
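A minimal sketch of such an expanding mean encoding is shown below; it assumes the same df_tr, target, and cols names as in the CV loop sketch above, with cumsum and cumcnt named after the description in the video, while everything else is an illustrative choice.

```python
import pandas as pd

# Same assumed setup as in the CV loop sketch: df_tr is the training data
# (already in some fixed order or permutation), 'target' is the target column,
# and cols lists the categorical columns to encode.
train_new = df_tr.copy()
global_mean = df_tr['target'].mean()

for col in cols:
    # Cumulative sum of the target over *previous* rows of the same category:
    # the group-wise cumsum includes the current row, so subtract its target.
    cumsum = df_tr.groupby(col)['target'].cumsum() - df_tr['target']
    # Number of previous rows of the same category.
    cumcnt = df_tr.groupby(col).cumcount()
    train_new[col + '_mean_target'] = cumsum / cumcnt

# The first occurrence of each category gives 0 / 0 = NaN; fill with the global mean.
new_cols = [c + '_mean_target' for c in cols]
train_new[new_cols] = train_new[new_cols].fillna(global_mean)
```

To work with different data permutations, as mentioned next, you could shuffle df_tr first, for example with df_tr.sample(frac=1, random_state=...), recompute the encodings, and average the resulting models.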
This method introduces the least amount of leakage from the target variable, and it requires no hyperparameter tuning. The only downside is that the feature quality is not uniform, but it's not a big deal: we can average models built on encodings calculated from different data permutations. It's also worth noting that it is the expanding mean method that is used in CatBoost, a gradient boosting trees library, which has proven to perform magnificently on datasets with categorical features.

Okay, let's summarize what we've discussed in this video. We covered four different types of regularization. Each of them has its own advantages and disadvantages. Sometimes, unintuitively, we introduce target variable leakage, but in practice we can bear with it. Personally, I recommend the CV loop or expanding mean methods for practical tasks; they are the most robust and easy to tune. That's it for regularization. In the next video, I will tell you about various extensions and practical applications of mean encodings. Thank you.