[MUSIC] Hi, everyone. In this section, we'll cover a very powerful technique: mean encoding. It actually has a number of names. Some call it likelihood encoding, some target encoding, but in this course we'll stick with plain mean encoding. The general idea of this technique is to add new variables based on some feature. In the simplest case, we encode each level of a categorical variable with the corresponding target mean.

Let's take a look at the following example. Here, we have some binary classification task in which we have a categorical variable, some city. And of course, we want to encode it numerically. The most obvious way, and what people usually use, is label encoding; it's what we have in the second column. Mean encoding is done differently: we encode every city with the corresponding mean target. For example, for Moscow we have five rows with three 0s and two 1s, so we encode it with 2 divided by 5, or 0.4. Similarly, we deal with the rest of the cities; pretty straightforward. What I've described here is a very high-level idea. There are a huge number of pitfalls one should overcome in an actual competition. We won't go deep into the details for now, just keep it in mind.

First, let me explain why this even works. Imagine that our dataset is much bigger and contains hundreds of different cities, and let's try to compare, very abstractly of course, mean encoding with label encoding. We plot feature histograms for class 0 and class 1. In the case of label encoding, we always get a totally random picture because there is no logical order, but when we use the mean target to encode the feature, the classes look much more separable; the plot looks kind of sorted. It turns out that this sorting quality of mean encoding is quite helpful.

Remember what the most popular and effective way to solve a machine learning problem is: gradient boosting over decision trees, like XGBoost or LightGBM. One of their few downsides is an inability to handle high-cardinality categorical variables. Trees have limited depth; with mean encoding we can compensate for it, and we can reach a better loss with shorter trees. The cross-validation loss might even look like this. In general, the more complicated and nonlinear the feature-target dependency is, the more effective mean encoding is.

Further in this section, you will learn how to construct mean encodings; there are actually a lot of ways. Also keep in mind that we use a classification task only as an example; we can use mean encodings on other tasks as well, and the main idea remains the same. Despite the simplicity of the idea, you need to be very careful with validation. It has to be impeccable; it's probably the most important part. Understanding correct, leakage-free validation is also a basis for stacking. Last, but not least, are extensions. There are countless possibilities to derive new features from the target variable, and sometimes they produce a significant improvement for your models.

Let's start with some characteristics of data sets that indicate the usefulness of mean encoding. The presence of categorical variables with a lot of levels is already a good indicator, but we need to go a little deeper. Let's take a look at these learning logs from the Springleaf competition. I ran three models with different depths: 7, 9, and 11. Train logs are on the top plot, validation logs on the bottom one. As you can see, with increasing tree depth our training score becomes better and better, nearly perfect, and that's expected. But we don't actually overfit, and that's strange: our validation score also increases. It's a sign that the trees need a huge number of splits to extract information from some variables, and we can verify this in the model dump. It turns out that some features have a tremendous number of split points, like 1200 or 1600, and that's a lot. Our model tries to treat all those categories differently, and they are also very important for predicting the target. We can help our model via mean encodings.

There are a number of ways to calculate the encodings. The first one is the one we've been discussing so far: simply taking the mean of the target variable. Another popular option is to take the natural logarithm of the ratio of ones to zeros; it's called weight of evidence. Or you can simply count the number of ones, or take the difference between the number of ones and the number of zeros. All of these are viable options.

Now, let's actually construct the features. We will do it on the Springleaf dataset. Suppose we've already separated the data into train and validation, the X_tr and X_val data frames. The code snippet shows how to construct a mean encoding for an arbitrary column and map it into new data frames, train_new and val_new. We simply do a groupby on that column and take the mean of the target. The resulting per-category means are then mapped onto the train and validation data sets with the map operator.
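A minimal sketch of this construction, assuming pandas data frames X_tr and X_val that both contain the binary target in a column named 'target' (the names here are only illustrative), might look like this:

```python
import pandas as pd

def mean_encode(X_tr: pd.DataFrame, X_val: pd.DataFrame, col: str,
                target: str = 'target'):
    """Mean-encode one categorical column using train-split statistics only."""
    # Per-category mean of the target, estimated on the train data
    means = X_tr.groupby(col)[target].mean()

    # Map the learned means onto both splits with the map operator;
    # categories never seen in train end up as NaN here
    train_feat = X_tr[col].map(means)
    val_feat = X_val[col].map(means)
    return train_feat, val_feat

# Repeating this for every categorical column and storing the results as new
# columns, e.g. train_new[col + '_mean_target'] and val_new[col + '_mean_target'],
# reproduces the procedure described above.
```

The important detail is that the per-category means are estimated on X_tr only; that is what makes the validation discussed next meaningful.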
After we've repeated this process for every column, we can fit a model on this new data. But something is definitely not right: after only a few training iterations, the training AUC is nearly 1, while on validation the score stays around 0.55, which is practically noise. It's a clear sign of terrible overfitting. I'll explain what happened in a few moments. Right now, I want to point out that at least we validated correctly: we separated train and validation, and used only the train data to estimate the mean encodings. If, for instance, we had estimated the mean encodings before the train-validation split, we would not even have noticed such overfitting.

Now, let's figure out the reason for the overfitting. For rare categories it's pretty common to get a picture like in this example: target 0 in train and target 1 in validation. Mean encoding turns into a perfect feature for such categories, at least on the train split. That's why we immediately get very good scores on train and fail badly on validation.

So far, we've grasped the concept of mean encodings and walked through a trivial example, but obviously we cannot use mean encodings like this in practice. We need to deal with the overfitting first; we need some kind of regularization. I will tell you about different methods in the next video. [MUSIC]
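As a closing illustration of that rare-category leakage, here is a small sketch on made-up toy data (nothing here comes from the Springleaf dataset):

```python
import pandas as pd

# Hypothetical toy data: every city is rare (it appears only once per split),
# and the train and validation targets happen to disagree.
train = pd.DataFrame({'city': ['A', 'B', 'C', 'D'], 'target': [0, 1, 0, 1]})
val   = pd.DataFrame({'city': ['A', 'B', 'C', 'D'], 'target': [1, 0, 1, 0]})

# The mean encoding estimated on train simply memorizes each city's train target.
means = train.groupby('city')['target'].mean()
train['city_mean'] = train['city'].map(means)
val['city_mean'] = val['city'].map(means)

print(train)  # city_mean equals the target exactly -> a "perfect" train feature
print(val)    # city_mean points the wrong way      -> fails on validation
```

This is exactly the memorization effect described above, and it is what the regularization methods in the next video are meant to suppress.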