[SOUND] In the final video, we will cover various generalizations and extensions of mean encodings: namely, how to do mean encoding in regression and multiclass tasks, how we can apply encoding to domains with many-to-many relations, what features we can build based on the target variable in time series, and finally, how to encode numerical features and interactions of features.

Let's start with regression tasks. They are actually more flexible for feature encoding. Unlike binary classification, where the mean is frankly the only meaningful statistic we can extract from the target variable, in regression tasks we can try a variety of statistics, like the median, percentiles, or the standard deviation of the target variable. We can even calculate some distribution bins. For example, if the target variable is distributed between 1 and 100, we can create 10 bin features: in the first feature, we count how many data points have a target between 1 and 10, in the second between 10 and 20, and so on. Of course, we need to regularize all of these features. In a nutshell, regression tasks are like classification, just more flexible in terms of feature engineering.

Mean encoding for multiclass tasks is also pretty straightforward. For every feature we want to encode, we will have n different encodings, where n is the number of classes. This actually has a non-obvious advantage. Tree models, for example, usually solve a multiclass task in a one-vs-all fashion, so every class has a different model, and when we fit that model, it doesn't have any information about the structure of the other classes because they are merged into one entity. Therefore, together with mean encodings, we introduce some additional information about the structure of the other classes.

Domains with many-to-many relations are usually very complex and require special approaches to create mean encodings. I will give you only a very high-level idea; consider an example: a binary classification task for users based on the apps installed on their smartphones. Each user may have multiple apps, and each app is used by multiple users, hence a many-to-many relation. We want to mean encode apps. The hard part we need to deal with is that a user may have a lot of apps. So let's take a cross product of user and app entities. It will result in a so-called long representation of the data: we will have a row for each user-app pair. Using this table, we can naturally calculate a mean encoding for apps. So now every app is encoded with the target mean, but how do we map it back to users? Every user has a number of apps, so instead of app1, app2, app3, we will now have a vector like 0.1, 0.2, 0.1. That was pretty simple. We can then collect various statistics from those vectors, like mean, minimum, maximum, standard deviation, and so on.
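Here is a minimal pandas sketch of that many-to-many idea. The column names (user_id, app_id, target) and the toy data are made up for illustration; in a real pipeline the app means would be estimated on the training split only and regularized.

```python
import pandas as pd

# Toy long-format data: one row per user-app pair, with the user-level target
# repeated on every row belonging to that user. Names are hypothetical.
pairs = pd.DataFrame({
    "user_id": [101, 101, 101, 102, 102, 103],
    "app_id":  ["a",  "b",  "c",  "a",  "c",  "b"],
    "target":  [1,    1,    1,    0,    0,    1],
})

# Mean encode each app over the long table.
app_mean = pairs.groupby("app_id")["target"].mean()
pairs["app_enc"] = pairs["app_id"].map(app_mean)

# Map back to users: every user now has a vector of app encodings,
# which we summarize with simple statistics.
user_feats = pairs.groupby("user_id")["app_enc"].agg(["mean", "min", "max", "std"])
print(user_feats)
```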
So far we have assumed that our data has no inner structure, but with time series we obviously cannot use future information. On one hand, this is a limitation; on the other hand, it actually allows us to make some complicated features. In data sets without a time component, when encoding a category we are forced to use all the rows to calculate the statistic; it makes no sense to choose some subset of rows. The presence of time changes that. For a given category we can, for example, calculate the mean from the previous day, the previous two days, the previous week, and so on. Consider an example: we need to predict in which categories users spend money. In this toy example, we have a period of two days, two users, and three spending categories. Some good features would be the total amount of money a user spent on the previous day, and the average amount of money spent by all users in a given category. So, on day 1, user 101 spends $6 and user 102 spends $3; therefore, we fill in those numbers as feature values for day 2. Similarly with the average amount by category. The more data we have, the more complicated features we can create.

In practice, it is often beneficial to mean encode numeric features and some combinations of features. To encode a numeric feature, we only need to bin it and then treat it as categorical. Now, we need to answer two questions: first, how to bin a numeric feature, and second, how to select useful combinations of features. Well, we can find this out from the model structure by analyzing the trees. So at first, we take, for example, a tree-based model and raw features without any encodings. Let's start with numeric features. If a numeric feature has a lot of split points, it means that it has some complicated dependency with the target, and it's worth trying to mean encode it. Furthermore, these exact split points may be used to bin the feature. So by analyzing the model structure, we both identify a suspicious numeric feature and find a good way to bin it.

It's going to be a little harder with selecting interactions, but nothing extraordinary. First, let's define how to extract a two-way interaction from a decision tree; the process is similar for three-way, four-way, or arbitrary-order interactions. Two features interact in a tree if they appear in two neighbouring nodes. With that in mind, we can iterate through all the trees in the model and count how many times each feature interaction appears. The most frequent interactions are probably worth mean encoding. For example, if we find that the feature1-feature2 pair is the most frequent, we can concatenate those feature values in our data and mean encode the resulting interaction.

Now let me illustrate how important interaction encoding may be. The Amazon Employee Access Challenge competition has a very specific data set: there are only nine categorical features. If we blindly fit, say, a LightGBM model on the raw features, then no matter how we tune the parameters, we'll score in the 0.87 AUC range, which would place us roughly at the 700th position on the leaderboard. Furthermore, even if we mean encode all of the features, we won't make any progress. But if we fit a CatBoost model, which internally mean encodes some feature interactions, we will immediately score in the 0.91 range, which will place us near the top of the leaderboard. The difference in both absolute AUC values and relative leaderboard positions is tremendous. Also note that CatBoost is no silver bullet: in order to get even higher on the leaderboard, we would still need to manually add more mean-encoded interactions. In general, if you participate in a competition with a lot of categorical variables, it's always worth trying to work with interactions and mean encodings.

I also want to remind you about the correct validation process. During all local experiments, you should first split the data into X_tr and X_val parts, estimate the encodings on X_tr, map them to X_tr and X_val, then regularize them on X_tr, and only after that validate your model on the X_tr / X_val split. Don't even think about estimating encodings before splitting the data. At the submission stage, you can estimate the encodings on the whole train data, map them to train and test, then apply regularization on the training data, and finally fit the model.
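The following is a minimal sketch of that validation workflow, assuming a binary target in a pandas DataFrame; the column names, the mean_encode helper, and the smoothing parameter alpha are made up for illustration, and smoothing is just one possible regularization (a CV loop or added noise would also work).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def mean_encode(train, valid, col, target, alpha=10):
    """Estimate a smoothed mean encoding on `train` only, then map it to both splits."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "size"])
    # Smooth category means towards the global mean (regularization).
    smoothed = (stats["mean"] * stats["size"] + global_mean * alpha) / (stats["size"] + alpha)
    enc = col + "_enc"
    train[enc] = train[col].map(smoothed)
    valid[enc] = valid[col].map(smoothed).fillna(global_mean)  # unseen categories
    return train, valid

# Placeholder data and names; split first, estimate encodings only on X_tr.
df = pd.DataFrame({"category": list("aabbbcc"), "target": [1, 0, 1, 1, 0, 0, 1]})
X_tr, X_val = train_test_split(df, test_size=0.3, random_state=42)
X_tr, X_val = mean_encode(X_tr.copy(), X_val.copy(), "category", "target")
```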
And note that you should have already decided on the regularization method and its strength during local experiments. At the end of this section, let's summarize the main advantages and disadvantages of mean encodings. First of all, mean encoding allows us to make a compact transformation of categorical variables, and it is also a powerful basis for feature engineering. The main disadvantage is target variable leakage: we need to be very careful with validation and regularization. It also works only on specific data sets; it definitely won't help in every competition. But keep in mind that when this method works, it may produce significant improvements. Thank you for your attention. [MUSIC]