[MUSIC] In the previous video, we realized that mean encodings cannot be used as is and require some kind of regularization on the training part of the data. Now we'll cover four different methods of regularization: first, doing a cross-validation loop to construct the mean encodings; then, smoothing based on the size of the category; then, adding random noise; and finally, calculating an expanding mean on some permutation of the data. We will go through all of these methods one by one.

Let's start with CV loop regularization. It's a very intuitive and robust method. For a given data point, we don't want to use the target variable of that data point, so we separate the data into K non-intersecting subsets, or in other words, folds. To get the mean encoding value for some subset, we don't use data points from that subset and estimate the encoding only on the rest of the data. We iteratively walk through all the data subsets. Usually, four or five folds are enough to get decent results; you don't need to tune this number. It may seem that we have completely avoided leakage from the target variable. Unfortunately, that's not true. It will become apparent if we use a leave-one-out scheme to separate the data. I'll return to this a little later, but first let's learn how to apply this method in practice.

Suppose that our training data is in a df_tr dataframe. We will add the mean-encoded features into another dataframe, train_new. In the outer loop, we iterate through a stratified K-fold iterator in order to separate the training data into chunks: X_tr is used to estimate the encoding, and X_val is used to apply the estimated encoding. After that, we iterate through all the columns and map the estimated encodings onto the X_val dataframe. At the end of the outer loop, we fill the train_new dataframe with the result (a code sketch is given below). Finally, some rare categories may be present only in a single fold, so we don't have the data to estimate the target mean for them. That's why we end up with some NaNs; we can fill them with the global mean. As you can see, the whole process is very simple.

Now, let's return to the question of whether we leak information about the target variable or not. Consider the following example. Here we want to encode Moscow via the leave-one-out scheme. For the first row, we get 0.5, because there are two 1s and two 0s in the rest of the rows. Similarly, for the second row we get 0.25, and so on. But look closely at the resulting feature: it perfectly splits the data. Rows with a feature value greater than or equal to 0.5 have target 0, and the rest of the rows have target 1. We didn't explicitly use the target variable, but our encoding is biased. Furthermore, this effect remains valid even for the K-fold scheme, just milder. So is this type of regularization useless? Definitely not. In practice, if you have enough data and use four or five folds, the encodings will work fine with this regularization strategy. Just be careful and use correct validation.

Another regularization method is smoothing. It's based on the following idea: if a category is big and has a lot of data points, then we can trust the estimated encoding, but if a category is rare, it's the opposite. The formula on the slide uses this idea. It has a hyperparameter alpha that controls the amount of regularization. When alpha is zero, we have no regularization, and when alpha approaches infinity, everything turns into the global mean. In some sense, alpha is equal to the category size we can trust.
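To make the CV loop concrete, here is a minimal sketch of the procedure described above, assuming pandas and scikit-learn. The dataframe df_tr with a 'target' column and the example categorical column 'city' are illustrative assumptions, not the exact course code.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

cat_cols = ['city']                       # example categorical columns; adjust to your data
y_tr = df_tr['target'].values             # df_tr is assumed to be the training DataFrame
train_new = df_tr.copy()
global_mean = df_tr['target'].mean()

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
for tr_idx, val_idx in skf.split(df_tr, y_tr):
    X_tr, X_val = df_tr.iloc[tr_idx], df_tr.iloc[val_idx]
    for col in cat_cols:
        # Estimate the encoding on X_tr only, then map it onto X_val.
        means = X_tr.groupby(col)['target'].mean()
        train_new.loc[X_val.index, col + '_mean_target'] = X_val[col].map(means)

# Rare categories may be missing from the estimation part, leaving NaNs; fill them with the global mean.
enc_cols = [col + '_mean_target' for col in cat_cols]
train_new[enc_cols] = train_new[enc_cols].fillna(global_mean)
```

As for the smoothing formula, one common version that behaves exactly as just described is (mean * nrows + global_mean * alpha) / (nrows + alpha), where nrows is the category size. A minimal sketch with the same assumed names follows; in practice you would combine it with the CV loop rather than compute it on the full training set.

```python
alpha = 10   # hyperparameter: roughly the category size we are willing to trust

for col in cat_cols:
    # 'size' is the category size, 'mean' is the within-category target mean.
    agg = df_tr.groupby(col)['target'].agg(['mean', 'size'])
    smoothed = (agg['mean'] * agg['size'] + global_mean * alpha) / (agg['size'] + alpha)
    train_new[col + '_mean_target_smoothed'] = df_tr[col].map(smoothed)
```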
It's also possible to use some other formula; basically, anything that punishes the encodings of rare categories can be considered smoothing. Smoothing obviously won't work on its own, but we can combine it with, for example, the CV loop regularization.

Another way to regularize the encodings is to add some noise. Without regularization, mean encodings have better quality on the training data than on the test data, and by adding noise we simply degrade the quality of the encoding on the training data. This method is pretty unstable, and it's hard to make it work. The main problem is the amount of noise we need to add: too much noise will turn the feature into garbage, while too little noise means insufficient regularization. This method is usually used together with leave-one-out regularization. You need to diligently fine-tune it, so it's probably not the best option if you don't have a lot of time.

The last regularization method I'm going to cover is based on an expanding mean. The idea is very simple: we fix some sorting order of our data and use only rows 0 to n-1 to calculate the encoding for row n. You can check a simple implementation in the code snippet: cumsum stores the cumulative sum of the target variable up to the given row, and cumcnt stores the cumulative count (a sketch is given below). This method introduces the least amount of leakage from the target variable, and it requires no hyperparameter tuning. The only downside is that the feature quality is not uniform, but that's not a big deal: we can average models built on encodings calculated from different data permutations. It's also worth noting that it is the expanding mean method that is used in CatBoost, a gradient boosted trees library, which proves to perform magnificently on datasets with categorical features.

Okay, let's summarize what we've discussed in this video. We covered four different types of regularization, each with its own advantages and disadvantages. Sometimes, unintuitively, we introduce target variable leakage, but in practice we can live with it. Personally, I recommend the CV loop or expanding mean methods for practical tasks: they are the most robust and easy to tune. That was regularization. In the next video, I will tell you about various extensions and practical applications of mean encodings. Thank you. [MUSIC]
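For reference, here is a minimal sketch of the expanding mean snippet described in the video. It reuses the assumed df_tr and train_new dataframes and the 'city' and 'target' columns from the earlier sketches; only the cumsum and cumcnt names come from the transcript itself.

```python
# Expanding mean encoding: for row n, use only rows 0..n-1 of the same category.
# The row order matters, so a permutation can be fixed beforehand, e.g.:
# df_tr = df_tr.sample(frac=1, random_state=123)

cumsum = df_tr.groupby('city')['target'].cumsum() - df_tr['target']  # sum of previous targets
cumcnt = df_tr.groupby('city').cumcount()                            # count of previous rows

# The first occurrence of each category gives 0 / 0 = NaN; fill it with the global mean.
train_new['city_mean_target_expanding'] = (cumsum / cumcnt).fillna(df_tr['target'].mean())
```

To reduce the non-uniform feature quality mentioned above, this can be repeated for several random permutations of df_tr, training a model on each resulting encoding and averaging the models.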