In the final video, we will cover various generalizations and extensions of mean encodings: namely, how to do mean encoding in regression and multiclass tasks, how to apply encoding to domains with many-to-many relations, what features we can build based on the target variable in time series, and finally, how to encode numerical features and interactions of features.

Let's start with regression tasks. They are actually more flexible for feature encoding. Unlike binary classification, where the mean is frankly the only meaningful statistic we can extract from the target variable, in regression tasks we can try a variety of statistics, like the median, percentiles, or the standard deviation of the target variable. We can even calculate some distribution bins. For example, if the target variable is distributed between 1 and 100, we can create 10 bin features: in the first feature, we count how many data points have target between 1 and 10, in the second between 10 and 20, and so on. Of course, we need to regularize all of these features. In a nutshell, regression tasks are like classification, just more flexible in terms of feature engineering.

Mean encoding for multiclass tasks is also pretty straightforward. For every feature we want to encode, we will have n different encodings, where n is the number of classes. This actually has a non-obvious advantage. Tree models, for example, usually solve multiclass tasks in a one-versus-all fashion, so every class has a separate model, and when we fit that model, it doesn't have any information about the structure of the other classes, because they are merged into one entity. Therefore, together with mean encodings, we introduce some additional information about the structure of the other classes.
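To make these two cases concrete, here are two short sketches. First, the regression statistics and bin features in pandas, with made-up column names; regularization is omitted for brevity:

```python
import numpy as np
import pandas as pd

# Toy frame: one categorical feature and a numeric target in [1, 100].
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'category': rng.choice(['a', 'b', 'c'], size=1000),
    'target': rng.uniform(1, 100, size=1000),
})

# Per-category statistics of the target: median, 10th percentile, standard deviation.
stats = df.groupby('category')['target'].agg(
    enc_median='median',
    enc_p10=lambda s: s.quantile(0.10),
    enc_std='std',
)

# Ten equal-width distribution bins over [1, 100]:
# for each category, count how many targets fall into each bin.
df['target_bin'] = pd.cut(df['target'], bins=np.linspace(1, 100, 11), include_lowest=True)
bin_counts = (df.groupby(['category', 'target_bin'], observed=False)
                .size().unstack(fill_value=0))
bin_counts.columns = [f'enc_bin_{i}' for i in range(bin_counts.shape[1])]

# Map all encodings back onto the original rows.
df = df.join(stats, on='category').join(bin_counts, on='category')
```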
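Second, the multiclass case: a single categorical feature yields n encoded columns, one per class. The function and column names here are just for illustration and, again, these encodings would need regularization in practice:

```python
import pandas as pd

def multiclass_mean_encode(df, feature, target):
    """Return df with n extra columns for an n-class target:
    the share of each class among rows with the same feature value."""
    # One-hot the target, then take per-category means of each class column.
    one_hot = pd.get_dummies(df[target], prefix=f'{feature}_enc')
    class_means = one_hot.groupby(df[feature]).mean()
    # Map the n per-class means back onto the original rows.
    return df.join(class_means, on=feature)
```

A tree model trained one-versus-all can then see, through these n columns, how each category relates to every class rather than only to the merged "rest".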
Domains with many-to-many relations are usually very complex and require special approaches to create mean encodings. I will give you only a very high-level idea. Consider an example: a binary classification task for users, based on the apps installed on their smartphones. Each user may have multiple apps, and each app is used by multiple users; hence, a many-to-many relation. We want to mean encode apps. The hard part we need to deal with is that a user may have a lot of apps. So let's take a cross product of user and app entities. It results in a so-called long representation of the data: we will have a row for each user-app pair. Using this table, we can naturally calculate a mean encoding for apps. So now every app is encoded with the target mean, but how do we map it back to users? Every user has a number of apps, so instead of app1, app2, app3, we will now have a vector like 0.1, 0.2, 0.1. That was pretty simple. We can then collect various statistics from those vectors, like the mean, minimum, maximum, standard deviation, and so on.

So far we assumed that our data has no inner structure, but with time series we obviously cannot use future information. On one hand, it's a limitation; on the other hand, it actually allows us to make some complicated features. In datasets without a time component, when encoding a category, we are forced to use all the rows to calculate the statistic; it makes no sense to choose some subset of rows. The presence of time changes that. For a given category, we can, for example, calculate the mean from the previous day, the previous two days, the previous week, etc. Consider an example: we need to predict in which categories users spend money. In this toy example, we have a period of two days, two users, and three spending categories. Some good features would be the total amount of money a user spent on the previous day, and the average amount of money spent by all users in a given category. So, on day 1, user 101 spends $6 and user 102 spends $3; therefore, we fill in those numbers as feature values for day 2. Similarly with the average amount by category. The more data we have, the more complicated features we can create.
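Again, two short sketches. First, the user-app example: build the long table, mean encode apps on it, and reduce each user's vector of app encodings to a few statistics. IDs and target values are made up, and regularization is again omitted:

```python
import pandas as pd

# Hypothetical long representation: one row per (user, app) pair,
# with the user's binary target repeated on every row.
long_df = pd.DataFrame({
    'user_id': [101, 101, 101, 102, 102],
    'app_id':  ['a1', 'a2', 'a3', 'a1', 'a3'],
    'target':  [1, 1, 1, 0, 0],
})

# Mean encode apps on the long table.
app_enc = long_df.groupby('app_id')['target'].mean()
long_df['app_enc'] = long_df['app_id'].map(app_enc)

# Collapse every user's vector of app encodings into fixed-size statistics.
user_features = long_df.groupby('user_id')['app_enc'].agg(['mean', 'min', 'max', 'std'])
```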
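Second, the two-day spending example: previous-day totals per user and previous-day averages per category become features for the next day. The numbers below are made up, and the day-shift trick is just one simple way to avoid using future information:

```python
import pandas as pd

# Hypothetical spending log: two days, two users, three categories.
spend = pd.DataFrame({
    'day':      [1, 1, 1, 2, 2, 2],
    'user_id':  [101, 101, 102, 101, 102, 102],
    'category': ['food', 'games', 'food', 'food', 'games', 'food'],
    'amount':   [2.0, 4.0, 3.0, 5.0, 1.0, 2.0],
})

# Total spent by each user per day, shifted forward by one day so that
# rows of day d only see information from day d - 1.
user_prev = (spend.groupby(['user_id', 'day'])['amount'].sum()
                  .rename('user_prev_day_total').reset_index())
user_prev['day'] += 1
spend = spend.merge(user_prev, on=['user_id', 'day'], how='left')

# Average amount spent per category on the previous day, over all users.
cat_prev = (spend.groupby(['category', 'day'])['amount'].mean()
                 .rename('cat_prev_day_mean').reset_index())
cat_prev['day'] += 1
spend = spend.merge(cat_prev, on=['category', 'day'], how='left')
```

Rows of the first day simply get missing values here, since there is no earlier information for them.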
In practice, it is often beneficial to mean encode numeric features and some combinations of features. To encode a numeric feature, we only need to bin it and then treat it as categorical. Now we need to answer two questions: first, how to bin a numeric feature, and second, how to select useful combinations of features. Well, we can find both out from the model structure, by analyzing the trees. So at first, we fit, for example, a [INAUDIBLE] model on raw features without any encodings.

Let's start with numeric features. If a numeric feature has a lot of split points, it means that it has some complicated dependency with the target, and it's worth trying to mean encode it. Furthermore, these exact split points may be used to bin the feature. So by analyzing the model structure, we both identify a suspicious numeric feature and find a good way to bin it.

It's going to be a little harder with selecting interactions, but nothing extraordinary. First, let's define how to extract a two-way interaction from a decision tree; the process is similar for three-way, four-way, and arbitrary-order interactions. Two features interact in a tree if they appear in two neighbouring nodes. With that in mind, we can iterate through all the trees in the model and calculate how many times each feature interaction appeared. The most frequent interactions are probably worth mean encoding. For example, if we find that the pair of feature one and feature two is the most frequent, we can concatenate those feature values in our data and mean encode the resulting interaction.

Now let me illustrate how important interaction encoding may be. The Amazon Employee Access Challenge competition has a very specific dataset: there are only nine categorical features. If we blindly fit, say, a GBM model on the raw features, then no matter how we tune the parameters, we'll score in the 0.87 AUC range, which will place us roughly at the 700th position on the leaderboard. Furthermore, even if we mean encode all the categorical features, we won't make any progress. But if we fit a CatBoost model, which internally mean encodes some feature interactions, we will immediately score in the 0.91 range, which will place us near the top of the leaderboard. The difference in both absolute AUC values and relative leaderboard positions is tremendous. Also note that CatBoost is no silver bullet: in order to get even higher on the leaderboard, we would still need to manually add more mean encoded interactions. In general, if you participate in a competition with a lot of categorical variables, it's always worth trying to work with interactions and mean encodings.

I also want to remind you about the correct validation process. During all local experiments, you should first split the data into X_tr and X_val parts, estimate the encodings on X_tr, map them to both X_tr and X_val, then regularize them on X_tr, and only after that validate your model on the X_tr / X_val split. Don't even think about estimating the encodings before splitting the data. At the submission stage, you can estimate the encodings on the whole train data, map them to train and test, then apply regularization on the training data, and finally fit the model. And note that you should have already decided on the regularization method and its strength during local experiments.
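Before moving on to the summary, let me make a few of these recipes concrete with short sketches. First, the split-point binning idea, assuming X is a NumPy matrix and feature_idx is the column we suspect of a complicated dependency with the target; a single sklearn tree stands in here for whatever tree model was fit on the raw features:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def split_point_bins(X, y, feature_idx, max_depth=6):
    """Fit a tree on raw features and reuse its thresholds on one numeric
    feature as bin edges, so the feature can be treated as categorical
    and then mean encoded."""
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, y)
    t = tree.tree_
    # Thresholds of the internal nodes that split on the chosen feature.
    edges = np.unique(t.threshold[t.feature == feature_idx])
    # Bin the feature with those edges; many edges hints the feature is worth encoding.
    return np.digitize(X[:, feature_idx], edges)
```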
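Second, the interaction encoding itself: once a frequent pair of interacting features has been identified, we concatenate the two columns and mean encode the result. Names below are placeholders, and regularization is again left out for brevity:

```python
import pandas as pd

def mean_encode_interaction(train, test, f1, f2, target):
    """Concatenate two categorical columns and mean encode the interaction."""
    new_col = f'{f1}_{f2}'
    for df in (train, test):
        df[new_col] = df[f1].astype(str) + '_' + df[f2].astype(str)
    # Estimate the encoding on train only; unseen pairs fall back to the global mean.
    enc = train.groupby(new_col)[target].mean()
    global_mean = train[target].mean()
    train[new_col + '_enc'] = train[new_col].map(enc)
    test[new_col + '_enc'] = test[new_col].map(enc).fillna(global_mean)
    return train, test
```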
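And finally, the validation recipe just described, as a minimal sketch. It assumes a pandas DataFrame train with a single categorical column 'cat' and a binary 'target'; the regularization step is shown with simple noise purely as an illustration of "regularize on X_tr only":

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 0. Split first; never estimate encodings before splitting the data.
X_tr, X_val = train_test_split(train, test_size=0.2, random_state=42)
X_tr, X_val = X_tr.copy(), X_val.copy()

# 1. Estimate the encoding on X_tr only.
means = X_tr.groupby('cat')['target'].mean()
global_mean = X_tr['target'].mean()

# 2. Map it to both parts; unseen categories fall back to the global mean.
X_tr['cat_enc'] = X_tr['cat'].map(means).fillna(global_mean)
X_val['cat_enc'] = X_val['cat'].map(means).fillna(global_mean)

# 3. Regularize on X_tr only (noise here is just one of the options).
X_tr['cat_enc'] += np.random.normal(0, 0.01, size=len(X_tr))

# 4. Only after that, fit the model on X_tr and validate on X_val.
```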
At the end of this section, let's summarize the main advantages and disadvantages of mean encodings. First of all, mean encoding allows us to make a compact transformation of categorical variables; it is also a powerful basis for feature engineering. The main disadvantage is target variable leakage: we need to be very careful with validation and regularization. It also works only on specific datasets; it definitely won't help in every competition. But keep in mind that when this method works, it may produce significant improvements. Thank you for your attention.