1 00:00:02,990 --> 00:00:05,183 Hi, everyone. 2 00:00:05,183 --> 00:00:08,535 The main topic of this video is Feature Interactions. 3 00:00:08,535 --> 00:00:12,040 You will learn how to construct them and use them in problem solving. 4 00:00:12,040 --> 00:00:16,405 Additionally, we will discuss how to use them for feature extraction from decision trees. 5 00:00:16,405 --> 00:00:18,100 Let's start with an example. 6 00:00:18,100 --> 00:00:20,160 Suppose that we are building a model to predict 7 00:00:20,160 --> 00:00:23,245 the best advertisement banner to display on a website. 8 00:00:23,245 --> 00:00:27,760 Among the available features, there are two categorical ones that we will concentrate on: 9 00:00:27,760 --> 00:00:30,810 the category of the advertising banner itself and 10 00:00:30,810 --> 00:00:34,150 the category of the site the banner will be shown on. 11 00:00:34,150 --> 00:00:37,603 Certainly, we can use these as two independent features, 12 00:00:37,603 --> 00:00:41,525 but a really important feature is indeed the combination of them. 13 00:00:41,525 --> 00:00:43,770 We can explicitly construct the combination in 14 00:00:43,770 --> 00:00:47,015 order to incorporate our knowledge into a model. 15 00:00:47,015 --> 00:00:52,195 Let's construct a new feature named ad_site that represents the combination. 16 00:00:52,195 --> 00:00:54,551 It will be categorical like the original ones, 17 00:00:54,551 --> 00:01:00,270 but the set of its values will be all possible combinations of the two original values. 18 00:01:00,270 --> 00:01:01,905 From a technical point of view, 19 00:01:01,905 --> 00:01:04,785 there are two ways to construct such an interaction. 20 00:01:04,785 --> 00:01:07,170 Let's look at a simple example. 21 00:01:07,170 --> 00:01:08,700 Suppose our first feature, 22 00:01:08,700 --> 00:01:10,610 f1, has values A or B. 23 00:01:10,610 --> 00:01:13,714 Another feature, f2, has values X, Y, or Z, 24 00:01:13,714 --> 00:01:17,870 and our data set consists of four data points. 25 00:01:17,870 --> 00:01:21,810 The first approach is to concatenate the text values of f1 and f2, 26 00:01:21,810 --> 00:01:25,710 and use the result as a new categorical feature f_join. 27 00:01:25,710 --> 00:01:28,520 We can then apply one-hot encoding to it. 28 00:01:28,520 --> 00:01:30,840 The second approach consists of two steps. 29 00:01:30,840 --> 00:01:35,025 Firstly, apply one-hot encoding to features f1 and f2. 30 00:01:35,025 --> 00:01:38,940 Secondly, construct a new matrix by multiplying each column of 31 00:01:38,940 --> 00:01:43,390 the f1 encoded matrix with each column of the f2 encoded matrix. 32 00:01:43,390 --> 00:01:46,068 It is worth noting that both methods result in 33 00:01:46,068 --> 00:01:49,410 practically the same new feature representations. 34 00:01:49,410 --> 00:01:51,075 In the above example, 35 00:01:51,075 --> 00:01:54,570 we considered interactions between categorical features, 36 00:01:54,570 --> 00:01:58,060 but similar ideas can be applied to real-valued features. 37 00:01:58,060 --> 00:02:01,230 For example, having two real-valued features f1 and f2, 38 00:02:01,230 --> 00:02:07,375 an interaction between them can be obtained by multiplying f1 and f2. 39 00:02:07,375 --> 00:02:11,035 In fact, we are not limited to the multiplication operation. 40 00:02:11,035 --> 00:02:14,070 Any function taking two arguments, like sum, 41 00:02:14,070 --> 00:02:16,735 difference, or division, is okay.
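A minimal sketch of both approaches in Python; the toy values of f1 and f2, the variable names, and the use of pandas are illustrative assumptions rather than code from the lecture:

    import pandas as pd

    # Toy data: four points with categorical features f1 (A/B) and f2 (X/Y/Z)
    df = pd.DataFrame({"f1": ["A", "B", "B", "A"],
                       "f2": ["X", "Z", "Y", "Z"]})

    # Approach 1: concatenate the text values and one-hot encode the result
    df["f_join"] = df["f1"] + "_" + df["f2"]
    onehot_join = pd.get_dummies(df["f_join"], dtype=int)

    # Approach 2: one-hot encode f1 and f2 separately, then multiply every
    # column of the f1 encoding by every column of the f2 encoding
    f1_oh = pd.get_dummies(df["f1"], prefix="f1", dtype=int)
    f2_oh = pd.get_dummies(df["f2"], prefix="f2", dtype=int)
    onehot_pairs = pd.DataFrame({c1 + "*" + c2: f1_oh[c1] * f2_oh[c2]
                                 for c1 in f1_oh.columns
                                 for c2 in f2_oh.columns})

    # Up to all-zero columns for combinations absent from the data, both
    # matrices represent the same interaction; for real-valued features the
    # analogue is simply an element-wise product of the two columns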
42 00:02:16,735 --> 00:02:19,320 Such transformations significantly enlarge 43 00:02:19,320 --> 00:02:22,695 the feature space and make learning easier, 44 00:02:22,695 --> 00:02:26,205 but keep in mind that they make overfitting easier too. 45 00:02:26,205 --> 00:02:29,610 It should be emphasized that for tree-based algorithms such as 46 00:02:29,610 --> 00:02:32,280 random forests or gradient boosted decision trees, 47 00:02:32,280 --> 00:02:35,530 it's difficult to extract such kinds of dependencies on their own. 48 00:02:35,530 --> 00:02:40,265 That's why such explicit transformations are very useful for tree-based methods. 49 00:02:40,265 --> 00:02:42,755 Let's discuss practical details now. 50 00:02:42,755 --> 00:02:47,520 Pairwise feature generation approaches greatly increase the number of features. 51 00:02:47,520 --> 00:02:49,190 If there were N original features, 52 00:02:49,190 --> 00:02:51,150 there will be about N squared interactions. 53 00:02:51,150 --> 00:02:55,240 And there will be even more features if several types of interaction are used. 54 00:02:55,240 --> 00:02:57,550 There are two ways to moderate this: 55 00:02:57,550 --> 00:03:01,100 either do feature selection or dimensionality reduction. 56 00:03:01,100 --> 00:03:03,060 I prefer doing feature selection, since 57 00:03:03,060 --> 00:03:05,615 often only a few interactions, not all of them, 58 00:03:05,615 --> 00:03:09,000 achieve the same quality as all combinations of features. 59 00:03:09,000 --> 00:03:10,830 For each type of interaction, 60 00:03:10,830 --> 00:03:13,555 I construct all pairwise feature interactions, 61 00:03:13,555 --> 00:03:18,150 fit a random forest over them, and select the several most important features. 62 00:03:18,150 --> 00:03:22,265 Because the number of resulting features for each type is relatively small, 63 00:03:22,265 --> 00:03:25,800 it's possible to join them together along with the original features and 64 00:03:25,800 --> 00:03:29,975 use them as input for any machine learning algorithm, usually a tree-based one.
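A rough sketch of this generate-and-select step, assuming real-valued features; the synthetic data, the choice of RandomForestRegressor, and the number of interactions kept are illustrative assumptions:

    import numpy as np
    from itertools import combinations
    from sklearn.ensemble import RandomForestRegressor

    # Synthetic data standing in for a real training set
    rng = np.random.RandomState(0)
    X = rng.rand(500, 10)      # 10 original real-valued features
    y = rng.rand(500)          # target

    # Generate all pairwise products (one possible type of interaction)
    pairs = list(combinations(range(X.shape[1]), 2))
    X_inter = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])

    # Fit a random forest on the interactions and keep only the most important ones
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X_inter, y)
    top = np.argsort(rf.feature_importances_)[::-1][:5]

    # Join the selected interactions with the original features
    X_final = np.hstack([X, X_inter[:, top]])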
65 00:03:29,975 --> 00:03:34,660 During this video, we have examined a method to construct second-order interactions. 66 00:03:34,660 --> 00:03:38,750 But you can similarly produce third-order or higher interactions. 67 00:03:38,750 --> 00:03:42,680 Due to the fact that the number of features grows rapidly with the order, 68 00:03:42,680 --> 00:03:45,225 it becomes difficult to work with them. 69 00:03:45,225 --> 00:03:49,440 Therefore, higher-order interactions are often constructed semi-manually, 70 00:03:49,440 --> 00:03:52,165 and this is an art in some ways. 71 00:03:52,165 --> 00:03:54,690 Additionally, I would like to talk about methods to 72 00:03:54,690 --> 00:03:57,880 construct categorical features from decision trees. 73 00:03:57,880 --> 00:03:59,840 Take a look at a decision tree. 74 00:03:59,840 --> 00:04:03,475 Let's map each leaf to a binary feature. 75 00:04:03,475 --> 00:04:09,215 The index of the object's leaf can be used as a value for a new categorical feature. 76 00:04:09,215 --> 00:04:12,565 If we use not a single tree but an ensemble of them, 77 00:04:12,565 --> 00:04:14,260 for example, a random forest, 78 00:04:14,260 --> 00:04:18,070 then such an operation can be applied to each of the trees. 79 00:04:18,070 --> 00:04:22,270 This is a powerful way to extract high-order interactions. 80 00:04:22,270 --> 00:04:24,895 This technique is quite simple to implement. 81 00:04:24,895 --> 00:04:27,970 Tree-based models from the sklearn library have 82 00:04:27,970 --> 00:04:30,190 an apply method, which takes as 83 00:04:30,190 --> 00:04:33,830 input a feature matrix and returns the corresponding indices of leaves. 84 00:04:33,830 --> 00:04:39,840 xgboost also supports this via the pred_leaf parameter of its predict method. 85 00:04:39,840 --> 00:04:42,730 I suggest you consult the documentation in order to 86 00:04:42,730 --> 00:04:46,420 get more information about these methods and APIs. 87 00:04:46,420 --> 00:04:48,210 At the end of this video, 88 00:04:48,210 --> 00:04:50,250 let me recap the main points. 89 00:04:50,250 --> 00:04:54,960 We examined methods to construct interactions of categorical features. 90 00:04:54,960 --> 00:04:58,135 Also, we extended the approach to real-valued features. 91 00:04:58,135 --> 00:05:00,610 And we have learned how to use trees to extract 92 00:05:00,610 --> 00:05:04,510 high-order interactions. Thank you for your attention.
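For reference, a minimal sketch of the leaf-index extraction described above; the synthetic data, model parameters, and variable names are illustrative assumptions:

    import numpy as np
    import xgboost as xgb
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic binary classification data standing in for a real training set
    rng = np.random.RandomState(0)
    X = rng.rand(300, 8)
    y = (rng.rand(300) > 0.5).astype(int)

    # sklearn: apply() returns, for every object, the index of the leaf it
    # falls into in each tree of the ensemble (shape: n_samples x n_trees)
    rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
    rf_leaves = rf.apply(X)

    # xgboost: pred_leaf=True in predict() returns a similar matrix of leaf indices
    dtrain = xgb.DMatrix(X, label=y)
    booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=20)
    xgb_leaves = booster.predict(dtrain, pred_leaf=True)

    # The leaf indices can then be treated as new categorical features,
    # for example one-hot encoded and fed into a linear model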