1 00:00:02,990 --> 00:00:05,183 Hi, everyone. 2 00:00:05,183 --> 00:00:08,535 The main topic of this video is Feature Interactions. 3 00:00:08,535 --> 00:00:12,040 You will learn how to construct them and use them in problem solving. 4 00:00:12,040 --> 00:00:16,405 Additionally, we will discuss how to use them for feature extraction from decision trees. 5 00:00:16,405 --> 00:00:18,100 Let's start with an example. 6 00:00:18,100 --> 00:00:20,160 Suppose that we are building a model to predict 7 00:00:20,160 --> 00:00:23,245 the best advertisement banner to display on a website. 8 00:00:23,245 --> 00:00:27,760 Among the available features, there are two categorical ones that we will concentrate on: 9 00:00:27,760 --> 00:00:30,810 the category of the advertising banner itself and 10 00:00:30,810 --> 00:00:34,150 the category of the site the banner will be shown on. 11 00:00:34,150 --> 00:00:37,603 Certainly, we can use these as two independent features, 12 00:00:37,603 --> 00:00:41,525 but a really important feature is indeed the combination of them. 13 00:00:41,525 --> 00:00:43,770 We can explicitly construct the combination in 14 00:00:43,770 --> 00:00:47,015 order to incorporate our knowledge into a model. 15 00:00:47,015 --> 00:00:52,195 Let's construct a new feature named ad_site that represents the combination. 16 00:00:52,195 --> 00:00:54,551 It will be categorical like the original ones, 17 00:00:54,551 --> 00:01:00,270 but the set of its values will be all possible combinations of the two original values. 18 00:01:00,270 --> 00:01:01,905 From a technical point of view, 19 00:01:01,905 --> 00:01:04,785 there are two ways to construct such an interaction. 20 00:01:04,785 --> 00:01:07,170 Let's look at a simple example. 21 00:01:07,170 --> 00:01:08,700 Suppose our first feature, 22 00:01:08,700 --> 00:01:10,610 f1, has values A or B. 23 00:01:10,610 --> 00:01:13,714 Another feature, f2, has values X, Y, or Z, 24 00:01:13,714 --> 00:01:17,870 and our data set consists of four data points. 25 00:01:17,870 --> 00:01:21,810 The first approach is to concatenate the text values of f1 and f2, 26 00:01:21,810 --> 00:01:25,710 and use the result as a new categorical feature f_join. 27 00:01:25,710 --> 00:01:28,520 We can then apply one-hot encoding to it. 28 00:01:28,520 --> 00:01:30,840 The second approach consists of two steps. 29 00:01:30,840 --> 00:01:35,025 Firstly, apply one-hot encoding to features f1 and f2. 30 00:01:35,025 --> 00:01:38,940 Secondly, construct a new matrix by multiplying each column of 31 00:01:38,940 --> 00:01:43,390 the f1 encoded matrix with each column of the f2 encoded matrix. 32 00:01:43,390 --> 00:01:46,068 It is worth noting that both methods result in 33 00:01:46,068 --> 00:01:49,410 practically the same new feature representations. 34 00:01:49,410 --> 00:01:51,075 In the above example, 35 00:01:51,075 --> 00:01:54,570 we considered interactions between categorical features, 36 00:01:54,570 --> 00:01:58,060 but similar ideas can be applied to real-valued features. 37 00:01:58,060 --> 00:02:01,230 For example, having two real-valued features f1 and f2, 38 00:02:01,230 --> 00:02:07,375 an interaction between them can be obtained by multiplying f1 and f2. 39 00:02:07,375 --> 00:02:11,035 In fact, we are not limited to the multiplication operation. 40 00:02:11,035 --> 00:02:14,070 Any function taking two arguments, like sum, 41 00:02:14,070 --> 00:02:16,735 difference, or division, is okay.
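A minimal sketch of both approaches in Python; the toy values of f1 and f2, the variable names, and the use of pandas are illustrative assumptions rather than code from the lecture:

    import pandas as pd

    # Toy data: four points with categorical features f1 (A/B) and f2 (X/Y/Z)
    df = pd.DataFrame({"f1": ["A", "B", "B", "A"],
                       "f2": ["X", "Z", "Y", "Z"]})

    # Approach 1: concatenate the text values and one-hot encode the result
    df["f_join"] = df["f1"] + "_" + df["f2"]
    onehot_join = pd.get_dummies(df["f_join"], dtype=int)

    # Approach 2: one-hot encode f1 and f2 separately, then multiply every
    # column of the f1 encoding by every column of the f2 encoding
    f1_oh = pd.get_dummies(df["f1"], prefix="f1", dtype=int)
    f2_oh = pd.get_dummies(df["f2"], prefix="f2", dtype=int)
    onehot_pairs = pd.DataFrame({c1 + "*" + c2: f1_oh[c1] * f2_oh[c2]
                                 for c1 in f1_oh.columns
                                 for c2 in f2_oh.columns})

    # Up to all-zero columns for combinations absent from the data, both
    # matrices represent the same interaction; for real-valued features the
    # analogue is simply an element-wise product of the two columns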
42 00:02:16,735 --> 00:02:19,320 Such transformations significantly enlarge 43 00:02:19,320 --> 00:02:22,695 the feature space and make learning easier, 44 00:02:22,695 --> 00:02:26,205 but keep in mind that they make overfitting easier too. 45 00:02:26,205 --> 00:02:29,610 It should be emphasized that for tree-based algorithms such as 46 00:02:29,610 --> 00:02:32,280 random forests or gradient boosted decision trees, 47 00:02:32,280 --> 00:02:35,530 it's difficult to extract such kinds of dependencies on their own. 48 00:02:35,530 --> 00:02:40,265 That's why such explicit transformations are very useful for tree-based methods. 49 00:02:40,265 --> 00:02:42,755 Let's discuss practical details now. 50 00:02:42,755 --> 00:02:47,520 Pairwise feature generation approaches greatly increase the number of features. 51 00:02:47,520 --> 00:02:49,190 If there were N original features, 52 00:02:49,190 --> 00:02:51,150 there will be about N squared interactions. 53 00:02:51,150 --> 00:02:55,240 And there will be even more features if several types of interaction are used. 54 00:02:55,240 --> 00:02:57,550 There are two ways to moderate this: 55 00:02:57,550 --> 00:03:01,100 either do feature selection or dimensionality reduction. 56 00:03:01,100 --> 00:03:03,060 I prefer doing feature selection, since 57 00:03:03,060 --> 00:03:05,615 often only a few interactions, not all of them, 58 00:03:05,615 --> 00:03:09,000 achieve the same quality as all combinations of features. 59 00:03:09,000 --> 00:03:10,830 For each type of interaction, 60 00:03:10,830 --> 00:03:13,555 I construct all pairwise feature interactions, 61 00:03:13,555 --> 00:03:18,150 fit a random forest over them, and select the several most important features. 62 00:03:18,150 --> 00:03:22,265 Because the number of resulting features for each type is relatively small, 63 00:03:22,265 --> 00:03:25,800 it's possible to join them together along with the original features and 64 00:03:25,800 --> 00:03:29,975 use them as input for any machine learning algorithm, usually a tree-based one.
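A rough sketch of this generate-and-select step, assuming real-valued features; the synthetic data, the choice of RandomForestRegressor, and the number of interactions kept are illustrative assumptions:

    import numpy as np
    from itertools import combinations
    from sklearn.ensemble import RandomForestRegressor

    # Synthetic data standing in for a real training set
    rng = np.random.RandomState(0)
    X = rng.rand(500, 10)      # 10 original real-valued features
    y = rng.rand(500)          # target

    # Generate all pairwise products (one possible type of interaction)
    pairs = list(combinations(range(X.shape[1]), 2))
    X_inter = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])

    # Fit a random forest on the interactions and keep only the most important ones
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X_inter, y)
    top = np.argsort(rf.feature_importances_)[::-1][:5]

    # Join the selected interactions with the original features
    X_final = np.hstack([X, X_inter[:, top]])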
65 00:03:29,975 --> 00:03:34,660 During this video, we have examined a method to construct second-order interactions. 66 00:03:34,660 --> 00:03:38,750 But you can similarly produce third-order or higher interactions. 67 00:03:38,750 --> 00:03:42,680 Due to the fact that the number of features grows rapidly with the order, 68 00:03:42,680 --> 00:03:45,225 it becomes difficult to work with them. 69 00:03:45,225 --> 00:03:49,440 Therefore, higher-order interactions are often constructed semi-manually, 70 00:03:49,440 --> 00:03:52,165 and this is an art in some ways. 71 00:03:52,165 --> 00:03:54,690 Additionally, I would like to talk about methods to 72 00:03:54,690 --> 00:03:57,880 construct categorical features from decision trees. 73 00:03:57,880 --> 00:03:59,840 Take a look at a decision tree. 74 00:03:59,840 --> 00:04:03,475 Let's map each leaf to a binary feature. 75 00:04:03,475 --> 00:04:09,215 The index of the object's leaf can be used as a value for a new categorical feature. 76 00:04:09,215 --> 00:04:12,565 If we use not a single tree but an ensemble of them, 77 00:04:12,565 --> 00:04:14,260 for example, a random forest, 78 00:04:14,260 --> 00:04:18,070 then such an operation can be applied to each of the trees. 79 00:04:18,070 --> 00:04:22,270 This is a powerful way to extract high-order interactions. 80 00:04:22,270 --> 00:04:24,895 This technique is quite simple to implement. 81 00:04:24,895 --> 00:04:27,970 Tree-based models from the sklearn library have 82 00:04:27,970 --> 00:04:30,190 an apply method, which takes as 83 00:04:30,190 --> 00:04:33,830 input a feature matrix and returns the corresponding indices of leaves. 84 00:04:33,830 --> 00:04:39,840 xgboost also supports this via the pred_leaf parameter of its predict method. 85 00:04:39,840 --> 00:04:42,730 I suggest you consult the documentation in order to 86 00:04:42,730 --> 00:04:46,420 get more information about these methods and APIs. 87 00:04:46,420 --> 00:04:48,210 At the end of this video, 88 00:04:48,210 --> 00:04:50,250 let me recap the main points. 89 00:04:50,250 --> 00:04:54,960 We examined methods to construct interactions of categorical features. 90 00:04:54,960 --> 00:04:58,135 Also, we extended the approach to real-valued features. 91 00:04:58,135 --> 00:05:00,610 And we have learned how to use trees to extract 92 00:05:00,610 --> 00:05:04,510 high-order interactions. Thank you for your attention.
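For reference, a minimal sketch of the leaf-index extraction described above; the synthetic data, model parameters, and variable names are illustrative assumptions:

    import numpy as np
    import xgboost as xgb
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic binary classification data standing in for a real training set
    rng = np.random.RandomState(0)
    X = rng.rand(300, 8)
    y = (rng.rand(300) > 0.5).astype(int)

    # sklearn: apply() returns, for every object, the index of the leaf it
    # falls into in each tree of the ensemble (shape: n_samples x n_trees)
    rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
    rf_leaves = rf.apply(X)

    # xgboost: pred_leaf=True in predict() returns a similar matrix of leaf indices
    dtrain = xgb.DMatrix(X, label=y)
    booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=20)
    xgb_leaves = booster.predict(dtrain, pred_leaf=True)

    # The leaf indices can then be treated as new categorical features,
    # for example one-hot encoded and fed into a linear model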