Hi everyone. This video is dedicated to the following advanced feature engineering techniques: calculating various statistics of one feature grouped by another, and features derived from neighborhood analysis of a given point.

To make it a little bit clearer, let's consider a simple example. Here we have a chunk of data for some CTR task. Let's forget about the target variable and focus on the remaining features: User_ID, a unique identifier of a user; Page_ID, an identifier of a page the user visited; Ad_price, the price of the item in the ad; and Ad_position, the relative position of an ad on the web page.

The most straightforward way to solve this problem is to label encode Ad_position and feed the data to some classifier. A sufficiently good classifier could take into account all the hidden relations between variables, but no matter how good it is, it still treats all the data points independently. And this is where we can apply feature engineering. We can assume that the ad with the lowest price on the page will catch most of the attention, and that the rest of the ads on the page won't be very attractive. It's pretty easy to calculate features relevant to such an assumption: we can add the lowest and highest prices for every user and page pair. The position of the ad with the lowest price could also be of use in such a case.

Here's one of the ways to implement such statistical features with pandas. If our data is stored in the data frame df, we call the groupby method to get the maximum and minimum price values, store the resulting object in the gb variable, and then join it back to the data frame df. That's it; a sketch of this step is shown below.

I want to emphasize that you should not stop at this point. It's possible to add other useful features, not necessarily calculated within a user and page pair. It could be how many pages a user has visited, how many pages a user has visited during a given session, the ID of the most visited page, how many users have visited that page, and many, many more features (a sketch of a few such counters also follows below). The main idea is to introduce new information. By that means, we can drastically increase the quality of our models.
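The slide code itself isn't part of the transcript, so here is a minimal sketch of the groupby-and-join step just described, assuming the column names from the example above; the toy data is made up.

```python
import pandas as pd

# Toy data with the columns from the example (values are made up).
df = pd.DataFrame({
    'User_ID':     [1, 1, 1, 2, 2],
    'Page_ID':     [10, 10, 11, 10, 12],
    'Ad_price':    [100.0, 80.0, 120.0, 95.0, 60.0],
    'Ad_position': [1, 2, 1, 3, 1],
})

# Lowest and highest ad price within every (user, page) pair.
gb = (df.groupby(['User_ID', 'Page_ID'])['Ad_price']
        .agg(min_price='min', max_price='max')
        .reset_index())

# Join the per-group statistics back onto the original rows.
df = df.merge(gb, on=['User_ID', 'Page_ID'], how='left')

# Position of the cheapest ad within each (user, page) pair.
idx = df.groupby(['User_ID', 'Page_ID'])['Ad_price'].idxmin()
pos = (df.loc[idx, ['User_ID', 'Page_ID', 'Ad_position']]
         .rename(columns={'Ad_position': 'min_price_position'}))
df = df.merge(pos, on=['User_ID', 'Page_ID'], how='left')
```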
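And here is one possible way to compute a few of the counters mentioned above, again assuming the same data frame df; the Session_ID column is hypothetical and only shown for illustration.

```python
# How many distinct pages each user has visited,
# and how many distinct users have visited each page.
df['user_page_count'] = df.groupby('User_ID')['Page_ID'].transform('nunique')
df['page_user_count'] = df.groupby('Page_ID')['User_ID'].transform('nunique')

# ID of the page each user visited most often.
most_visited = (df.groupby('User_ID')['Page_ID']
                  .agg(lambda s: s.value_counts().idxmax())
                  .rename('most_visited_page'))
df = df.join(most_visited, on='User_ID')

# Per-session counter, if a (hypothetical) Session_ID column is available:
# df['session_page_count'] = (df.groupby(['User_ID', 'Session_ID'])['Page_ID']
#                               .transform('nunique'))
```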
But what if there are no features to use groupby on? Well, in such a case, we can replace grouping operations with finding the nearest neighbors. On the one hand, it's much harder to implement and to collect useful information. On the other hand, the method is more flexible: we can fine-tune things like the size of the relevant neighborhood or the distance metric.

The most common and natural example of neighborhood analysis arises from price prediction. Imagine that you need to predict rental prices. You would probably have some characteristics like floor space, number of rooms, and the presence of a bus stop. But you need something more than that to create a really good model. It could be the number of other houses in neighborhoods of different sizes, say within 500, 1,000, or 1,500 meters; the average price per square meter in such neighborhoods; or the number of schools, supermarkets, and parking lots in such neighborhoods. The distances to the closest objects of interest, like subway stations or gyms, could also be of use. I think you've got the idea (a sketch of such geographical features follows at the end of this transcript).

In this example, we've used a very simple case, where neighborhoods were calculated in geographical space. But don't be afraid to apply this method to some abstract or even anonymized feature space; it still could be very useful.

My team and I used this method in the Springleaf competition. Furthermore, we did it in a supervised fashion. Here is how we did it. First of all, we applied mean encoding to all variables. By doing so, we created a homogeneous feature space, so we did not have to worry about scaling and the importance of each particular feature. After that, we calculated the 2,000 nearest neighbors with the Bray-Curtis metric. Then we evaluated various features from those neighbors, like the mean target of the nearest 5, 10, 15, 500, and 2,000 neighbors; the mean distance to the 10 closest neighbors; the mean distance to the 10 closest neighbors with target 1; and the mean distance to the 10 closest neighbors with target 0 (a sketch of these neighbor features also follows below). And it worked great.

In conclusion, I hope you've embraced the main ideas of both the groupby and nearest neighbor methods and will be able to apply them in practice. Thank you for your attention.
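A minimal sketch of the geographical neighborhood features from the rental price example, assuming we have latitude/longitude and price per square meter for every known house; all names and values here are illustrative.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Hypothetical inputs: coordinates in degrees and price per square meter.
coords_deg = np.array([[55.751, 37.618], [55.760, 37.620], [55.790, 37.530]])
price_per_m2 = np.array([250_000.0, 310_000.0, 180_000.0])

# BallTree with the haversine metric expects coordinates in radians.
tree = BallTree(np.radians(coords_deg), metric='haversine')

EARTH_RADIUS_M = 6_371_000.0
features = {}
for radius_m in (500, 1_000, 1_500):
    idx = tree.query_radius(np.radians(coords_deg), r=radius_m / EARTH_RADIUS_M)
    # Each result includes the query point itself, hence the "- 1".
    features[f'n_houses_{radius_m}m'] = np.array([len(i) - 1 for i in idx])
    # Average price per square meter in the neighborhood.
    features[f'mean_price_{radius_m}m'] = np.array(
        [price_per_m2[i].mean() for i in idx])
```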
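And a sketch of the supervised nearest neighbor features in the spirit of the Springleaf approach. This is not the team's actual code: the mean encoding step is not shown, random data stands in for the mean-encoded matrix X and binary target y, and a small k replaces the 2,000 neighbors to keep the example fast. Note that features built from the training target should be computed out-of-fold in practice to avoid target leakage.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X, y = rng.random((100, 5)), rng.integers(0, 2, 100)

# Nearest neighbors under the Bray-Curtis metric; ask for k + 1 because
# each point is its own nearest neighbor when querying the training set.
k = 20
nn = NearestNeighbors(n_neighbors=k + 1, metric='braycurtis').fit(X)
dist, idx = nn.kneighbors(X)
dist, idx = dist[:, 1:], idx[:, 1:]  # drop each point itself

feats = {}

# Mean target of the nearest m neighbors.
for m in (5, 10, 15):
    feats[f'mean_target_{m}'] = y[idx[:, :m]].mean(axis=1)

# Mean distance to the 10 closest neighbors overall.
feats['mean_dist_10'] = dist[:, :10].mean(axis=1)

# Mean distance to the 10 closest neighbors with a given target value
# (NaNs sort to the end; nanmean ignores missing ones if a class is rare).
for t in (0, 1):
    d = np.where(y[idx] == t, dist, np.nan)
    d_sorted = np.sort(d, axis=1)[:, :10]
    feats[f'mean_dist_10_target_{t}'] = np.nanmean(d_sorted, axis=1)
```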