1 00:00:03,310 --> 00:00:06,950 Hi, everyone. In this video, 2 00:00:06,950 --> 00:00:09,515 I will tell you how I and my teammates, 3 00:00:09,515 --> 00:00:15,045 Stanislav Smirnov solved Kaggle Expedia hotel recommendations competition. 4 00:00:15,045 --> 00:00:17,405 Personally, one of my favorites, 5 00:00:17,405 --> 00:00:23,750 probably among top five most interesting competitions I've ever participated in. 6 00:00:23,750 --> 00:00:25,420 I'll state the problem now. 7 00:00:25,420 --> 00:00:29,045 So, if you came here right after Data Leaks lesson, 8 00:00:29,045 --> 00:00:31,940 it should already be familiar to you. 9 00:00:31,940 --> 00:00:34,700 Anyway, in that competition, 10 00:00:34,700 --> 00:00:37,740 we worked with lots of customer behavior. 11 00:00:37,740 --> 00:00:40,925 These include what customers searched for, 12 00:00:40,925 --> 00:00:45,005 how they interacted with search results, clicks or books, 13 00:00:45,005 --> 00:00:48,200 and whether or not the search result was a travel package, 14 00:00:48,200 --> 00:00:55,790 and Expedia was interested in predicting which hotel group a user is going to book. 15 00:00:55,790 --> 00:01:00,810 Important thing here is prediction target the hotel group. 16 00:01:00,810 --> 00:01:06,315 In other words, characteristics of actual hotel, remember it. 17 00:01:06,315 --> 00:01:09,890 As it turned out, this competition had a very 18 00:01:09,890 --> 00:01:14,225 non-trivial and extremely hard to exploit data leak. 19 00:01:14,225 --> 00:01:15,980 From the first glance, 20 00:01:15,980 --> 00:01:18,575 data leak was pretty straightforward. 21 00:01:18,575 --> 00:01:22,430 We had a destination distance among the feature. 22 00:01:22,430 --> 00:01:28,140 It's a distance from user city to an actual hotel he clicked on booked. 23 00:01:28,140 --> 00:01:29,815 And, as I said earlier, 24 00:01:29,815 --> 00:01:34,025 our prediction target is a characteristic of an actual hotel. 25 00:01:34,025 --> 00:01:36,840 Furthermore, destination distance was 26 00:01:36,840 --> 00:01:41,845 very precise so unique user city and destination distance pairs 27 00:01:41,845 --> 00:01:44,695 corresponded to unique hotels. 28 00:01:44,695 --> 00:01:47,455 Putting two and two together, 29 00:01:47,455 --> 00:01:54,305 we can treat user city and destination distance pair as a proxy to our target. 30 00:01:54,305 --> 00:01:59,995 When in this set, we encountered such pair already present in train set, 31 00:01:59,995 --> 00:02:03,745 we could simply take a label from there as our prediction. 32 00:02:03,745 --> 00:02:08,845 It worked nearly perfect for the pairs present in both train and test. 33 00:02:08,845 --> 00:02:15,935 However, nearly half of test set consisted from new pairs without a match from train set. 34 00:02:15,935 --> 00:02:18,145 This way we had to go deeper. 35 00:02:18,145 --> 00:02:21,425 But, how exactly can we improve our solution? 36 00:02:21,425 --> 00:02:24,165 Well, there are two different ways. 37 00:02:24,165 --> 00:02:27,560 First, one is to create current features on 38 00:02:27,560 --> 00:02:32,495 corteges similar to user city and destination distance pair. 39 00:02:32,495 --> 00:02:38,620 For example, like how many hotels of which group there are for user city, 40 00:02:38,620 --> 00:02:41,540 hotel country, hotel city triplet. 41 00:02:41,540 --> 00:02:46,685 Then, we could train some machine learning model on such features. 42 00:02:46,685 --> 00:02:51,295 Another way is to somehow find more matches. 43 00:02:51,295 --> 00:02:58,320 For that purpose, we need to find true coordinates of users cities and hotel cities. 44 00:02:58,320 --> 00:03:02,055 From that, to guess it was destination distance feature, 45 00:03:02,055 --> 00:03:08,130 it was possible to find good approximation for the coordinates of actual hotels. 46 00:03:08,130 --> 00:03:10,275 Let's find out how to do it. 47 00:03:10,275 --> 00:03:15,770 First of all, we need to understand how to calculate the distance. 48 00:03:15,770 --> 00:03:22,865 Here, we work with geographical coordinates so the distances are geodesic. 49 00:03:22,865 --> 00:03:26,015 It's done via Haversine formula, 50 00:03:26,015 --> 00:03:28,224 not a pleasant one. 51 00:03:28,224 --> 00:03:32,565 Now, suppose that we know true coordinates of 52 00:03:32,565 --> 00:03:39,620 three points and distances from fourth point with unknown coordinates to each of them, 53 00:03:39,620 --> 00:03:42,740 if you write down a system of three equations, 54 00:03:42,740 --> 00:03:44,540 one for each distance, 55 00:03:44,540 --> 00:03:50,320 we can unambiguously solve it and get true coordinates for the fourth point. 56 00:03:50,320 --> 00:03:54,692 Now, we have four points with known coordinates. 57 00:03:54,692 --> 00:03:56,525 I think you get the idea. 58 00:03:56,525 --> 00:04:00,205 So, at first, by hook or by crook, 59 00:04:00,205 --> 00:04:04,325 we reverse engineer true coordinate of three big cities. 60 00:04:04,325 --> 00:04:09,855 After that, we can iteratively find coordinates of more and more cities. 61 00:04:09,855 --> 00:04:12,010 But as you can see from the picture, 62 00:04:12,010 --> 00:04:15,080 some cities ended up in oceans. 63 00:04:15,080 --> 00:04:18,820 It means that our algorithm is not very precise. 64 00:04:18,820 --> 00:04:25,790 A rounding error accumulates after every iteration and everything starts to fall apart. 65 00:04:25,790 --> 00:04:30,220 We get some different method and indeed we can do better. 66 00:04:30,220 --> 00:04:33,555 Just compare this picture with the previous one. 67 00:04:33,555 --> 00:04:36,510 It's obviously much more accurate. 68 00:04:36,510 --> 00:04:39,910 Remember how in iterative method we solved a system of 69 00:04:39,910 --> 00:04:45,605 three equations to unambiguously find coordinates or fourth unknown point. 70 00:04:45,605 --> 00:04:48,865 But why limit ourselves with three equations? 71 00:04:48,865 --> 00:04:51,790 Let's create a giant system of equations from 72 00:04:51,790 --> 00:04:57,290 all known distances with true coordinates being done on variables. 73 00:04:57,290 --> 00:05:03,120 We end up with literally hundreds or thousands of equations and tens of thousands 74 00:05:03,120 --> 00:05:05,120 of unknown variables. 75 00:05:05,120 --> 00:05:07,415 Good thing it's very sparse. 76 00:05:07,415 --> 00:05:13,210 We can apply special methods from SciPy to efficiently solve such a system. 77 00:05:13,210 --> 00:05:17,685 In the end, after solving that system of equations, 78 00:05:17,685 --> 00:05:24,960 we end up with a very precise coordinates for both hotel cities and user cities. 79 00:05:24,960 --> 00:05:26,190 But as you remember, 80 00:05:26,190 --> 00:05:29,235 we're predicting a type of a hotel. 81 00:05:29,235 --> 00:05:32,960 Using city coordinates and destination distance, 82 00:05:32,960 --> 00:05:38,525 it's possible to find an approximation of true coordinates of an actual hotel. 83 00:05:38,525 --> 00:05:42,175 When we fix user city and draw a circumference 84 00:05:42,175 --> 00:05:46,185 around it with the radius of destination distance, 85 00:05:46,185 --> 00:05:52,795 it's obvious that true hotel location must be somewhere on that circumference. 86 00:05:52,795 --> 00:05:58,895 Now, let's fix some hotel city and draw such circumferences from 87 00:05:58,895 --> 00:06:00,410 all users cities to 88 00:06:00,410 --> 00:06:06,505 that fixed hotel cities and draw them for every given destination distance. 89 00:06:06,505 --> 00:06:12,260 After doing so, we end up with pictures like the ones on the slide. 90 00:06:12,260 --> 00:06:18,215 A city contains a limited number of hotels so the intuition here is that hotels 91 00:06:18,215 --> 00:06:21,765 actually are on the intersection points 92 00:06:21,765 --> 00:06:25,760 and the more circumferences intersect in such point, 93 00:06:25,760 --> 00:06:30,335 the higher the probability of a hotel being in that point. 94 00:06:30,335 --> 00:06:35,470 As you can see, the pictures are beautiful but pretty messy. 95 00:06:35,470 --> 00:06:39,860 It's impossible to operate in terms of singular points. 96 00:06:39,860 --> 00:06:46,895 However, there are explicit clusters of points and this information can be of use. 97 00:06:46,895 --> 00:06:50,700 We can do some kind of integration. 98 00:06:50,700 --> 00:06:55,670 For every city, let's create a grid around its center. 99 00:06:55,670 --> 00:07:02,480 Something like 10 kilometers times 10 kilometers with step size of 100 meters. 100 00:07:02,480 --> 00:07:05,465 Now, using training data, 101 00:07:05,465 --> 00:07:07,700 for every cell in the grid, 102 00:07:07,700 --> 00:07:12,835 we can count how many hotels of which type are present there. 103 00:07:12,835 --> 00:07:15,845 If a circumference goes through a cell, 104 00:07:15,845 --> 00:07:20,705 we give plus one to the hotel type corresponding to that circumference. 105 00:07:20,705 --> 00:07:27,735 During inference, we also draw a circumference based on destination distance feature. 106 00:07:27,735 --> 00:07:33,020 We see from what degree its cells it went through and use information 107 00:07:33,020 --> 00:07:37,950 from those cells to create features like a sum of all counters, 108 00:07:37,950 --> 00:07:39,430 average of all counters, 109 00:07:39,430 --> 00:07:42,995 maximum of all counters and so on. 110 00:07:42,995 --> 00:07:48,220 Great. We have covered the part of feature engineering. 111 00:07:48,220 --> 00:07:51,640 Note that all the features directly used target label. 112 00:07:51,640 --> 00:07:56,795 We cannot use them as is in training. 113 00:07:56,795 --> 00:08:01,005 We should generate them in out-of-fold fashion for train data. 114 00:08:01,005 --> 00:08:07,865 So we had training data for years 2013 and 2014. 115 00:08:07,865 --> 00:08:11,365 To generate features for year 2014, 116 00:08:11,365 --> 00:08:16,935 we used labelled data from year 2013 and vice versa, 117 00:08:16,935 --> 00:08:22,785 used the year 2014 to generate features for the year 2013. 118 00:08:22,785 --> 00:08:25,355 For the test features, 119 00:08:25,355 --> 00:08:28,360 which was from year 2015, 120 00:08:28,360 --> 00:08:31,590 we naturally used all training data. 121 00:08:31,590 --> 00:08:39,070 In the end, we calculated a lot of features and serve them into Xgboost model. 122 00:08:39,070 --> 00:08:42,555 After 16 hours of training for the course, 123 00:08:42,555 --> 00:08:44,466 we got our results. 124 00:08:44,466 --> 00:08:49,510 We ended up on third position on public leader-boards and forth on private. 125 00:08:49,510 --> 00:08:54,595 We did good, but we still did not fully exploit data leakage. 126 00:08:54,595 --> 00:08:56,130 If you check the leaderboard, 127 00:08:56,130 --> 00:09:00,525 you'll notice the difference in scores between first place and the rest. 128 00:09:00,525 --> 00:09:04,995 Under speculation, the winner did extraordinary. 129 00:09:04,995 --> 00:09:09,950 Although, in general, his methods were very similar to ours. 130 00:09:09,950 --> 00:09:13,675 He was able to extract way more signal. 131 00:09:13,675 --> 00:09:18,190 Finally, I hope you enjoyed my story. 132 00:09:18,190 --> 00:09:20,740 As you can see, sometimes working with 133 00:09:20,740 --> 00:09:24,285 data leakage could be very interesting and challenging. 134 00:09:24,285 --> 00:09:28,865 You may develop some unusual skills and broaden your horizons. 135 00:09:28,865 --> 00:09:31,410 Thank you for your attention.