1
00:00:03,310 --> 00:00:06,950
Hi, everyone. In this video,

2
00:00:06,950 --> 00:00:09,515
I will tell you how I and my teammates,

3
00:00:09,515 --> 00:00:15,045
Stanislav Smirnov solved Kaggle Expedia hotel recommendations competition.

4
00:00:15,045 --> 00:00:17,405
Personally, one of my favorites,

5
00:00:17,405 --> 00:00:23,750
probably among top five most interesting competitions I've ever participated in.

6
00:00:23,750 --> 00:00:25,420
I'll state the problem now.

7
00:00:25,420 --> 00:00:29,045
So, if you came here right after Data Leaks lesson,

8
00:00:29,045 --> 00:00:31,940
it should already be familiar to you.

9
00:00:31,940 --> 00:00:34,700
Anyway, in that competition,

10
00:00:34,700 --> 00:00:37,740
we worked with lots of customer behavior.

11
00:00:37,740 --> 00:00:40,925
These include what customers searched for,

12
00:00:40,925 --> 00:00:45,005
how they interacted with search results, clicks or books,

13
00:00:45,005 --> 00:00:48,200
and whether or not the search result was a travel package,

14
00:00:48,200 --> 00:00:55,790
and Expedia was interested in predicting which hotel group a user is going to book.

15
00:00:55,790 --> 00:01:00,810
Important thing here is prediction target the hotel group.

16
00:01:00,810 --> 00:01:06,315
In other words, characteristics of actual hotel, remember it.

17
00:01:06,315 --> 00:01:09,890
As it turned out, this competition had a very

18
00:01:09,890 --> 00:01:14,225
non-trivial and extremely hard to exploit data leak.

19
00:01:14,225 --> 00:01:15,980
From the first glance,

20
00:01:15,980 --> 00:01:18,575
data leak was pretty straightforward.

21
00:01:18,575 --> 00:01:22,430
We had a destination distance among the feature.

22
00:01:22,430 --> 00:01:28,140
It's a distance from user city to an actual hotel he clicked on booked.

23
00:01:28,140 --> 00:01:29,815
And, as I said earlier,

24
00:01:29,815 --> 00:01:34,025
our prediction target is a characteristic of an actual hotel.

25
00:01:34,025 --> 00:01:36,840
Furthermore, destination distance was

26
00:01:36,840 --> 00:01:41,845
very precise so unique user city and destination distance pairs

27
00:01:41,845 --> 00:01:44,695
corresponded to unique hotels.

28
00:01:44,695 --> 00:01:47,455
Putting two and two together,

29
00:01:47,455 --> 00:01:54,305
we can treat user city and destination distance pair as a proxy to our target.

30
00:01:54,305 --> 00:01:59,995
When in this set, we encountered such pair already present in train set,

31
00:01:59,995 --> 00:02:03,745
we could simply take a label from there as our prediction.

32
00:02:03,745 --> 00:02:08,845
It worked nearly perfect for the pairs present in both train and test.

33
00:02:08,845 --> 00:02:15,935
However, nearly half of test set consisted from new pairs without a match from train set.

34
00:02:15,935 --> 00:02:18,145
This way we had to go deeper.

35
00:02:18,145 --> 00:02:21,425
But, how exactly can we improve our solution?

36
00:02:21,425 --> 00:02:24,165
Well, there are two different ways.

37
00:02:24,165 --> 00:02:27,560
First, one is to create current features on

38
00:02:27,560 --> 00:02:32,495
corteges similar to user city and destination distance pair.

39
00:02:32,495 --> 00:02:38,620
For example, like how many hotels of which group there are for user city,

40
00:02:38,620 --> 00:02:41,540
hotel country, hotel city triplet.

41
00:02:41,540 --> 00:02:46,685
Then, we could train some machine learning model on such features.

42
00:02:46,685 --> 00:02:51,295
Another way is to somehow find more matches.

43
00:02:51,295 --> 00:02:58,320
For that purpose, we need to find true coordinates of users cities and hotel cities.

44
00:02:58,320 --> 00:03:02,055
From that, to guess it was destination distance feature,

45
00:03:02,055 --> 00:03:08,130
it was possible to find good approximation for the coordinates of actual hotels.

46
00:03:08,130 --> 00:03:10,275
Let's find out how to do it.

47
00:03:10,275 --> 00:03:15,770
First of all, we need to understand how to calculate the distance.

48
00:03:15,770 --> 00:03:22,865
Here, we work with geographical coordinates so the distances are geodesic.

49
00:03:22,865 --> 00:03:26,015
It's done via Haversine formula,

50
00:03:26,015 --> 00:03:28,224
not a pleasant one.

51
00:03:28,224 --> 00:03:32,565
Now, suppose that we know true coordinates of

52
00:03:32,565 --> 00:03:39,620
three points and distances from fourth point with unknown coordinates to each of them,

53
00:03:39,620 --> 00:03:42,740
if you write down a system of three equations,

54
00:03:42,740 --> 00:03:44,540
one for each distance,

55
00:03:44,540 --> 00:03:50,320
we can unambiguously solve it and get true coordinates for the fourth point.

56
00:03:50,320 --> 00:03:54,692
Now, we have four points with known coordinates.

57
00:03:54,692 --> 00:03:56,525
I think you get the idea.

58
00:03:56,525 --> 00:04:00,205
So, at first, by hook or by crook,

59
00:04:00,205 --> 00:04:04,325
we reverse engineer true coordinate of three big cities.

60
00:04:04,325 --> 00:04:09,855
After that, we can iteratively find coordinates of more and more cities.

61
00:04:09,855 --> 00:04:12,010
But as you can see from the picture,

62
00:04:12,010 --> 00:04:15,080
some cities ended up in oceans.

63
00:04:15,080 --> 00:04:18,820
It means that our algorithm is not very precise.

64
00:04:18,820 --> 00:04:25,790
A rounding error accumulates after every iteration and everything starts to fall apart.

65
00:04:25,790 --> 00:04:30,220
We get some different method and indeed we can do better.

66
00:04:30,220 --> 00:04:33,555
Just compare this picture with the previous one.

67
00:04:33,555 --> 00:04:36,510
It's obviously much more accurate.

68
00:04:36,510 --> 00:04:39,910
Remember how in iterative method we solved a system of

69
00:04:39,910 --> 00:04:45,605
three equations to unambiguously find coordinates or fourth unknown point.

70
00:04:45,605 --> 00:04:48,865
But why limit ourselves with three equations?

71
00:04:48,865 --> 00:04:51,790
Let's create a giant system of equations from

72
00:04:51,790 --> 00:04:57,290
all known distances with true coordinates being done on variables.

73
00:04:57,290 --> 00:05:03,120
We end up with literally hundreds or thousands of equations and tens of thousands

74
00:05:03,120 --> 00:05:05,120
of unknown variables.

75
00:05:05,120 --> 00:05:07,415
Good thing it's very sparse.

76
00:05:07,415 --> 00:05:13,210
We can apply special methods from SciPy to efficiently solve such a system.

77
00:05:13,210 --> 00:05:17,685
In the end, after solving that system of equations,

78
00:05:17,685 --> 00:05:24,960
we end up with a very precise coordinates for both hotel cities and user cities.

79
00:05:24,960 --> 00:05:26,190
But as you remember,

80
00:05:26,190 --> 00:05:29,235
we're predicting a type of a hotel.

81
00:05:29,235 --> 00:05:32,960
Using city coordinates and destination distance,

82
00:05:32,960 --> 00:05:38,525
it's possible to find an approximation of true coordinates of an actual hotel.

83
00:05:38,525 --> 00:05:42,175
When we fix user city and draw a circumference

84
00:05:42,175 --> 00:05:46,185
around it with the radius of destination distance,

85
00:05:46,185 --> 00:05:52,795
it's obvious that true hotel location must be somewhere on that circumference.

86
00:05:52,795 --> 00:05:58,895
Now, let's fix some hotel city and draw such circumferences from

87
00:05:58,895 --> 00:06:00,410
all users cities to

88
00:06:00,410 --> 00:06:06,505
that fixed hotel cities and draw them for every given destination distance.

89
00:06:06,505 --> 00:06:12,260
After doing so, we end up with pictures like the ones on the slide.

90
00:06:12,260 --> 00:06:18,215
A city contains a limited number of hotels so the intuition here is that hotels

91
00:06:18,215 --> 00:06:21,765
actually are on the intersection points

92
00:06:21,765 --> 00:06:25,760
and the more circumferences intersect in such point,

93
00:06:25,760 --> 00:06:30,335
the higher the probability of a hotel being in that point.

94
00:06:30,335 --> 00:06:35,470
As you can see, the pictures are beautiful but pretty messy.

95
00:06:35,470 --> 00:06:39,860
It's impossible to operate in terms of singular points.

96
00:06:39,860 --> 00:06:46,895
However, there are explicit clusters of points and this information can be of use.

97
00:06:46,895 --> 00:06:50,700
We can do some kind of integration.

98
00:06:50,700 --> 00:06:55,670
For every city, let's create a grid around its center.

99
00:06:55,670 --> 00:07:02,480
Something like 10 kilometers times 10 kilometers with step size of 100 meters.

100
00:07:02,480 --> 00:07:05,465
Now, using training data,

101
00:07:05,465 --> 00:07:07,700
for every cell in the grid,

102
00:07:07,700 --> 00:07:12,835
we can count how many hotels of which type are present there.

103
00:07:12,835 --> 00:07:15,845
If a circumference goes through a cell,

104
00:07:15,845 --> 00:07:20,705
we give plus one to the hotel type corresponding to that circumference.

105
00:07:20,705 --> 00:07:27,735
During inference, we also draw a circumference based on destination distance feature.

106
00:07:27,735 --> 00:07:33,020
We see from what degree its cells it went through and use information

107
00:07:33,020 --> 00:07:37,950
from those cells to create features like a sum of all counters,

108
00:07:37,950 --> 00:07:39,430
average of all counters,

109
00:07:39,430 --> 00:07:42,995
maximum of all counters and so on.

110
00:07:42,995 --> 00:07:48,220
Great. We have covered the part of feature engineering.

111
00:07:48,220 --> 00:07:51,640
Note that all the features directly used target label.

112
00:07:51,640 --> 00:07:56,795
We cannot use them as is in training.

113
00:07:56,795 --> 00:08:01,005
We should generate them in out-of-fold fashion for train data.

114
00:08:01,005 --> 00:08:07,865
So we had training data for years 2013 and 2014.

115
00:08:07,865 --> 00:08:11,365
To generate features for year 2014,

116
00:08:11,365 --> 00:08:16,935
we used labelled data from year 2013 and vice versa,

117
00:08:16,935 --> 00:08:22,785
used the year 2014 to generate features for the year 2013.

118
00:08:22,785 --> 00:08:25,355
For the test features,

119
00:08:25,355 --> 00:08:28,360
which was from year 2015,

120
00:08:28,360 --> 00:08:31,590
we naturally used all training data.

121
00:08:31,590 --> 00:08:39,070
In the end, we calculated a lot of features and serve them into Xgboost model.

122
00:08:39,070 --> 00:08:42,555
After 16 hours of training for the course,

123
00:08:42,555 --> 00:08:44,466
we got our results.

124
00:08:44,466 --> 00:08:49,510
We ended up on third position on public leader-boards and forth on private.

125
00:08:49,510 --> 00:08:54,595
We did good, but we still did not fully exploit data leakage.

126
00:08:54,595 --> 00:08:56,130
If you check the leaderboard,

127
00:08:56,130 --> 00:09:00,525
you'll notice the difference in scores between first place and the rest.

128
00:09:00,525 --> 00:09:04,995
Under speculation, the winner did extraordinary.

129
00:09:04,995 --> 00:09:09,950
Although, in general, his methods were very similar to ours.

130
00:09:09,950 --> 00:09:13,675
He was able to extract way more signal.

131
00:09:13,675 --> 00:09:18,190
Finally, I hope you enjoyed my story.

132
00:09:18,190 --> 00:09:20,740
As you can see, sometimes working with

133
00:09:20,740 --> 00:09:24,285
data leakage could be very interesting and challenging.

134
00:09:24,285 --> 00:09:28,865
You may develop some unusual skills and broaden your horizons.

135
00:09:28,865 --> 00:09:31,410
Thank you for your attention.