[SOUND] Now I will tell you about a competition-specific technique tightly connected with data leaks: leaderboard probing.

There are actually two types of leaderboard probing. The first one is simply extracting ground truth from the public part of the leaderboard. It's usually pretty harmless, amounting to only a little more training data. It is also relatively easy to do: you change your submission on a small set of rows so that you can unambiguously calculate the ground truth for those rows from the leaderboard score. I suggest checking out the link to Oleg Trott's post in the additional materials; he thoroughly explains how to do it very efficiently with a minimal number of submissions.

Our main focus will be on another type of leaderboard probing. Remember the purpose of the public/private split: it's supposed to protect the private part of the test set from information extraction. It turns out that it's still vulnerable. Sometimes it's possible to submit predictions in such a way that they give out information about the private data.

It's all about consistent categories. Imagine a chunk of data with the same target for every row — like in the example, where rows with the same ID have the same target. Organizers split it into public and private parts, but we still know that that particular chunk has the same label for every row. After setting all the predictions close to 0 in our submission for that particular chunk of data, we can expect two outcomes. The first one is when the score improves: it means that the ground truth in the public part is 0, and it also means that the ground truth in the private part is 0 as well — remember, our chunk has the same labels. The second outcome is when the score becomes worse. Similarly, it means that the ground truth in both public and private parts is 1.
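As a rough illustration (a sketch, not something from the lecture materials), the probing step could look like the following. It assumes a hypothetical pandas test table with an `id` column defining the consistent chunks, a baseline submission with a `prediction` column aligned row-for-row with the test set, and a lower-is-better metric such as log loss; all names here are made up for the example.

```python
import pandas as pd

def make_probe_submission(baseline: pd.DataFrame, test: pd.DataFrame,
                          chunk_id, probe_value: float = 0.0) -> pd.DataFrame:
    """Copy the baseline submission, but force the prediction for one
    consistent chunk (all rows whose `id` equals `chunk_id`) to `probe_value`.
    Assumes `baseline` and `test` share the same row order / index."""
    probe = baseline.copy()
    chunk_rows = test["id"] == chunk_id
    probe.loc[chunk_rows, "prediction"] = probe_value
    return probe

def interpret_probe(baseline_score: float, probe_score: float,
                    higher_is_better: bool = False) -> int:
    """If pushing the chunk towards 0 improved the leaderboard score, the
    chunk's label is 0 in both public and private parts; otherwise it is 1."""
    improved = (probe_score > baseline_score) if higher_is_better \
               else (probe_score < baseline_score)
    return 0 if improved else 1
```

Each probe costs one leaderboard submission, which is why the number of chunks you can resolve this way is limited by the submission quota.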
Some competitions indeed have that kind of categories — categories that, with high certainty, share the same label. You could have encountered them in the Red Hat and West Nile competitions, where they were key to winning. With a lot of submissions, one can explore a good part of the private test set this way. It's probably the most annoying type of data leak: it's mostly technical, and even if it's revealed close to the competition deadline, you simply won't have enough submissions to fully exploit it.

Furthermore, this is only the tip of the iceberg. When I say "consistent category", I do not necessarily mean that the category has the same target; it could be consistent in different ways. The definition is quite broad. For example, the target label could simply have the same distribution in the public and private parts of the data.

That was the case in the Quora Question Pairs competition. In that competition there was a binary classification task evaluated by the log loss metric. What's important, the target variable had different distributions in train and test, but allegedly the same distribution in the public and private parts of the test data. Because of that, we could benefit a lot via leaderboard probing, treating the whole test set as one consistent category.

Take a look at the formula on the slide. It is the logarithmic loss for a submission with a constant prediction C, where N is the real number of rows, N1 is the number of rows with target one, and L is the leaderboard score given by that constant prediction. From this equation, we can calculate N1 divided by N — in other words, the true ratio of ones in the test set. That knowledge was very beneficial: we could use it to rebalance the training data so that it has the same distribution of the target variable as the test set. This little trick gave a huge boost in leaderboard score.
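To make the arithmetic concrete, here is a small sketch, assuming the slide's formula is the usual log loss written out for a constant prediction: submitting a single constant C everywhere and reading the leaderboard score L is enough to recover the share of positive rows. The function name is illustrative.

```python
import math

def ratio_of_ones(constant_pred: float, leaderboard_score: float) -> float:
    """Invert the log loss of a constant submission.

    For a constant prediction C on a test set with a fraction r of positive
    rows, the log loss is  L = -r*ln(C) - (1 - r)*ln(1 - C).
    Solving for r gives   r = (L + ln(1 - C)) / (ln(1 - C) - ln(C)).
    Note: C must not be 0.5, otherwise L carries no information about r.
    """
    c, l = constant_pred, leaderboard_score
    return (l + math.log(1 - c)) / (math.log(1 - c) - math.log(c))

# Example: submit a constant 0.3 everywhere; if the leaderboard returns
# roughly 0.5261, the true share of ones in the test set is about 0.2.
print(ratio_of_ones(0.3, 0.5261))  # ~0.2
```

Once that ratio is known, one could resample or reweight the training set so its positive rate matches the test set — the rebalancing trick mentioned above.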
As you can see, leaderboard probing is a very serious problem that can occur under a lot of different circumstances. I hope that someday it will be completely eradicated from competitive machine learning.

Now, finally, I'd like to briefly walk through the most peculiar and interesting competitions with data leakage.

First, let's take a look at the Truly Native competition from a different point of view. In this competition, participants were asked to predict whether the content in an HTML file is sponsored or not. As was already discussed in the previous video, there was a data leak in the archive dates: we can assume that sponsored and non-sponsored HTML files were collected during different periods of time. So do we really get rid of the data leak after erasing the archive dates? The answer is no. Texts in HTML files may be connected to dates in a lot of ways, from explicit timestamps to much more subtle things like news content. As you've probably already realized, the real problem was not the metadata leak but rather the data collection procedure. Even without meta information, machine learning algorithms will focus on essentially useless features — features that only act as proxies for the date.

The next example is Expedia Hotel Recommendations. In that competition, participants worked with logs of customer behavior: what customers searched for, how they interacted with search results (clicks and bookings), and whether or not the search result was a travel package. Expedia was interested in predicting which hotel group a user is going to book. Within the logs of customer behavior there was a very tricky feature: the distance from the user to the hotel they were searching for. It turned out that this feature is actually a huge data leak. Using this distance, it was possible to reverse engineer the true coordinates and simply map the ground truth from the train set to the test set. I strongly suggest you check out the special video dedicated to this competition. I hope you will find it very useful, because the approaches and methods of exploiting this data leak were extremely nontrivial, and you will find a lot of interesting tricks in it.

The next example is from the Flavours of Physics competition. It was a pretty complicated problem dealing with physics at the Large Hadron Collider. The special thing about that competition was that the signal was artificially simulated: organizers wanted a machine learning solution for something that has never been observed, which is why the signal was simulated. But a simulation cannot be perfect, and it's possible to reverse engineer it. Organizers even created special statistical tests in order to punish models that exploit simulation flaws. However, it was in vain: one could bypass the tests, fully exploit the simulation flaws, and get a perfect score on the leaderboard.

The last example covers pairwise tasks, where one needs to predict whether a given pair of items are duplicates or not, like in the Quora Question Pairs competition. There is one thing common to all competitions with pairwise tasks: participants are not asked to evaluate all possible pairs. There is always some nonrandom subsampling, and this subsampling is the cause of data leakage. Usually, organizers sample mostly hard-to-distinguish pairs. That, of course, creates an imbalance in item frequencies, and it results in more frequent items having a higher probability of being duplicates. But that's not all. We can create a connectivity matrix of size N by N, where N is the total number of items: if item i and item j appeared together in a pair, we place a 1 in positions (i, j) and (j, i). Now we can treat the rows of the connectivity matrix as vector representations for every item, which means we can compute similarities between those vectors. This trick works for a very simple reason: when two items have similar sets of neighbors, they have a high probability of being duplicates.
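A minimal sketch of that construction, assuming the sampled pairs are given as (i, j) index tuples, could look like this; cosine similarity between rows is just one reasonable choice for comparing neighbor sets.

```python
import numpy as np
from scipy.sparse import lil_matrix
from sklearn.metrics.pairwise import cosine_similarity

def neighbor_similarity_feature(pairs, n_items):
    """Build a symmetric item-item connectivity matrix from the sampled pairs
    and score each pair by the cosine similarity of its two rows."""
    conn = lil_matrix((n_items, n_items))
    for i, j in pairs:            # every sampled pair marks both (i, j) and (j, i)
        conn[i, j] = 1
        conn[j, i] = 1
    conn = conn.tocsr()
    sims = cosine_similarity(conn)  # rows act as vector representations of items
    # The leak feature: how much the two items' neighborhoods overlap.
    return np.array([sims[i, j] for i, j in pairs])

# Toy usage: items 0 and 1 share neighbors 2 and 3, so their pair scores highest.
pairs = [(0, 2), (0, 3), (1, 2), (1, 3), (0, 1)]
print(neighbor_similarity_feature(pairs, n_items=4))
```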
That's it for data leaks. I hope you got the concept and found a lot of interesting examples here. Thank you for your attention. [SOUND]