1 00:00:03,640 --> 00:00:06,470 Hi everyone. In this section, 2 00:00:06,470 --> 00:00:12,570 we will talk about a very sensitive topic: data leakage, or more simply, leaks. 3 00:00:12,570 --> 00:00:16,870 We'll define leakage in a very general sense as 4 00:00:16,870 --> 00:00:19,450 unexpected information in the data that 5 00:00:19,450 --> 00:00:22,615 allows us to make unrealistically good predictions. 6 00:00:22,615 --> 00:00:24,080 For the time being, 7 00:00:24,080 --> 00:00:26,485 you may think of it as directly or 8 00:00:26,485 --> 00:00:31,490 indirectly adding ground truth into the test data. 9 00:00:31,490 --> 00:00:33,765 Data leaks are very, very bad. 10 00:00:33,765 --> 00:00:36,835 They are completely unusable in the real world. 11 00:00:36,835 --> 00:00:43,245 They usually provide way too much signal and thus make competitions lose their main point, 12 00:00:43,245 --> 00:00:47,560 and quickly turn them into a leak hunt. 13 00:00:47,560 --> 00:00:50,915 People are often very sensitive about this matter. 14 00:00:50,915 --> 00:00:52,960 They tend to overreact. 15 00:00:52,960 --> 00:00:54,755 That's completely understandable. 16 00:00:54,755 --> 00:00:57,825 After spending a lot of time on solving the problem, 17 00:00:57,825 --> 00:01:01,845 a sudden data leak may render all of that useless. 18 00:01:01,845 --> 00:01:04,640 It is not a pleasant position to be in. 19 00:01:04,640 --> 00:01:09,100 I cannot force you to turn a blind eye, but keep in mind, 20 00:01:09,100 --> 00:01:13,030 there is no ill intent whatsoever. 21 00:01:13,030 --> 00:01:17,515 Data leaks are the result of unintentional errors, accidents. 22 00:01:17,515 --> 00:01:19,270 Even if you find yourself in 23 00:01:19,270 --> 00:01:23,770 a competition with an unexpected data leak close to the deadline, 24 00:01:23,770 --> 00:01:26,520 please be more tolerant.
25 00:01:26,520 --> 00:01:29,295 The question of whether to exploit a data leak 26 00:01:29,295 --> 00:01:33,055 or not is exclusive to machine learning competitions. 27 00:01:33,055 --> 00:01:37,875 In the real world, the answer is obviously no, nothing to discuss. 28 00:01:37,875 --> 00:01:39,400 But in a competition, 29 00:01:39,400 --> 00:01:43,325 the ultimate goal is to get a higher leaderboard position. 30 00:01:43,325 --> 00:01:45,355 And if you truly pursue that goal, 31 00:01:45,355 --> 00:01:49,045 then exploit the leak in every way possible. 32 00:01:49,045 --> 00:01:50,440 Further in this section, 33 00:01:50,440 --> 00:01:53,285 I will show you the main types of data leaks 34 00:01:53,285 --> 00:01:56,790 that can appear while solving a machine learning problem. 35 00:01:56,790 --> 00:02:03,550 We will also focus on a competition-specific leak exploitation technique: leaderboard probing. 36 00:02:03,550 --> 00:02:06,190 Finally, you will find special videos 37 00:02:06,190 --> 00:02:11,040 dedicated to the most interesting and non-trivial data leaks. 38 00:02:11,040 --> 00:02:17,910 I will start with the most typical data leaks that may occur in almost every problem. 39 00:02:17,910 --> 00:02:20,125 Time series is our first target. 40 00:02:20,125 --> 00:02:23,015 Typically, future peeking. 41 00:02:23,015 --> 00:02:26,780 It is common sense not to peek into the future. For example, 42 00:02:26,780 --> 00:02:32,570 can we use the stock market price from the day after tomorrow to predict the price for tomorrow? 43 00:02:32,570 --> 00:02:36,215 Of course not. However, direct usage of 44 00:02:36,215 --> 00:02:41,240 future information through incorrect time splits still exists. 45 00:02:41,240 --> 00:02:44,830 When you enter a time series competition, first 46 00:02:44,830 --> 00:02:48,005 check the train, public, and private splits. 47 00:02:48,005 --> 00:02:50,630 If even one of them is not split by time, 48 00:02:50,630 --> 00:02:53,105 then you have found a data leak.
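The split check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: each split is represented as a plain list of dates, and the function name `check_time_order` and the toy dates are hypothetical, not taken from any competition. It compares only neighbouring splits in the order you pass them.

```python
from datetime import date


def check_time_order(splits):
    """Return (earlier, later) name pairs whose date ranges overlap.

    `splits` is a list of (name, dates) pairs given in the order they are
    supposed to follow each other in time, e.g. train, public, private.
    """
    problems = []
    for (name_a, dates_a), (name_b, dates_b) in zip(splits, splits[1:]):
        # In a clean time split, every row of the earlier part
        # comes strictly before every row of the later part.
        if max(dates_a) >= min(dates_b):
            problems.append((name_a, name_b))
    return problems


# Toy data (hypothetical): the public part overlaps with train.
train = [date(2017, 1, d) for d in range(1, 21)]
public = [date(2017, 1, d) for d in range(15, 26)]
private = [date(2017, 1, d) for d in range(26, 32)]

overlaps = check_time_order(
    [("train", train), ("public", public), ("private", private)]
)
print(overlaps)
```

Any pair reported here means the later split contains rows that are not strictly after the earlier one, which is worth a closer look before trusting validation scores.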
49 00:02:53,105 --> 00:03:01,165 In such a case, unrealistic features like prices next week will be the most important. 50 00:03:01,165 --> 00:03:03,210 But even when split by time, 51 00:03:03,210 --> 00:03:06,245 the data still contains information about the future. 52 00:03:06,245 --> 00:03:09,800 We can still access the rows from the test set. 53 00:03:09,800 --> 00:03:13,790 We can have future user history in CTR tasks, 54 00:03:13,790 --> 00:03:20,145 some fundamental indicators in stock market prediction tasks, and so on. 55 00:03:20,145 --> 00:03:24,510 There are only two ways to eliminate the possibility of data leakage: 56 00:03:24,510 --> 00:03:29,090 competitions where one cannot access 57 00:03:29,090 --> 00:03:34,150 rows from the future, or a test set with no features at all, only IDs. 58 00:03:34,150 --> 00:03:39,740 For example, just the number and instrument ID in stock market prediction, 59 00:03:39,740 --> 00:03:45,420 so participants create features based on the past and join them themselves. 60 00:03:45,420 --> 00:03:48,610 Now, let's discuss something more unusual. 61 00:03:48,610 --> 00:03:52,820 Those types of data leaks are much harder to find. 62 00:03:52,820 --> 00:03:56,810 We often have more than just train and test files. 63 00:03:56,810 --> 00:04:01,140 For example, a lot of images or text in an archive. 64 00:04:01,140 --> 00:04:04,970 In such a case, we can access some meta information: 65 00:04:04,970 --> 00:04:08,950 file creation date, image resolution, et cetera. 66 00:04:08,950 --> 00:04:13,890 It turns out that this meta information may be connected to the target variable. 67 00:04:13,890 --> 00:04:18,535 Imagine classic cats versus dogs classification. 68 00:04:18,535 --> 00:04:20,640 What if cat pictures were taken before dog pictures? 69 00:04:20,640 --> 00:04:24,010 Or taken with a different camera?
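To make the meta information idea concrete, here is a small sketch. It assumes the images arrive in a zip archive and that we know the training labels per file name. The archive contents, the file names, and the helper `archive_years_by_label` are all made up for illustration. It only uses the standard `zipfile` module, where each entry exposes its stored timestamp as `ZipInfo.date_time`.

```python
import io
import zipfile
from collections import defaultdict


def archive_years_by_label(zip_bytes, labels):
    """Map each label to the sorted list of years stored for its files."""
    years = defaultdict(set)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            # ZipInfo.date_time is (year, month, day, hour, minute, second).
            years[labels[info.filename]].add(info.date_time[0])
    return {label: sorted(ys) for label, ys in years.items()}


# Build a toy archive in memory (hypothetical files and dates).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(zipfile.ZipInfo("cat_1.jpg", date_time=(2015, 3, 1, 12, 0, 0)), b"")
    zf.writestr(zipfile.ZipInfo("cat_2.jpg", date_time=(2015, 4, 2, 9, 30, 0)), b"")
    zf.writestr(zipfile.ZipInfo("dog_1.jpg", date_time=(2016, 1, 5, 8, 0, 0)), b"")
    zf.writestr(zipfile.ZipInfo("dog_2.jpg", date_time=(2016, 2, 7, 18, 0, 0)), b"")

labels = {"cat_1.jpg": "cat", "cat_2.jpg": "cat",
          "dog_1.jpg": "dog", "dog_2.jpg": "dog"}

result = archive_years_by_label(buf.getvalue(), labels)
print(result)
```

If the years (or any other piece of metadata) barely overlap between classes, as in this toy archive, the metadata alone is predictive of the target and should be treated as a likely leak.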
70 00:04:24,010 --> 00:04:29,510 Because of that, a good practice for organizers is to erase the metadata, 71 00:04:29,510 --> 00:04:32,750 resize the pictures, and change the creation date. 72 00:04:32,750 --> 00:04:36,195 Unfortunately, sometimes they forget about it. 73 00:04:36,195 --> 00:04:39,210 A good example is the Truly Native competition, 74 00:04:39,210 --> 00:04:44,505 where one could get nearly perfect scores using just the dates from zip archives. 75 00:04:44,505 --> 00:04:48,380 Another type of leakage can be found in IDs. 76 00:04:48,380 --> 00:04:54,285 IDs are unique identifiers of every row, usually used for convenience. 77 00:04:54,285 --> 00:04:57,410 It makes no sense to include them in the model. 78 00:04:57,410 --> 00:05:00,905 It is assumed that they are automatically generated. 79 00:05:00,905 --> 00:05:04,060 In reality, that's not always true. 80 00:05:04,060 --> 00:05:06,510 An ID may be a hash of something, 81 00:05:06,510 --> 00:05:09,295 probably not intended for disclosure. 82 00:05:09,295 --> 00:05:14,075 It may contain traces of information connected to the target variable. 83 00:05:14,075 --> 00:05:16,605 This was the case in the Caterpillar competition. 84 00:05:16,605 --> 00:05:20,270 Adding the ID as a feature slightly improved the result. 85 00:05:20,270 --> 00:05:23,930 So I advise you to pay close attention to IDs and 86 00:05:23,930 --> 00:05:27,875 always check whether they are useful or not. 87 00:05:27,875 --> 00:05:29,965 Next is row order. 88 00:05:29,965 --> 00:05:35,230 In a trivial case, the data may be ordered by the target variable. 89 00:05:35,230 --> 00:05:39,040 Sometimes simply adding the row number or relative number 90 00:05:39,040 --> 00:05:41,200 suddenly improves the score, 91 00:05:41,200 --> 00:05:44,680 like in the Telstra Network Disruptions competition.
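One quick way to check whether a row number (or a numeric ID) carries signal is to correlate it with the target. The sketch below is a rough illustration, not a full test: the `pearson` helper and both toy target lists are made up, and a plain linear correlation will miss non-linear patterns, so a low value does not prove the ID is clean.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical file where all zeros come first, then all ones.
ordered_targets = [0] * 50 + [1] * 50
# Hypothetical file where the classes alternate, so position says little.
mixed_targets = [0, 1] * 50

row_numbers = list(range(100))

r_ordered = pearson(row_numbers, ordered_targets)
r_mixed = pearson(row_numbers, mixed_targets)
print(round(r_ordered, 3), round(r_mixed, 3))
```

A correlation far from zero, as in the ordered file, suggests that the row number is worth trying as a feature. In practice the more reliable check is to add it to the model and compare validation scores.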
92 00:05:44,680 --> 00:05:47,995 It's also possible to find something way more interesting, 93 00:05:47,995 --> 00:05:52,420 like in the TalkingData Mobile User Demographics competition. 94 00:05:52,420 --> 00:05:55,220 There was some kind of row duplication: 95 00:05:55,220 --> 00:05:59,610 rows next to each other usually had the same label. 96 00:05:59,610 --> 00:06:02,500 That is it for the regular types of leaks. 97 00:06:02,500 --> 00:06:05,050 To sum things up, in this video, 98 00:06:05,050 --> 00:06:12,780 we introduced the concept of a data leak and covered data leaks from future peeking, 99 00:06:12,780 --> 00:06:16,380 metadata, IDs, and row order.
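For the pattern where neighbouring rows tend to share a label, a simple diagnostic is to compare how often consecutive rows agree with how often they would agree by chance. This is a minimal sketch with made-up labels. The function names are hypothetical, and the chance baseline assumes rows are drawn independently from the observed label frequencies.

```python
from collections import Counter


def neighbor_agreement(labels):
    """Fraction of consecutive row pairs that share the same label."""
    pairs = list(zip(labels, labels[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)


def chance_agreement(labels):
    """Expected agreement if rows were independent draws from the label frequencies."""
    n = len(labels)
    return sum((count / n) ** 2 for count in Counter(labels).values())


# Hypothetical training labels in file order.
labels = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "a"]

observed = neighbor_agreement(labels)
expected = chance_agreement(labels)
print(round(observed, 3), round(expected, 3))
```

When the observed agreement is well above the chance baseline, the labels of neighbouring train rows are informative about a row's own label, which is the kind of row-order leak described above.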