[SOUND] Now I will tell you about a competition-specific technique tightly connected with data leaks: leaderboard probing. There are actually two types of leaderboard probing. The first one is simply extracting the ground truth of the public part of the test set from the leaderboard. It is usually pretty harmless; it only gives you a little more training data. It is also relatively easy to do: you change your submission on a small set of rows, so that you can unambiguously calculate the ground truth for those rows from the change in leaderboard score. I suggest checking out the link to Oleg Trott's post in the additional materials. He thoroughly explains how to do this very efficiently, with a minimal number of submissions.

Our main focus will be on the other type of leaderboard probing. Remember the purpose of the public/private split: it is supposed to protect the private part of the test set from information extraction. It turns out that it is still vulnerable. Sometimes it is possible to submit predictions in a way that gives out information about the private data. It is all about consistent categories. Imagine a chunk of data with the same target for every row; like in the example, rows with the same IDs have the same target. Organizers split it into public and private parts, but we still know that this particular chunk has the same label for every row. After setting all predictions for that chunk close to 0 in our submission, we can expect two outcomes. The first one: the score improves. It means that the ground truth in the public part is 0, and, since our chunk has the same label everywhere, the ground truth in the private part is 0 as well. The second outcome: the score becomes worse. Similarly, it means that the ground truth in both the public and the private part is 1.

Some competitions indeed have that kind of categories, categories that with high certainty have the same label. You could have encountered them in the Red Hat and West Nile competitions, where exploiting them was a key to winning. With a lot of submissions, one can explore a good part of the private test set this way. It is probably the most annoying type of data leak: it is purely technical, and if it becomes known only close to the competition deadline, you simply will not have enough submissions left to fully exploit it.

Furthermore, this is only the tip of the iceberg. When I say "consistent category", I do not necessarily mean that the category has the same target; it could be consistent in different ways, the definition is quite broad. For example, the target label could simply have the same distribution in the public and private parts of the data. That was the case in the Quora Question Pairs competition: a binary classification task evaluated by the log loss metric. What is important is that the target variable had different distributions in train and test, but allegedly the same distribution in the public and private parts of the test data. Because of that, we could benefit a lot from leaderboard probing by treating the whole test set as one consistent category.

Take a look at the formula on the slide. It is the logarithmic loss for a submission with a constant prediction C: L = -(N1/N) * ln(C) - (1 - N1/N) * ln(1 - C), where N is the total number of rows in the test set, N1 is the number of rows with target one, and L is the leaderboard score given by that constant prediction. From this equation we can calculate N1 divided by N, or, in other words, the true ratio of ones in the test set. That knowledge was very beneficial: we could use it to rebalance the training data so that it has the same target distribution as the test set.
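A minimal sketch of that calculation, assuming only the standard log loss formula (the function name and the numbers below are illustrative, not the actual Quora values):

```python
import math

def ones_ratio(L, C):
    """Recover N1/N, the true ratio of ones in the test set, from the
    leaderboard score L of a submission with constant prediction C.

    Log loss of a constant prediction:
        L = -(N1/N) * ln(C) - (1 - N1/N) * ln(1 - C)
    Solving for N1/N gives the expression below.
    """
    # C = 0.5 yields L = ln(2) for any target ratio, so it carries no information.
    assert 0.0 < C < 1.0 and C != 0.5
    return (L + math.log(1.0 - C)) / (math.log(1.0 - C) - math.log(C))

# Illustrative: a constant submission of 0.3 scoring 0.526 implies ~20% ones.
print(ones_ratio(L=0.526, C=0.3))  # ~0.2
```

Knowing the ratio, one can resample or reweight the training data until its target distribution matches the recovered test ratio.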
This little trick gave a huge boost in leaderboard score. As you can see, leaderboard probing is a very serious problem that can occur under a lot of different circumstances. I hope that someday it will be completely eradicated from competitive machine learning.

Now, finally, I would like to briefly walk through the most peculiar and interesting competitions with data leaks. First, let's take a look at the Truly Native competition from a different point of view. In this competition, participants were asked to predict whether the content of an HTML file is sponsored or not. As was already discussed in the previous video, there was a data leak in the archive dates: we can assume that sponsored and non-sponsored HTML files were collected during different periods of time. So do we really get rid of the data leak after erasing the archive dates? The answer is no. The text in an HTML file may be connected to dates in a lot of ways, from explicit timestamps to much more subtle things like news content. As you have probably already realized, the real problem was not the metadata leak but the data collection itself. Even without the metainformation, machine learning algorithms will focus on features that are actually useless, features that only act as proxies for the date.

The next example is Expedia Hotel Recommendations. In that competition, participants worked with logs of customer behavior: what customers searched for, how they interacted with search results, whether they clicked or booked, and whether or not the search result was a travel package. Expedia was interested in predicting which hotel group a user is going to book. Within the logs of customer behavior there was a very tricky feature: the distance from the user to the hotel they were searching for. It turned out that this feature was a huge data leak. Using this distance, it was possible to reverse engineer the underlying coordinates and simply map the ground truth from the train set onto the test set. I strongly suggest you check out the special video dedicated to this competition. I hope you will find it very useful, because the approaches and methods for exploiting this data leak were extremely nontrivial, and you will find a lot of interesting tricks in it.

The next example is from the Flavours of Physics competition. It was a pretty complicated problem dealing with physics at the Large Hadron Collider. The special thing about that competition was that the signal was artificially simulated: organizers wanted a machine learning solution for something that had never been observed, and that is why the signal had to be simulated. But a simulation cannot be perfect, and it is possible to reverse engineer it. The organizers even created special statistical tests in order to punish models that exploit simulation flaws. However, it was in vain: one could bypass the tests, fully exploit the simulation flaws, and get a perfect score on the leaderboard.

The last example covers pairwise tasks, where one needs to predict whether a given pair of items are duplicates or not, like in the Quora Question Pairs competition. There is one thing common to all competitions with pairwise tasks: participants are not asked to evaluate all possible pairs. There is always some nonrandom subsampling, and this subsampling is the cause of the data leak. Usually, organizers sample mostly hard-to-distinguish pairs, and because of that there is, of course, an imbalance in item frequencies. It results in more frequent items having a higher probability of being duplicates, which can be exploited directly as a feature (see the sketch below).
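A minimal sketch of that frequency feature (the DataFrame and column names here are hypothetical toy stand-ins, not the actual Quora data):

```python
import pandas as pd

# Hypothetical candidate pairs (train and test combined); in Quora
# Question Pairs these would be question ids rather than toy integers.
pairs = pd.DataFrame({
    "item1": [0, 0, 1, 1, 2],
    "item2": [1, 2, 2, 3, 3],
})

# How often each item occurs across all sampled pairs.
freq = pd.concat([pairs["item1"], pairs["item2"]]).value_counts()

# Leak features: since organizers oversample hard pairs, frequently
# occurring items are more likely to be part of a duplicate pair.
pairs["freq1"] = pairs["item1"].map(freq)
pairs["freq2"] = pairs["item2"].map(freq)
pairs["min_freq"] = pairs[["freq1", "freq2"]].min(axis=1)
print(pairs)
```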
But that's not all. We can also create a connectivity matrix of size N by N, where N is the total number of items: if item i and item j appeared together in a pair, we place a 1 in positions (i, j) and (j, i). Now we can treat the rows of the connectivity matrix as vector representations of the items, which means we can compute similarities between those vectors. This trick works for a very simple reason: when two items have similar sets of neighbors, they have a high probability of being duplicates. A sketch of this follows below.
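A minimal sketch of the neighbor-similarity trick, reusing the same hypothetical toy pairs as above (scipy and scikit-learn assumed; with a realistically large N the matrix has to stay sparse):

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from sklearn.preprocessing import normalize

# Same hypothetical pair list as before, items encoded as 0..n-1.
pairs = pd.DataFrame({
    "item1": [0, 0, 1, 1, 2],
    "item2": [1, 2, 2, 3, 3],
})
n = int(pairs[["item1", "item2"]].to_numpy().max()) + 1
i = pairs["item1"].to_numpy()
j = pairs["item2"].to_numpy()
ones = np.ones(len(pairs))

# Symmetric binary connectivity matrix: M[i, j] = M[j, i] = 1
# whenever items i and j appeared together in a pair.
M = coo_matrix(
    (np.r_[ones, ones], (np.r_[i, j], np.r_[j, i])), shape=(n, n)
).tocsr()
M.data[:] = 1  # collapse duplicated pairs to binary

# Each row is an item's neighbor vector; the cosine similarity of two
# rows measures how much the items' neighbor sets overlap.
Mn = normalize(M, norm="l2", axis=1)
pairs["neighbor_sim"] = np.asarray(Mn[i].multiply(Mn[j]).sum(axis=1)).ravel()
print(pairs)
```

Pairs whose rows share many neighbors get a similarity close to 1, which is strong evidence of a duplicate even before looking at the items themselves.

That's it for data leaks. I hope you got the concept and found the examples interesting. Thank you for your attention. [SOUND]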