[SOUND] Now I will tell you about a competition-specific technique tightly connected with data leaks: leaderboard probing. There are actually two types of leaderboard probing. The first one is simply extracting the ground truth of the public part of the test set from the leaderboard. It is usually pretty harmless; it only gives you a little more training data. It is also relatively easy to do: you change your submission on a small set of rows, so that you can unambiguously calculate the ground truth for those rows from the change in leaderboard score. I suggest checking out the link to Oleg Trott's post in the additional materials. He thoroughly explains how to do this very efficiently, with a minimal number of submissions.

Our main focus will be on the other type of leaderboard probing. Remember the purpose of the public/private split: it is supposed to protect the private part of the test set from information extraction. It turns out that it is still vulnerable. Sometimes it is possible to submit predictions in a way that gives out information about the private data. It is all about consistent categories. Imagine a chunk of data with the same target for every row; like in the example, rows with the same IDs have the same target. Organizers split it into public and private parts, but we still know that this particular chunk has the same label for every row. After setting all predictions for that chunk close to 0 in our submission, we can expect two outcomes. The first one: the score improves. It means that the ground truth in the public part is 0, and, since our chunk has the same label everywhere, the ground truth in the private part is 0 as well. The second outcome: the score becomes worse. Similarly, it means that the ground truth in both the public and the private part is 1.

Some competitions indeed have that kind of categories, categories that with high certainty have the same label. You could have encountered them in the Red Hat and West Nile competitions, where exploiting them was a key to winning. With a lot of submissions, one can explore a good part of the private test set this way. It is probably the most annoying type of data leak: it is purely technical, and if it becomes known only close to the competition deadline, you simply will not have enough submissions left to fully exploit it.

Furthermore, this is only the tip of the iceberg. When I say "consistent category", I do not necessarily mean that the category has the same target; it could be consistent in different ways, the definition is quite broad. For example, the target label could simply have the same distribution in the public and private parts of the data. That was the case in the Quora Question Pairs competition: a binary classification task evaluated by the log loss metric. What is important is that the target variable had different distributions in train and test, but allegedly the same distribution in the public and private parts of the test data. Because of that, we could benefit a lot from leaderboard probing by treating the whole test set as one consistent category.

Take a look at the formula on the slide. It is the logarithmic loss for a submission with a constant prediction C: L = -(N1/N) * ln(C) - (1 - N1/N) * ln(1 - C), where N is the total number of rows in the test set, N1 is the number of rows with target one, and L is the leaderboard score given by that constant prediction. From this equation we can calculate N1 divided by N, or, in other words, the true ratio of ones in the test set. That knowledge was very beneficial: we could use it to rebalance the training data so that it has the same target distribution as the test set.
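A minimal sketch of that calculation, assuming only the standard log loss formula (the function name and the numbers below are illustrative, not the actual Quora values):

```python
import math

def ones_ratio(L, C):
    """Recover N1/N, the true ratio of ones in the test set, from the
    leaderboard score L of a submission with constant prediction C.

    Log loss of a constant prediction:
        L = -(N1/N) * ln(C) - (1 - N1/N) * ln(1 - C)
    Solving for N1/N gives the expression below.
    """
    # C = 0.5 yields L = ln(2) for any target ratio, so it carries no information.
    assert 0.0 < C < 1.0 and C != 0.5
    return (L + math.log(1.0 - C)) / (math.log(1.0 - C) - math.log(C))

# Illustrative: a constant submission of 0.3 scoring 0.526 implies ~20% ones.
print(ones_ratio(L=0.526, C=0.3))  # ~0.2
```

Knowing the ratio, one can resample or reweight the training data until its target distribution matches the recovered test ratio.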
This little trick gave a huge boost in leaderboard score. As you can see, leaderboard probing is a very serious problem that can occur under a lot of different circumstances. I hope that someday it will be completely eradicated from competitive machine learning.

Now, finally, I would like to briefly walk through the most peculiar and interesting competitions with data leaks. First, let's take a look at the Truly Native competition from a different point of view. In this competition, participants were asked to predict whether the content of an HTML file is sponsored or not. As was already discussed in the previous video, there was a data leak in the archive dates: we can assume that sponsored and non-sponsored HTML files were collected during different periods of time. So do we really get rid of the data leak after erasing the archive dates? The answer is no. The text in an HTML file may be connected to dates in a lot of ways, from explicit timestamps to much more subtle things like news content. As you have probably already realized, the real problem was not the metadata leak but the data collection itself. Even without the metainformation, machine learning algorithms will focus on features that are actually useless, features that only act as proxies for the date.

The next example is Expedia Hotel Recommendations. In that competition, participants worked with logs of customer behavior: what customers searched for, how they interacted with search results, whether they clicked or booked, and whether or not the search result was a travel package. Expedia was interested in predicting which hotel group a user is going to book. Within the logs of customer behavior there was a very tricky feature: the distance from the user to the hotel they were searching for. It turned out that this feature was a huge data leak. Using this distance, it was possible to reverse engineer the underlying coordinates and simply map the ground truth from the train set onto the test set. I strongly suggest you check out the special video dedicated to this competition. I hope you will find it very useful, because the approaches and methods for exploiting this data leak were extremely nontrivial, and you will find a lot of interesting tricks in it.

The next example is from the Flavours of Physics competition. It was a pretty complicated problem dealing with physics at the Large Hadron Collider. The special thing about that competition was that the signal was artificially simulated: organizers wanted a machine learning solution for something that had never been observed, and that is why the signal had to be simulated. But a simulation cannot be perfect, and it is possible to reverse engineer it. The organizers even created special statistical tests in order to punish models that exploit simulation flaws. However, it was in vain: one could bypass the tests, fully exploit the simulation flaws, and get a perfect score on the leaderboard.

The last example covers pairwise tasks, where one needs to predict whether a given pair of items are duplicates or not, like in the Quora Question Pairs competition. There is one thing common to all competitions with pairwise tasks: participants are not asked to evaluate all possible pairs. There is always some nonrandom subsampling, and this subsampling is the cause of the data leak. Usually, organizers sample mostly hard-to-distinguish pairs, and because of that there is, of course, an imbalance in item frequencies. It results in more frequent items having a higher probability of being duplicates, which can be exploited directly as a feature (see the sketch below).
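A minimal sketch of that frequency feature (the DataFrame and column names here are hypothetical toy stand-ins, not the actual Quora data):

```python
import pandas as pd

# Hypothetical candidate pairs (train and test combined); in Quora
# Question Pairs these would be question ids rather than toy integers.
pairs = pd.DataFrame({
    "item1": [0, 0, 1, 1, 2],
    "item2": [1, 2, 2, 3, 3],
})

# How often each item occurs across all sampled pairs.
freq = pd.concat([pairs["item1"], pairs["item2"]]).value_counts()

# Leak features: since organizers oversample hard pairs, frequently
# occurring items are more likely to be part of a duplicate pair.
pairs["freq1"] = pairs["item1"].map(freq)
pairs["freq2"] = pairs["item2"].map(freq)
pairs["min_freq"] = pairs[["freq1", "freq2"]].min(axis=1)
print(pairs)
```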
But that's not all. We can also create a connectivity matrix of size N by N, where N is the total number of items: if item i and item j appeared together in a pair, we place a 1 in positions (i, j) and (j, i). Now we can treat the rows of the connectivity matrix as vector representations of the items, which means we can compute similarities between those vectors. This trick works for a very simple reason: when two items have similar sets of neighbors, they have a high probability of being duplicates. A sketch of this follows below.
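A minimal sketch of the neighbor-similarity trick, reusing the same hypothetical toy pairs as above (scipy and scikit-learn assumed; with a realistically large N the matrix has to stay sparse):

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from sklearn.preprocessing import normalize

# Same hypothetical pair list as before, items encoded as 0..n-1.
pairs = pd.DataFrame({
    "item1": [0, 0, 1, 1, 2],
    "item2": [1, 2, 2, 3, 3],
})
n = int(pairs[["item1", "item2"]].to_numpy().max()) + 1
i = pairs["item1"].to_numpy()
j = pairs["item2"].to_numpy()
ones = np.ones(len(pairs))

# Symmetric binary connectivity matrix: M[i, j] = M[j, i] = 1
# whenever items i and j appeared together in a pair.
M = coo_matrix(
    (np.r_[ones, ones], (np.r_[i, j], np.r_[j, i])), shape=(n, n)
).tocsr()
M.data[:] = 1  # collapse duplicated pairs to binary

# Each row is an item's neighbor vector; the cosine similarity of two
# rows measures how much the items' neighbor sets overlap.
Mn = normalize(M, norm="l2", axis=1)
pairs["neighbor_sim"] = np.asarray(Mn[i].multiply(Mn[j]).sum(axis=1)).ravel()
print(pairs)
```

Pairs whose rows share many neighbors get a similarity close to 1, which is strong evidence of a duplicate even before looking at the items themselves.

That's it for data leaks. I hope you got the concept and found the examples interesting. Thank you for your attention. [SOUND]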