[SOUND] Hi and welcome back. In the previous videos we discussed the concept of validation and overfitting, and how to choose a validation strategy based on the properties of the data we have. And finally, we learned to identify the data split made by the organizers. After all this work has been done, we honestly expect that the validation will, in a way, substitute the leaderboard for us. That is, the score we see on validation will be the same as on the private leaderboard. Or at least, if we improve our model on validation, there will be improvements on the private leaderboard. This is usually true, but sometimes we encounter problems here. In most cases these problems can be divided into two big groups. In the first group are the problems we encounter during local validation. Usually they are caused by inconsistency of the data; a widespread example is getting different optimal parameters for different folds. In this case we need to make a more thorough validation. The problems from the second group often reveal themselves only when we send our submissions to the platform and observe that scores on validation and on the leaderboard don't match. In this case, the problem usually occurs because we can't mimic the exact train/test split in our validation. These are tough problems, and we definitely want to be able to handle them.

So before we start, let me provide an overview of this video. For both the validation and submission stages we will discuss the main problems, their causes, and how to handle them. And then we'll talk a bit about when we can expect a leaderboard shuffle.

Let's start with a discussion of validation stage problems. Usually, they attract our attention during validation. Generally, the main problem is a significant difference in scores and optimal parameters for different train/validation splits. Let's start with an example where we can easily explain this problem. Consider that we need to predict sales in a shop in February. Say we have target values for the last year, and, as usual, we take the last month for validation. This means January, but clearly January has many more holidays than February, and people tend to buy more, which causes target values to be higher overall. So the mean squared error of our predictions for January will be greater than for February. Does this mean that the model will perform worse in February? Probably not, at least not in terms of overfitting. As we can see, sometimes this kind of model behavior can be expected.

But what if there is no clear reason why scores differ between folds? Let's identify several common reasons for this and see what we can do about it. The first hypothesis we should consider is that we have too little data. For example, consider a case where we have a lot of patterns and trends in the data, but we do not have enough samples to generalize these patterns well. In that case, a model will utilize only some general patterns, and for each train/validation split these patterns will partially differ. This indeed will lead to a difference in the scores of the model. Furthermore, the validation samples will be different each time, which only increases the dispersion of scores across folds. The second possible reason is that the data is too diverse and inconsistent. For example, if you have very similar samples with different target values, a model can get confused by them. Consider two cases. First, if one of such samples is in the train set while the other is in validation, we can get a pretty high error for the second sample.
And in the second case, if both samples are in validation, we will get smaller errors for them. Or let's recall another example of diverse data we have already discussed a bit earlier: the example of predicting sales for January and February. There, we know the nature of, and the reason for, the differences in scores. As a quick note, notice that in this example we can reduce this diversity a bit if we validate on February from the previous year. So the main reasons for a difference in scores and optimal model parameters across folds are, first, having too little data, and second, having too diverse and inconsistent data.

Now let's outline our actions here. If we are facing this kind of problem, it can be useful to make a more thorough validation. You can increase K in KFold, but usually 5 folds are enough. Make KFold validation several times with different random splits, and average the scores to get a more stable estimate of the model's quality. In the same way we can choose the best parameters for the model. If there is a chance to overfit, it is useful to use one set of KFold splits to select parameters and another set of KFold splits to check the model's quality. Examples of competitions which required extensive validation include the Liberty Mutual Group Property Inspection Prediction competition and the Santander Customer Satisfaction competition. In both of them, the scores of the competitors were very close to each other, and thus participants tried to squeeze more from the data without overfitting, so thorough validation was crucial.

Now, having discussed validation stage problems, let's move on to submission stage problems. Sometimes you can diagnose these problems in the process of doing careful validation, but still, often you encounter this type of problem only when you submit your solution to the platform. But then again, careful analysis is your friend when it comes down to finding the root of the problem. Generally speaking, there are two cases of these issues. In the first case, the leaderboard score is consistently higher or lower than the validation score. In the second, the leaderboard score is not correlated with the validation score at all. In the worst case, we can improve our score on validation while, on the contrary, the score on the leaderboard decreases. As you can imagine, these problems can be much more troublesome.

Now remember that the main rule of making a reliable validation is to mimic the train/test split made by the organizers. I won't lie to you, it can be quite hard to identify and mimic the exact train/test split. Because of that, I highly encourage you to start submitting your solutions right after you enter the competition. It's also good to start exploring other possible roots of this problem. Let's first sort out the causes we could observe during the validation stage. Recall that we already have different model scores on different folds during validation. Here it is useful to see the leaderboard as just another validation fold. Then, if we already have different scores in KFold, getting a not very similar result on the leaderboard is not surprising. Moreover, we can calculate the mean and standard deviation of the validation scores and estimate whether the leaderboard score is within the expected range. But if this is not the case, then something is definitely wrong.
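As a rough illustration, here is a minimal sketch of running KFold several times with different random splits, averaging the scores, and using their spread to judge whether a leaderboard result falls in the expected range. It assumes numpy arrays X and y and any scikit-learn style regressor called model; these names are placeholders, not tied to any particular competition.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def repeated_kfold_score(model, X, y, n_splits=5, n_repeats=3):
    # Run KFold several times with different random splits and collect fold scores.
    scores = []
    for seed in range(n_repeats):
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for trn_idx, val_idx in kf.split(X):
            model.fit(X[trn_idx], y[trn_idx])
            preds = model.predict(X[val_idx])
            scores.append(mean_squared_error(y[val_idx], preds))
    # The mean gives a more stable quality estimate; the standard deviation gives
    # a rough range of "expected" scores to compare a leaderboard result against.
    return np.mean(scores), np.std(scores)
```

If there is a risk of overfitting while tuning, the same loop can be run with one range of seeds to select parameters and a different range to check the final quality.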
Besides that, there could be two more reasons for this problem. The first reason: we have too little data in the public leaderboard. This one is pretty self-explanatory; just trust your validation and everything will be fine. And the second: train and test data are from different distributions.

Let me explain what I mean when I talk about different distributions. Consider a regression task of predicting people's height from their photos on Instagram. The blue line represents the distribution of heights for men, while the red line represents the distribution of heights for women. As you can see, these distributions are different. Now let's assume that the train data consists only of women, while the test data consists only of men. Then all model predictions will be around the average height for women, and the distribution of these predictions will be very similar to that of the train data. No wonder our model will have a terrible score on the test data. Now, because our course is a practical one, let's take a moment and think about what you can do if you encounter this in a competition.

Okay, let's start with a general approach to such problems. At the broadest level, we need to find a way to tackle different distributions in train and test. Sometimes this kind of problem can be solved by adjusting your solution during the training procedure. But sometimes it can be solved only by adjusting your solution through the leaderboard, that is, through leaderboard probing. The simplest way to solve this particular situation in a competition is to figure out the optimal constant prediction for the train and test data, and shift your predictions by the difference. Right away we can calculate the average height of women from the train data. Calculating the average height of men is a bit trickier. If the competition's metric is mean squared error, we can send two constant submissions, write down a simple formula, and find out that the average target value for the test is equal to 70 inches. In general, this technique is known as leaderboard probing, and we will discuss it in more detail later in the course. So now we know the difference between the average target values for the train and the test data, which is equal to 7 inches. And as the third step of adjusting our submission to the leaderboard, we could just add 7 to all predictions. But note that from this point on it is not validation anymore, it is leaderboard probing. Yes, we probably could discover this during exploratory data analysis and try to make a correction in our validation scheme, but sometimes it is not possible without leaderboard probing, just like in this example. A competition which had something similar is the Quora Question Pairs competition. There, the distributions of the target in train and test were different, so one could get a good improvement in score by adjusting predictions to the leaderboard.

But fortunately, this case is rare enough. More often, we encounter situations like the following one. Consider that now the train set consists not only of women, but mostly of women, and the test set, vice versa, consists not only of men, but mostly of men. The main strategy to deal with these kinds of situations is simple: again, remember to mimic the train/test split. If the test consists mostly of men, force the validation to have the same distribution. In that case, you ensure that your validation will be fair. This is true both for getting correct raw scores and for getting correct optimal parameters. For example, we could have quite different scores and optimal parameters for the women's and men's parts of the data set. Ensuring the same distribution in test and validation helps us get scores and parameters relevant to the test.
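To make the constant-prediction trick above concrete, here is a minimal sketch of the arithmetic, assuming mean squared error as the metric. The two leaderboard scores, the train average, and the predictions below are made-up placeholder numbers, chosen only so that the result matches the 70-inch and 7-inch figures from the example; in a real competition you would plug in the scores of your actual constant submissions.

```python
def estimate_test_mean(c1, score1, c2, score2):
    # For a constant prediction c, MSE = mean(y^2) - 2*c*mean(y) + c^2.
    # Subtracting the scores of two constant submissions cancels mean(y^2),
    # which lets us solve for mean(y) on the public test set.
    return ((c1 ** 2 - c2 ** 2) - (score1 - score2)) / (2 * (c1 - c2))

# Hypothetical public leaderboard MSE scores for two constant submissions.
test_mean = estimate_test_mean(c1=60.0, score1=109.0, c2=65.0, score2=34.0)  # 70.0

train_mean = 63.0                        # e.g. average height of women in the train data
shift = test_mean - train_mean           # the 7-inch gap from the example

model_predictions = [62.5, 64.0, 63.2]   # placeholder predictions from some model
adjusted = [p + shift for p in model_predictions]  # shift every prediction by the gap
```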
I want to mention two examples of mimicking the test distribution here. First, the Data Science Game Qualification Phase: Music recommendation challenge. And second, the competition with CTR prediction which we discussed earlier in the course. Let's start with the second one. Do you remember the problem? We had a task of predicting CTR, and the train data, which was basically the history of displayed ads, obviously didn't contain ads which were not shown. On the contrary, the test data consisted of every possible ad. Notice this is exactly the case of different distributions in train and test, and again, we need to set up our validation to mimic the test here. So we had this huge bias towards shown ads in the train data, and to set up a correct validation we had to complete the validation set with rows of not shown ads.

Now, let's go back to the first example. In that competition, participants had to predict whether a user will listen to a song recommended by the assistant. So the test contained only recommended songs, but the train set, on the contrary, contained both recommended songs and songs users selected themselves. So again, one could adjust the validation by filtering out the songs users selected themselves. And again, if we do not account for that fact, then improving our model on user-selected songs can result in the validation score going up, but it doesn't have to result in the same improvements on the leaderboard.
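As a rough sketch of that kind of adjustment for the music recommendation case, here is one way to keep only test-like rows in the validation set, assuming the training history is a pandas DataFrame with a hypothetical is_recommended flag; the real competition schema may differ.

```python
import pandas as pd

def split_like_test(history: pd.DataFrame, val_fraction: float = 0.2, seed: int = 42):
    """Random split for illustration only; in a real competition the split itself
    should also mimic the organizers' train/test split (e.g. time-based)."""
    val = history.sample(frac=val_fraction, random_state=seed)
    trn = history.drop(val.index)
    # The test set contains only recommended songs, so keep only such rows in
    # validation, while the training part may still use everything.
    val = val[val['is_recommended'] == 1]
    return trn, val
```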
Okay, let's conclude this overview of handling validation problems at the submission stage. If you have too little data in the public leaderboard, just trust your validation. If that's not the case, make sure that you did not overfit. Then check whether you made a correct train/test split, as we discussed in the previous video. And finally, check whether you have different distributions in train and test.

Great, let's move on to the next point of this video. For now, I hope you did everything right: first, you did extensive validation; second, you chose a correct splitting strategy for the train/validation split; and finally, you ensured the same distributions in validation and test. But sometimes you have to expect a leaderboard shuffle anyway, and not just for you, but for everyone. First, for those who have never heard of it, a leaderboard shuffle happens when participants' positions on the public and private leaderboards are drastically different. Take a look at this screenshot from the Two Sigma Financial Modeling Challenge competition. The green and red arrows show how far a team moved. For example, the participant who finished 3rd on the private leaderboard was 392nd on the public leaderboard.

Let's discuss three main reasons for that shuffle: randomness, too little data, and different public/private distributions. So first, randomness. This is the case when all participants have very similar scores. This can be either a very good score or a very poor one, but the main point here is that the main reason for the differences in scores is randomness. To understand this a bit more, let's go through two quick examples. The first one is the Liberty Mutual Group Property Inspection Prediction competition. In that competition, the scores of competitors were very close, and though randomness didn't play a major role there, still many people overfit to the public leaderboard. The second example, which is the opposite of the first, is the Two Sigma Financial Modeling Challenge competition. Because the financial data in that competition was highly unpredictable, randomness played a major role in it. So one could say that the leaderboard shuffle there was among the biggest shuffles on the Kaggle platform.

Okay, that was randomness. The second reason to expect a leaderboard shuffle is too little data overall, and in the private test set especially. An example of this is the Restaurant Revenue Prediction competition. In that competition, the train set consisted of fewer than 200 rows, and the test set consisted of fewer than 400 rows. So, as you can see, a shuffle there was more than expected. The last reason for a leaderboard shuffle is different distributions between the public and private test sets. This is usually the case with time series prediction, as in the Rossmann Store Sales competition. When we have a time-based split, we usually have the first few weeks in the public leaderboard, and the next few weeks in the private leaderboard. As people tend to adjust their submissions to the public leaderboard and overfit, we can expect worse results on the private leaderboard. Here again, trust your validation and everything will be fine. Okay, we have gone over the reasons for a leaderboard shuffle.

Now let's conclude both this video and the entire validation topic, starting with the video. First, if you have a big dispersion of scores at the validation stage, you should do extensive validation. That means averaging scores from different KFold splits, and tuning the model on one split while evaluating its score on another. Second, if submission scores do not match local validation scores, you should: first, check whether you have too little data in the public leaderboard; second, check that you did not overfit; then check whether you chose the correct splitting strategy; and finally, check whether train and test have different distributions. You can also expect a leaderboard shuffle because of three key things: randomness, a small amount of data, and different public/private test distributions.

So that's it. In this topic we defined validation and its connection to overfitting, described common validation strategies, demonstrated major data splitting strategies, and finally analyzed and learned how to tackle the main validation problems. Remember this, and it will absolutely help you out in competitions. Make sure you understand the main idea of validation well: you need to mimic the train/test split. [MUSIC]