[SOUND] In the previous video, we understood that validation helps us select a model which will perform best on the unseen test data. But to use validation, we first need to split the data with given labels into train and validation parts. In this video, we will discuss different validation strategies and answer the questions: how many splits should we make, and what are the most common methods to perform such splits? Loosely speaking, the main difference between these validation strategies is the number of splits being done. Here I will discuss three of them: first is holdout, second is K-fold, and third is leave-one-out.

Let's start with holdout. It's a simple data split which divides the data into two parts: a training dataframe and a validation dataframe. The important note here is that in any method, holdout included, one sample can go either to train or to validation. So the samples in train and validation do not overlap; if they do, we simply can't trust our validation. This is sometimes the case when we have repeated samples in the data. And if we do, we will get better predictions for those samples and a more optimistic quality estimate overall. It is easy to see that this can prevent us from selecting the best parameters for our model. For example, overfitting is generally bad, but if duplicated samples are present in train and validation simultaneously and we overfit, validation scores can deceive us into believing that we are moving in the right direction. Okay, that was a quick note about why samples in train and validation must not overlap. Back to holdout. Here we fit our model on the training dataframe and evaluate its quality on the validation dataframe. Using scores from this evaluation, we select the best model. When we are ready to make a submission, we can retrain our model on all the data with given labels. Thinking about using holdout in a competition: it is usually a good choice when we have enough data, and we are likely to get similar scores for the same model if we try different splits.

Great. Since we understood what holdout is, let's move on to the second validation strategy, which is called K-fold. K-fold can be viewed as a repeated holdout, because we split our data into K parts and iterate through them, using every part as a validation set only once. After this procedure, we average the scores over these K folds. Here it is important to understand the difference between K-fold and a usual holdout repeated K times. While it is possible to average the scores received from K different holdout splits, in that case some samples may never get into validation, while others can be there multiple times. The core idea of K-fold, on the other hand, is that we want to use every sample for validation only once. This method is a good choice when we have a limited amount of data, and we can get either a sufficiently big difference in quality, or different optimal parameters, between folds.

Great. Having dealt with K-fold, we can move on to the third validation strategy on our list. It is called leave-one-out, and basically it is a special case of K-fold where K is equal to the number of samples in our data. This means that we iterate through every sample in our data, each time using the remaining N minus 1 objects as the train subset and the one object left out as the validation subset. This method can be helpful if we have too little data and a fast enough model to retrain. So there they are, the three validation strategies: holdout, K-fold, and leave-one-out. Minimal code sketches of each follow below.
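To make the holdout idea concrete, here is a minimal sketch in Python. The lecture does not prescribe any library; scikit-learn's train_test_split, the LogisticRegression placeholder model, and the synthetic data are all my own assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for "data with given labels" (hypothetical).
rng = np.random.RandomState(42)
X = rng.rand(100, 5)
y = rng.randint(0, 2, size=100)

# Holdout: one split, every sample goes to exactly one side.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))

# Once the best model is selected, retrain on ALL labeled data
# before making a submission.
final_model = LogisticRegression().fit(X, y)
```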
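And a matching K-fold sketch: every sample appears in validation exactly once, and we average the K scores at the end. Again, KFold from scikit-learn and the placeholder model and data are assumptions, not part of the lecture.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 5)                     # hypothetical features
y = rng.randint(0, 2, size=100)          # hypothetical binary target

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    # Each of the K parts serves as the validation set exactly once.
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

print("mean score over K folds:", np.mean(scores))
```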
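Finally, leave-one-out, which is just K-fold with K equal to the number of samples. scikit-learn happens to provide a LeaveOneOut splitter for this; the rest is the same placeholder setup as above.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(20, 3)                      # "too little data"
y = np.tile([0, 1], 10)                  # hypothetical binary target

correct = 0
loo = LeaveOneOut()                      # same as KFold(n_splits=len(X))
for train_idx, val_idx in loo.split(X):
    # N-1 objects for training, the single left-out object for validation.
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    correct += int(model.predict(X[val_idx])[0] == y[val_idx[0]])

print("LOO accuracy:", correct / len(X))
```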
We usually use holdout or K-fold on shuffled data. By shuffling the data, we are trying to reproduce a random train-validation split. But sometimes, especially if we do not have enough samples of some class, a random split can fail. Let's consider an example. We have a binary classification task and a small dataset with eight samples: four of class 0 and four of class 1. Let's split the data into four folds. Done, but notice: we do not always get 0s and 1s in each fold in the same proportion. If we use the second fold for validation, we will get an average value of the target in train of two thirds instead of one half. This can drastically change the predictions of our model. What we need here to handle this problem is stratification. It is just a way to ensure we get a similar target distribution over the different folds. If we split the data into four folds with stratification, the average of the target values in each fold will be equal to one half. (A minimal code sketch of this example appears at the end of this video.)

It is easy to guess that the significance of this problem is higher, first, for small datasets, like in this example; second, for unbalanced datasets, which for binary classification means the target average is very close to 0 or, vice versa, very close to 1; and third, for multiclass classification tasks with a huge number of classes. For large, reasonably balanced classification datasets, a stratified split will be quite similar to a simple shuffled split, that is, to a random split.

Well done. In this video we have discussed different validation strategies and the reasons to use each one of them. Let's summarize it all. If we have enough data, and we are likely to get similar scores and optimal model parameters for different splits, we can go with holdout. If, on the contrary, scores and optimal parameters differ for different splits, we can choose the K-fold approach. And finally, if we have too little data, we can apply leave-one-out. The second big takeaway from this video for you should be stratification. It helps make validation more stable, and it is especially useful for small and unbalanced datasets. Great. In the next videos we will continue to comprehend validation at its core. [SOUND] [MUSIC]
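To close, here is a minimal sketch of the eight-sample stratification example from this video. StratifiedKFold is scikit-learn's implementation of the idea; the lecture itself names no library, so treat this as one possible rendering.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Eight samples: four of class 0 and four of class 1, as in the example.
X = np.arange(8).reshape(-1, 1)          # dummy single feature
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # Every fold holds one 0 and one 1, so the train target mean stays 1/2.
    print("fold labels:", y[val_idx],
          "train target mean:", y[train_idx].mean())
```

For the plain holdout case, train_test_split accepts a stratify=y argument that achieves the same effect for a single split.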