In the previous video, we understood that validation helps us select a model which will perform best on the unseen test data. But to use validation, we first need to split the data with given labels into train and validation parts. In this video, we will discuss different validation strategies and answer two questions: how many splits should we make, and what are the most common methods to perform such splits? Loosely speaking, the main difference between these validation strategies is the number of splits being done. Here I will discuss three of them: first is holdout, second is K-fold, and third is leave-one-out.

Let's start with holdout. It is a simple data split which divides the data into two parts: a training set and a validation set. An important note here is that in any method, holdout included, one sample can go either to train or to validation. So the samples in train and validation do not overlap; if they do, we simply can't trust our validation. This is sometimes the case when we have repeated samples in the data. If we do, we will get better predictions for these samples and a more optimistic quality estimate overall. It is easy to see that this can prevent us from selecting the best parameters for our model. For example, overfitting is generally bad, but if we have duplicated samples present in both train and validation simultaneously and we overfit, validation scores can deceive us into believing that we are moving in the right direction. Okay, that was a quick note about why samples in train and validation must not overlap.

Back to holdout. Here we fit our model on the training set and evaluate its quality on the validation set. Using the scores from this evaluation, we select the best model. When we are ready to make a submission, we can retrain our model on all the data with given labels. Thinking about using holdout in a competition: it is usually a good choice when we have enough data, or when we are likely to get similar scores for the same model if we try different splits.
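As a minimal sketch of the holdout strategy (an illustration, not code from the course; it assumes scikit-learn is available and uses made-up data and an arbitrarily chosen model), the split, evaluation, and final retraining could look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up data standing in for the labelled competition data.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# Holdout: a single split into a training part and a validation part;
# no sample appears in both.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit on the training part, evaluate on the validation part.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_valid, model.predict(X_valid)))

# Once the best model is selected, retrain it on all labelled data
# before making a submission.
model.fit(X, y)
```

The 25% validation size and the random forest are arbitrary choices here; the essential point is only that the model never sees the validation rows before it is scored on them.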
Great, since we understood what holdout is, let's move on to the second validation strategy, which is called K-fold. K-fold can be viewed as a repeated holdout, because we split our data into K parts and iterate through them, using every part as a validation set exactly once. After this procedure, we average the scores over these K folds. Here it is important to understand the difference between K-fold and a usual holdout repeated K times. While it is possible to average the scores we receive from K different holdouts, in that case some samples may never get into validation, while others can be there multiple times. The core idea of K-fold, on the other hand, is that we want to use every sample for validation exactly once. This method is a good choice when we have a limited amount of data, and we can get either a sufficiently big difference in quality or different optimal parameters between folds.

Great, having dealt with K-fold, we can move on to the third validation strategy on our list. It is called leave-one-out, and basically it is a special case of K-fold where K is equal to the number of samples in our data. This means that we iterate through every sample in our data, each time using all samples except one as the train subset and the single sample left out as the validation subset. This method can be helpful if we have too little data and a model that is fast enough to retrain.

So those are the validation strategies: holdout, K-fold, and leave-one-out. We usually use holdout or K-fold on shuffled data. By shuffling the data we are trying to reproduce a random train-validation split. But sometimes, especially if we do not have enough samples of some class, a random split can fail. Let's consider an example. We have a binary classification task and a small dataset with eight samples: four of class zero and four of class one. Let's split the data into four folds. Done, but notice that we do not always get zeros and ones in the same proportion. If we use the second fold for validation, we will get an average target value in train of two thirds instead of one half. This can drastically change the predictions of our model. What we need here to handle this problem is stratification.
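For the K-fold and leave-one-out parts above, a hedged sketch with scikit-learn (made-up data and an arbitrary logistic regression model, not code from the video) could look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X = np.random.rand(200, 5)          # made-up features
y = np.random.randint(0, 2, 200)    # made-up binary target

model = LogisticRegression()

# K-fold: every sample is used for validation exactly once;
# the final estimate is the average score over the K folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
print("K-fold mean accuracy:", cross_val_score(model, X, y, cv=kfold).mean())

# Leave-one-out: the special case K = number of samples, so the model
# is refit once per sample -- only practical for small datasets.
print("leave-one-out mean accuracy:",
      cross_val_score(model, X, y, cv=LeaveOneOut()).mean())
```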
It is just a way to ensure that we get a similar target distribution over the different folds. If we split the data into four folds with stratification, the average target value in each fold will be equal to one half. It is easy to see that the significance of this problem is higher, first, for small datasets, like in this example; second, for imbalanced datasets, which in binary classification means the target average is very close to 0 or, vice versa, very close to 1; and third, for multiclass classification tasks with a huge number of classes. For large, reasonably balanced classification datasets, a stratified split will be quite similar to a simple shuffle split, that is, to a random split.

Well done. In this video we have discussed different validation strategies and the reasons to use each one of them. Let's summarize it all. If we have enough data, and we are likely to get similar scores and optimal model parameters for different splits, we can go with holdout. If, on the contrary, scores and optimal parameters differ for different splits, we can choose the K-fold approach. And finally, if we have too little data, we can apply leave-one-out. The second big takeaway from this video should be stratification. It helps make validation more stable, and it is especially useful for small and unbalanced datasets. Great. In the next videos we will continue to explore validation in more depth.
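As a closing illustration of the stratification point, here is a sketch built on the toy eight-sample example from above (assuming scikit-learn; not code from the course), comparing plain and stratified four-fold splits:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Eight samples: four of class 0 and four of class 1, as in the example.
X = np.arange(8).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Plain K-fold: some folds contain only one class, so the average target
# in the training part drifts to two thirds (or one third) instead of one half.
for train_idx, _ in KFold(n_splits=4).split(X):
    print("plain fold,      train target mean:", y[train_idx].mean())

# Stratified K-fold keeps the 50/50 class ratio in every fold,
# so the training target mean stays at one half.
for train_idx, _ in StratifiedKFold(n_splits=4).split(X, y):
    print("stratified fold, train target mean:", y[train_idx].mean())
```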