It is not a rare case in competitions to see people dropping down the leaderboard after the private results are revealed. So we ask ourselves: what is happening out there? There are two main reasons for these jumps. First, competitors could ignore validation and select the submission which scored best on the public leaderboard. Second, sometimes competitions have no consistent public/private data split, or they have too little data in either the public or the private leaderboard. Well, we as participants can't influence how a competition is organized, but we can certainly make sure that we select our most appropriate submission to be evaluated on the private leaderboard. So the broad goal of the next videos is to give you a systematic way to set up validation in a competition and to tackle the most common validation problems.

Let's quickly overview the content of the next videos. First, in this video, we will understand the concepts of validation and overfitting. In the second video, we will identify the number of splits that should be done to establish stable validation. In the third video, we will go through the most frequent methods used to make a train/test split in competitions. In the last video, we will discuss the most common validation problems.

Now, let me explain the concept of validation for those who may never have heard of it. In a nutshell, we want to check whether the model gives the expected results on unseen data. For example, if we worked at a healthcare company whose goal is to improve the lives of patients, we could be given the task of predicting whether a patient will be diagnosed with a particular disease in the near future. Here, we need to be sure that the model we train will be applicable in the future. And not just applicable: we need to know what quality this model will have, that is, how many mistakes it makes. And depending on the predicted probability of a patient having this particular disease, we may decide to run special medical tests for the patient to clarify the diagnosis.
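As a minimal sketch of that last decision step: suppose we already have predicted probabilities for a few patients. The probabilities and the 0.5 threshold below are made up purely for illustration.

```python
import numpy as np

# Hypothetical predicted probabilities of the disease for five patients.
predicted_proba = np.array([0.05, 0.42, 0.78, 0.91, 0.13])

# Assumed decision threshold: patients above it are sent for extra medical tests.
THRESHOLD = 0.5
needs_followup = predicted_proba > THRESHOLD
print(needs_followup)  # [False False  True  True False]
```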
So, we need to correctly understand the quality of our model. But this quality can differ between the train data from the past and the unseen test data from the future. The model could just memorize all patients from the train data and be completely useless on the test data. We don't want this to happen, so we need to check the quality of the model with the data we have, and these checks are the validation. Usually, we divide the data we have into two parts: a train part and a validation part. We fit our model on the train part and check its quality on the validation part. Besides that, as in the last example, our model will be checked against unseen data in the future, and this data can actually differ from the data we have, so we should be ready for this.

In competitions, we usually have a similar situation. The organizers of a competition give us the data in two chunks: first, train data with all target values, and second, test data without target values. As in the previous example, we should split the data with labels into train and validation parts. Furthermore, to preserve the competitive spirit, the organizers split the test data into a public test set and a private test set. When we send our submissions to the platform, we see the scores for the public test set, while the scores for the private test set are released only after the end of the competition. This also ensures that we don't overfit to the test set or, in terms of the model, that it does not overfit.
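Here is a minimal sketch of that basic holdout setup, assuming a generic tabular dataset and a simple scikit-learn model; the synthetic data, the model choice, and the 70/30 split ratio are illustrative assumptions, not a prescription.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy labeled data standing in for the competition train set.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out part of the labeled data as a validation set.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the train part, check quality on the validation part.
model = LogisticRegression()
model.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_valid, model.predict(X_valid)))
```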
Let me draw an analogy with the disease prediction example. Suppose we have already divided our data into train and validation parts, and now we are repeatedly checking our model against the validation set. Some models, just by chance, will have better scores than others. If we continue to select the best models, modify them, and again select the best among them, we will see constant improvements in the score. But that doesn't mean we will see these improvements on the test data from the future. By repeating this over and over, we could simply overfit to the validation set or, in terms of a competition, we could simply overfit to the public leaderboard. But again, if we overfit, the private leaderboard will let us down. This is what we call overfitting in a competition: getting unrealistically good scores on the public leaderboard that later turn into a drop down the private leaderboard. So, we want our model to be able to capture patterns in the data, but only those patterns that generalize well between both train and test data.

Let me show you this process in terms of underfitting and overfitting. To choose the best model, we basically want to avoid underfitting on the one side and overfitting on the other. Let's understand this concept with a very simple example of a binary classification task. We will be using simple models defined by the formulas under the pictures and visualize the results of the models' predictions. On the left picture, we can see that if the model is too simple, it can't capture the underlying relationship and we will get poor results. This is called underfitting. Then, if we want our results to improve, we can increase the complexity of the model, and we will probably find that the error on the training data is going down. But on the other hand, if we make too complicated a model, like on the right picture, it will start describing noise in the train data that doesn't generalize to the test data, and this will lead to a decrease in the model's quality. This is called overfitting. So, we want something in between underfitting and overfitting, and for the purpose of choosing the most suitable model, we want to be able to evaluate our results.

Here, we need to make a remark: the meaning of overfitting in machine learning in general and the meaning of overfitting in competitions in particular are slightly different. In general, we say that the model is overfitted if its quality on the train set is better than on the test set. But in competitions, we often say that the model is overfitted only when the quality on the test set turns out to be worse than we expected. For example, suppose we train a gradient boosting decision tree in a competition where the metric is area under the ROC curve. We can sometimes observe that the quality on the training data is close to one, while on the test data it is lower, for example, near 0.9.
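A minimal sketch of this situation, assuming synthetic data and scikit-learn's gradient boosting; the dataset, parameters, and exact numbers are illustrative, and the point is only the gap between train and validation AUC.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data with some label noise.
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier(n_estimators=300, max_depth=5, random_state=0)
model.fit(X_train, y_train)

train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
valid_auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
# Train AUC is typically close to 1 while validation AUC is lower (say, around 0.9):
# overfitting in the general sense, but not necessarily in the competition sense,
# as long as the validation score matches the public/private test scores.
print(train_auc, valid_auc)
```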
In the general sense, the model is overfitted here, but as long as we get an area under the curve of 0.9 on both the validation set and the public/private test sets, we will not say that it is overfitted in the context of the competition.

Let me illustrate this concept again in a slightly different way. Let's say that, for the purpose of model evaluation, we divided our data into two parts, a train part and a validation part, like we already did. We will vary the model's complexity from low to high and look at the models' errors. Note that we usually understand error, or loss, as something which is the opposite of the model's quality, or score. In the figure, the dependency looks pretty reasonable. For too simple models, we have underfitting, which means a high error on both train and validation. For too complex models, we have overfitting, which means a low error on train but, again, a high error on validation. In the middle, between them, is the model with the optimal complexity: it has the lowest error on the validation data, and thus we expect it to have the lowest error on the unseen test data. Note that here the training error is always lower than the validation error, which implies overfitting in the general sense, but not in the context of competitions.

Well done. In this video, we defined validation, demonstrated its purpose, and interpreted validation in terms of underfitting and overfitting. So, once again: in general, validation helps us answer the question of what the quality of our model will be on the unseen data, and it helps us select the model which is expected to get the best quality on that test data. Usually, we are trying to avoid underfitting on the one side, that is, we want our model to be expressive enough to capture the patterns in the data. And we are trying to avoid overfitting on the other side, that is, we don't want to make too complex a model, because in that case we will start to capture noise or patterns that don't generalize to the test data.
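As a closing illustration of the complexity-versus-error picture described in this video, here is a minimal sketch that sweeps the depth of a decision tree, using depth as a rough stand-in for model complexity; the dataset and the depth grid are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise, split into train and validation parts.
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

# From very simple to very complex: train error keeps falling,
# while validation error typically falls and then rises again.
for depth in [1, 2, 4, 8, 16, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)
    valid_err = 1 - tree.score(X_valid, y_valid)
    print(f"max_depth={depth}: train error {train_err:.3f}, validation error {valid_err:.3f}")
```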