In this lesson, we'll discuss how to validate models: how to check whether they generalize, whether they work well not only on the training set but also on new data. We'll start with a discussion of the overfitting problem.

Suppose we have a classification problem and we've just trained a classifier with an accuracy of 80%, so it gives correct answers on 80% of our training data. That looks good, but actually we have no guarantee that the model will work well on new data. Maybe it's overfitted: maybe it just memorized the answers for the training set and doesn't generalize at all.

Let's consider two examples of overfitting. Suppose each example is described by only one feature, and the data looks like this. The green line is the true target function that we want to estimate. If we fit a linear regression model, it looks like the blue line. It's underfitted and fits the data poorly: the model is too simple, because the dependency between y and x is not linear.

To overcome this, we can use a polynomial model. We add features to our examples: not only x, but also x squared, x cubed, and x to the 4th power. If we fit this model, we get the blue line in this picture. It's a very good model; it fits the true target function almost perfectly. Here we have just as many parameters as we need, so this is a nice model. But if we keep increasing the number of features, for example up to x to the 15th power, we get this blue model instead. It's too complex for our data and overfitted: it may have good performance on the training examples, but it performs very poorly on new data points.
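Here is a minimal sketch of this effect in Python. The data, the noise level, and the true function are my own illustrative assumptions, not the exact setup from the slides; the point is only that as the polynomial degree grows, the training error keeps shrinking while the error on held-out points eventually gets worse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D regression data (assumed for this sketch, not the lecture's numbers):
# a nonlinear target with a little Gaussian noise.
def target(x):
    return np.sin(2.0 * np.pi * x)

x_train = rng.uniform(0.0, 1.0, size=20)
y_train = target(x_train) + rng.normal(0.0, 0.1, size=x_train.shape)
x_test = rng.uniform(0.0, 1.0, size=200)          # "new data" the model never saw
y_test = target(x_test) + rng.normal(0.0, 0.1, size=x_test.shape)

for degree in (1, 4, 15):                          # underfit, about right, overfit
    # numpy may warn that the degree-15 fit is poorly conditioned -- that is part of the point.
    coeffs = np.polyfit(x_train, y_train, deg=degree)

    def mse(x, y):
        return np.mean((np.polyval(coeffs, x) - y) ** 2)

    print(f"degree {degree:2d}: train MSE = {mse(x_train, y_train):.3f}, "
          f"test MSE = {mse(x_test, y_test):.3f}")
```

Typically the degree-1 fit is poor everywhere, the degree-4 fit is reasonable on both sets, and the degree-15 fit chases the noise: tiny training error, much larger test error.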
Here is another example. Suppose we have eight data points, x = 0.2, 0.4, and so on up to 1.6, and the target value is computed as sin(x) plus some small normal noise. Once again we use a polynomial model, this time with eight parameters plus a bias. If we fit this model, we see a picture like this one: the model goes through every training example, through every blue point. So it gives perfect predictions on the training set; we have, for example, zero loss on the training set. But this model is overfitted: if we take any data point not from the training set, the quality will be very poor. And if we look at the parameter vector, the values are very large. The range of the target is from 0 to 1, but the values of the parameters are in the hundreds. The model has essentially incorporated the target values into its parameters, so when we apply it to the training points we get the correct answers.

How can we validate our model and check whether it's overfitted or not? We can take all our labeled examples and split them into two parts, a training set and a holdout set. We use the training set to learn our model, a classifier or a regression model, and we use the holdout set to measure its quality: for example, accuracy or cross-entropy for classification, or mean squared error for regression. If the loss on the holdout set is not very high, the model is good; but if the loss is much higher on the holdout set, the model may be overfitted.

Of course, there is the question of how to split the data into the two parts: should the training set be large, or the holdout set? If the holdout set is small, the training set is representative, since it contains almost all the data points from our labeled set; but because the holdout set is so small, the quality estimate based on it may have large variance. If instead we choose a large holdout set and a small training set, the training set is not representative: it contains far fewer data points than we will have in practice, so our quality estimate will be biased. But since the holdout set is large, that estimate will have low variance. In practice, we usually put 70% of the data into the training set and 30% into the holdout set, or maybe 80% and 20%.
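A minimal sketch of this holdout protocol with scikit-learn; the synthetic dataset and the logistic-regression model are placeholders assumed for illustration, not anything from the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data and model, just to show the protocol.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 70% for training, 30% held out for validation (80/20 is also common).
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Quality on data the model was trained on vs. data it has never seen.
print("train accuracy:  ", accuracy_score(y_train, model.predict(X_train)))
print("holdout accuracy:", accuracy_score(y_hold, model.predict(X_hold)))
```

A large gap between the two numbers is the practical sign of overfitting discussed above.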
There are some problems with this holdout approach. For example, if the sample is small, we would like to see what happens when each example is in the training set, and what happens when that same example is in the holdout set. To achieve this, we could split the data into a training set and a holdout set K times and then average the estimates from all the holdout sets. But even then there is no guarantee that every example appears both in the training set and in the holdout set across the splits.

A much better way to do this is cross-validation. We split the data into K blocks of approximately equal size, called folds. We take the first fold as the holdout set and all the other folds as the training set; we train a model, validate it, and compute the metric on that first fold. Then we use the second fold as the holdout set and repeat the procedure, and so on; at the last step, the last fold is the holdout set and all the other folds form the training set. Finally, we average the estimates from all iterations of this procedure. Cross-validation guarantees that each example appears both in the holdout set and in the training set at some iteration (see the short code sketch at the end of this lesson).

But cross-validation is quite expensive, because it requires training the model K times. If we are talking about deep neural networks, training one network can take one, two, or four weeks on several GPUs, so training it five or ten times is hardly feasible. So in deep learning we usually use just a holdout set. That's fine, because we usually work with large samples, where even a single holdout set is representative, so there is no need for multiple holdout sets.

In this video we discussed how easily models can overfit when they have enough parameters to do so, and we discussed some ways to assess model quality and validate models: a holdout set and cross-validation. In the next video, we'll discuss how to modify the training procedure so that our models cannot overfit, that is, how to reduce their complexity.
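Here is the cross-validation sketch referred to above, again with a placeholder dataset and model assumed purely for illustration; cross_val_score runs the K train/validate rounds, and we average the per-fold scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Placeholder data and model, as in the holdout sketch.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# K = 5 folds: each fold serves as the holdout set exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("per-fold accuracy:", np.round(scores, 3))
print("mean accuracy:    ", scores.mean())
```

The mean of the per-fold scores is the cross-validation estimate of quality; the cost is that the model is trained K times, which is why plain holdout validation is usually preferred for deep networks.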