In this lesson, we'll discuss how to validate models: how to check whether they generalize, whether they work well not only on the training set but also on new data. We'll start with a discussion of the overfitting problem.

Suppose we have a classification problem and we've just trained a classifier with an accuracy of 80%, so it gives correct answers on 80% of our training data. That looks good, but actually we have no guarantee that the model will work well on new data. Maybe it's overfitted: maybe it just memorized the answers for the training set and doesn't generalize at all.

Let's consider two examples of overfitting. Suppose each example is described by only one feature, and the data looks like this. The green line is the true target function that we want to estimate. If we fit a linear regression model, it looks like the blue line. It's underfitted and fits the data poorly: the model is too simple, because the dependency between y and x is not linear.

To overcome this, we can use a polynomial model. We add features to our examples: not only x, but also x squared, x cubed, and x to the 4th power. If we fit this model, we get the blue line in this picture. It's a very good model; it fits the true target function almost perfectly. Here we have just as many parameters as we need, so this is a nice model. But if we keep increasing the number of features, for example up to x to the 15th power, we get this blue model instead. It's too complex for our data and overfitted: it may have good performance on the training examples, but it performs very poorly on new data points.
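Here is a minimal sketch of this effect in Python. The data, the noise level, and the true function are my own illustrative assumptions, not the exact setup from the slides; the point is only that as the polynomial degree grows, the training error keeps shrinking while the error on held-out points eventually gets worse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D regression data (assumed for this sketch, not the lecture's numbers):
# a nonlinear target with a little Gaussian noise.
def target(x):
    return np.sin(2.0 * np.pi * x)

x_train = rng.uniform(0.0, 1.0, size=20)
y_train = target(x_train) + rng.normal(0.0, 0.1, size=x_train.shape)
x_test = rng.uniform(0.0, 1.0, size=200)          # "new data" the model never saw
y_test = target(x_test) + rng.normal(0.0, 0.1, size=x_test.shape)

for degree in (1, 4, 15):                          # underfit, about right, overfit
    # numpy may warn that the degree-15 fit is poorly conditioned -- that is part of the point.
    coeffs = np.polyfit(x_train, y_train, deg=degree)

    def mse(x, y):
        return np.mean((np.polyval(coeffs, x) - y) ** 2)

    print(f"degree {degree:2d}: train MSE = {mse(x_train, y_train):.3f}, "
          f"test MSE = {mse(x_test, y_test):.3f}")
```

Typically the degree-1 fit is poor everywhere, the degree-4 fit is reasonable on both sets, and the degree-15 fit chases the noise: tiny training error, much larger test error.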
Here is another example. Suppose we have eight data points, x = 0.2, 0.4, and so on up to 1.6, and the target value is computed as sin(x) plus some small normal noise. Once again we use a polynomial model, this time with eight parameters plus a bias. If we fit this model, we see a picture like this one: the model goes through every training example, through every blue point. So it gives perfect predictions on the training set; we have, for example, zero loss on the training set. But this model is overfitted: if we take any data point not from the training set, the quality will be very poor. And if we look at the parameter vector, the values are very large. The range of the target is from 0 to 1, but the values of the parameters are in the hundreds. The model has essentially incorporated the target values into its parameters, so when we apply it to the training points we get the correct answers.

How can we validate our model and check whether it's overfitted or not? We can take all our labeled examples and split them into two parts, a training set and a holdout set. We use the training set to learn our model, a classifier or a regression model, and we use the holdout set to measure its quality: for example, accuracy or cross-entropy for classification, or mean squared error for regression. If the loss on the holdout set is not very high, the model is good; but if the loss is much higher on the holdout set, the model may be overfitted.

Of course, there is the question of how to split the data into the two parts: should the training set be large, or the holdout set? If the holdout set is small, the training set is representative, since it contains almost all the data points from our labeled set; but because the holdout set is so small, the quality estimate based on it may have large variance. If instead we choose a large holdout set and a small training set, the training set is not representative: it contains far fewer data points than we will have in practice, so our quality estimate will be biased. But since the holdout set is large, that estimate will have low variance. In practice, we usually put 70% of the data into the training set and 30% into the holdout set, or maybe 80% and 20%.
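A minimal sketch of this holdout protocol with scikit-learn; the synthetic dataset and the logistic-regression model are placeholders assumed for illustration, not anything from the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data and model, just to show the protocol.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 70% for training, 30% held out for validation (80/20 is also common).
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Quality on data the model was trained on vs. data it has never seen.
print("train accuracy:  ", accuracy_score(y_train, model.predict(X_train)))
print("holdout accuracy:", accuracy_score(y_hold, model.predict(X_hold)))
```

A large gap between the two numbers is the practical sign of overfitting discussed above.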
There are some problems with this holdout approach. For example, if the sample is small, we would like to see what happens when each example is in the training set, and what happens when that same example is in the holdout set. To achieve this, we could split the data into a training set and a holdout set K times and then average the estimates from all the holdout sets. But even then there is no guarantee that every example appears both in the training set and in the holdout set across the splits.

A much better way to do this is cross-validation. We split the data into K blocks of approximately equal size, called folds. We take the first fold as the holdout set and all the other folds as the training set; we train a model, validate it, and compute the metric on that first fold. Then we use the second fold as the holdout set and repeat the procedure, and so on; at the last step, the last fold is the holdout set and all the other folds form the training set. Finally, we average the estimates from all iterations of this procedure. Cross-validation guarantees that each example appears both in the holdout set and in the training set at some iteration (see the short code sketch at the end of this lesson).

But cross-validation is quite expensive, because it requires training the model K times. If we are talking about deep neural networks, training one network can take one, two, or four weeks on several GPUs, so training it five or ten times is hardly feasible. So in deep learning we usually use just a holdout set. That's fine, because we usually work with large samples, where even a single holdout set is representative, so there is no need for multiple holdout sets.

In this video we discussed how easily models can overfit when they have enough parameters to do so, and we discussed some ways to assess model quality and validate models: a holdout set and cross-validation. In the next video, we'll discuss how to modify the training procedure so that our models cannot overfit, that is, how to reduce their complexity.
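Here is the cross-validation sketch referred to above, again with a placeholder dataset and model assumed purely for illustration; cross_val_score runs the K train/validate rounds, and we average the per-fold scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Placeholder data and model, as in the holdout sketch.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# K = 5 folds: each fold serves as the holdout set exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("per-fold accuracy:", np.round(scores, 3))
print("mean accuracy:    ", scores.mean())
```

The mean of the per-fold scores is the cross-validation estimate of quality; the cost is that the model is trained K times, which is why plain holdout validation is usually preferred for deep networks.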