1
00:00:00,000 --> 00:00:01,370
>> In the last video,

2
00:00:01,370 --> 00:00:05,445
you saw how your dev and test sets should come from the same distribution,

3
00:00:05,445 --> 00:00:07,370
but how big should they be?

4
00:00:07,370 --> 00:00:08,780
The guidelines to help set up

5
00:00:08,780 --> 00:00:11,955
your dev and test sets are changing in the Deep Learning era.

6
00:00:11,955 --> 00:00:14,645
Let's take a look at some best practices.

7
00:00:14,645 --> 00:00:17,870
You might have heard of the rule of thumb

8
00:00:17,870 --> 00:00:20,489
in machine learning of taking all the data you have

9
00:00:20,489 --> 00:00:26,495
and using a 70/30 split into a train and test set,

10
00:00:26,495 --> 00:00:30,800
or if you had to set up train, dev, and test sets, maybe

11
00:00:30,800 --> 00:00:42,705
you would use 60% training, 20% dev, and 20% test.

12
00:00:42,705 --> 00:00:47,200
In earlier eras of machine learning,

13
00:00:47,200 --> 00:00:50,155
this was pretty reasonable,

14
00:00:50,155 --> 00:00:54,550
especially back when data set sizes were just smaller.

15
00:00:54,550 --> 00:00:57,085
So if you had a hundred examples in total,

16
00:00:57,085 --> 00:01:03,555
this 70/30 or 60/20/20 rule of thumb would be pretty reasonable.

17
00:01:03,555 --> 00:01:05,485
If you had a thousand examples,

18
00:01:05,485 --> 00:01:09,070
maybe if you had ten thousand examples,

19
00:01:09,070 --> 00:01:13,070
these things are not unreasonable.

20
00:01:13,070 --> 00:01:16,255
But in the modern machine learning era,

21
00:01:16,255 --> 00:01:20,310
we are now used to working with much larger data set sizes.

22
00:01:20,310 --> 00:01:26,430
So let's say you have a million training examples,

23
00:01:26,430 --> 00:01:29,490
it might be quite reasonable to set up

24
00:01:29,490 --> 00:01:33,810
your data so that you have 98% in the training set,

25
00:01:33,810 --> 00:01:40,437
1% dev, and 1% test.

26
00:01:40,437 --> 00:01:44,590
And we'll use D and T to abbreviate dev and test sets.

27
00:01:44,590 --> 00:01:46,710
Because if you have a million examples,

28
00:01:46,710 --> 00:01:48,285
then 1% of that

29
00:01:48,285 --> 00:01:54,800
is 10,000 examples, and that might be plenty for a dev set or for a test set.

30
00:01:54,800 --> 00:02:00,255
So, in the modern Deep Learning era where sometimes we have much larger data sets,

31
00:02:00,255 --> 00:02:04,020
it's quite reasonable to use much less than 20

32
00:02:04,020 --> 00:02:07,785
or 30% of your data for a dev set or a test set.

33
00:02:07,785 --> 00:02:12,690
And because Deep Learning algorithms have such a huge hunger for data, I'm seeing that,

34
00:02:12,690 --> 00:02:16,020
for problems where we have large data sets,

35
00:02:16,020 --> 00:02:20,430
a much larger fraction of the data goes into the training set.

36
00:02:20,430 --> 00:02:24,447
So, how about the test set?

37
00:02:24,447 --> 00:02:28,930
Remember the purpose of your test set is that,

38
00:02:28,930 --> 00:02:30,865
after you finish developing a system,

39
00:02:30,865 --> 00:02:34,360
the test set helps evaluate how good your final system is.

40
00:02:34,360 --> 00:02:37,690
The guideline is to set your test set to be big enough to give

41
00:02:37,690 --> 00:02:41,150
high confidence in the overall performance of your system.

42
00:02:41,150 --> 00:02:43,690
So, unless you need to have

43
00:02:43,690 --> 00:02:48,090
a very accurate measure of how well your final system is performing,

44
00:02:48,090 --> 00:02:54,059
maybe you don't need millions and millions of examples in a test set,

45
00:02:54,059 --> 00:02:57,640
and maybe for your application if you think that having 10,000 examples gives you

46
00:02:57,640 --> 00:03:00,545
enough confidence in the measured performance, or maybe

47
00:03:00,545 --> 00:03:03,725
100,000 or whatever it is, that might be enough.

48
00:03:03,725 --> 00:03:05,260
And this could be much less than,

49
00:03:05,260 --> 00:03:07,340
say 30% of the overall data set,

50
00:03:07,340 --> 00:03:08,440
depending on how much data you have.

51
00:03:08,440 --> 00:03:13,250
For some applications,

52
00:03:13,250 --> 00:03:18,320
maybe you don't need a high confidence in the overall performance of your final system.

53
00:03:18,320 --> 00:03:23,055
Maybe all you need is a train and dev set,

54
00:03:23,055 --> 00:03:29,230
and I think not having a test set might be okay.

55
00:03:29,230 --> 00:03:31,685
In fact, what sometimes happened was,

56
00:03:31,685 --> 00:03:33,965
people were talking about using

57
00:03:33,965 --> 00:03:40,580
train/test splits, but what they were actually doing was iterating on the test set.

58
00:03:40,580 --> 00:03:42,250
So rather than a test set,

59
00:03:42,250 --> 00:03:46,415
what they had was a train dev split and no test set.

60
00:03:46,415 --> 00:03:48,604
If you're actually tuning to this set,

61
00:03:48,604 --> 00:03:50,390
to this dev set and this test set,

62
00:03:50,390 --> 00:03:53,205
it's better to call it the dev set.

63
00:03:53,205 --> 00:03:56,335
Although I think in the history of machine learning,

64
00:03:56,335 --> 00:03:59,875
not everyone has been completely clean and completely rigorous

65
00:03:59,875 --> 00:04:03,895
about calling it the dev set when it really should be treated as the dev set.

66
00:04:03,895 --> 00:04:07,485
But, if all you care about is having some data that you train on,

67
00:04:07,485 --> 00:04:09,150
and having some data to tune to,

68
00:04:09,150 --> 00:04:11,682
and you're just going to ship the final system

69
00:04:11,682 --> 00:04:15,710
and not worry too much about how it was actually doing,

70
00:04:15,710 --> 00:04:17,940
I think it would be healthy to just call it a train/dev set

71
00:04:17,940 --> 00:04:20,700
and acknowledge that you have no test set.

72
00:04:20,700 --> 00:04:22,720
This is a bit unusual.

73
00:04:22,720 --> 00:04:26,970
I'm definitely not recommending not having a test set when building a system.

74
00:04:26,970 --> 00:04:30,225
I do find it reassuring to have a separate test set

75
00:04:30,225 --> 00:04:33,900
you can use to get an unbiased estimate of how well you're doing before you ship it,

76
00:04:33,900 --> 00:04:37,770
but if you have a very large dev

77
00:04:37,770 --> 00:04:41,650
set so that you think you won't overfit the dev set too badly,

78
00:04:41,650 --> 00:04:45,200
maybe it's not totally unreasonable to just have a train/dev set,

79
00:04:45,200 --> 00:04:48,800
although it's not what I usually recommend.

80
00:04:48,800 --> 00:04:51,600
So to summarize, in the era of big data,

81
00:04:51,600 --> 00:04:54,500
I think the old rule of thumb of a 70/30 split

82
00:04:54,500 --> 00:04:56,275
no longer applies.

83
00:04:56,275 --> 00:05:01,035
And the trend has been to use more data for training and less for dev and test,

84
00:05:01,035 --> 00:05:03,220
especially when you have very large data sets.

85
00:05:03,220 --> 00:05:06,960
And the rule of thumb is really to try to set the dev set to be big enough for its purpose,

86
00:05:06,960 --> 00:05:11,110
which is to help you evaluate different ideas and pick whether A or B is better.

87
00:05:11,110 --> 00:05:15,450
And the purpose of the test set is to help you evaluate your final classifier.

88
00:05:15,450 --> 00:05:18,590
You just have to set your test set big enough for that purpose,

89
00:05:18,590 --> 00:05:21,710
and that could be much less than 30% of the data.

90
00:05:21,710 --> 00:05:24,810
So, I hope that gives some guidance or some suggestions on how to

91
00:05:24,810 --> 00:05:28,710
set up your dev and test sets in the Deep Learning era.

92
00:05:28,710 --> 00:05:30,595
Next, it turns out that sometimes,

93
00:05:30,595 --> 00:05:32,640
part way through a machine learning problem,

94
00:05:32,640 --> 00:05:34,800
you might want to change your evaluation metric,

95
00:05:34,800 --> 00:05:36,615
or change your dev and test sets.

96
00:05:36,615 --> 00:05:38,250
Let's talk about when you might want to do