1 00:00:00,000 --> 00:00:04,754 [MUSIC] 2 00:00:04,754 --> 00:00:10,350 Often, the best way to prevent overfitting in decision trees is to use pruning, 3 00:00:10,350 --> 00:00:13,260 where you grow the tree to be much bigger than you need, 4 00:00:13,260 --> 00:00:17,140 and then you chop off the parts that turn out to be less important. 5 00:00:18,270 --> 00:00:21,640 Now, pruning is very important if you're implementing 6 00:00:21,640 --> 00:00:25,110 a real-world decision tree algorithm. 7 00:00:25,110 --> 00:00:28,530 However, the details can get a little tedious. 8 00:00:28,530 --> 00:00:32,770 Not too mathematical, just a lot of little things to get right, 9 00:00:32,770 --> 00:00:35,570 a lot of Ts to cross and Is to dot. 10 00:00:35,570 --> 00:00:38,260 And so we've made this section optional. 11 00:00:38,260 --> 00:00:40,450 You're welcome to watch it. 12 00:00:40,450 --> 00:00:43,790 Not a lot of math, just a lot of detail. 13 00:00:43,790 --> 00:00:47,080 And so I would recommend you pay especially close attention 14 00:00:47,080 --> 00:00:50,330 if you're implementing your own decision tree learning from scratch. 15 00:00:50,330 --> 00:00:53,750 But if not, just know what pruning is; we just talked about it. 16 00:00:53,750 --> 00:00:56,000 And know that most implementations, 17 00:00:56,000 --> 00:00:59,240 most great implementations out there, will use some kind of pruning. 18 00:01:00,520 --> 00:01:03,920 We discussed the recursion where we build the decision tree from 19 00:01:03,920 --> 00:01:09,110 the top down, and we discussed two types of stopping conditions: 20 00:01:09,110 --> 00:01:11,520 the stopping conditions that we should always use, 21 00:01:11,520 --> 00:01:15,380 from the previous module, and the three early stopping conditions from this module.
22 00:01:15,380 --> 00:01:19,370 Now let's explore some of the challenges that come about 23 00:01:19,370 --> 00:01:22,230 with these early stopping conditions. 24 00:01:22,230 --> 00:01:25,500 Early stopping condition one was, let's pick a max depth, 25 00:01:25,500 --> 00:01:28,430 and stop recursing once you reach that max depth. 26 00:01:28,430 --> 00:01:32,760 The question is, how do you really pick the max depth? 27 00:01:33,810 --> 00:01:37,320 We said that you should use cross validation or a validation set, but 28 00:01:37,320 --> 00:01:41,760 there can be challenges with setting data aside for that: you have to rerun your 29 00:01:41,760 --> 00:01:47,910 algorithm many times, and you might not always reach the answer you'd like. 30 00:01:47,910 --> 00:01:51,390 And plus, you might want some parts of the tree to grow more than others. 31 00:01:51,390 --> 00:01:54,640 It becomes really quite complicated to set this parameter. 32 00:01:54,640 --> 00:01:57,940 Now, early stopping condition two is the most dangerous one. 33 00:01:57,940 --> 00:02:04,880 Early stopping condition two says stop if the training error stops decreasing. 34 00:02:04,880 --> 00:02:09,680 So if the classification error for the training data stops going down, if it 35 00:02:09,680 --> 00:02:10,600 flattens out. 36 00:02:10,600 --> 00:02:13,150 However, it turns out that in practice 37 00:02:13,150 --> 00:02:15,420 the curve might not be as beautiful as what I drew. 38 00:02:15,420 --> 00:02:18,250 It might not be smoothly going down; it might jump and 39 00:02:18,250 --> 00:02:20,100 it might do some weird stuff. 40 00:02:20,100 --> 00:02:21,770 Let's see why that can come about. 41 00:02:23,110 --> 00:02:26,300 As we discussed, early stopping condition two, 42 00:02:26,300 --> 00:02:28,710 stop when error doesn't decrease, can be dangerous.
43 00:02:28,710 --> 00:02:32,470 And to show how it's dangerous, we need a counterexample, and we'll use the 44 00:02:32,470 --> 00:02:38,680 counterexample for everything, which, if you took course one, you know is xor. 45 00:02:39,720 --> 00:02:42,720 So, here's our data, and 46 00:02:42,720 --> 00:02:48,640 this data corresponds to the xor of the first input and the second input. 47 00:02:48,640 --> 00:02:52,040 So, when they're both false the output is false, and 48 00:02:52,040 --> 00:02:55,860 when one is false and the other is true the output is true, which happens in the two 49 00:02:55,860 --> 00:03:00,910 middle rows here, and then when they're both true the output is false. 50 00:03:00,910 --> 00:03:04,810 If you ever need a counterexample for anything, just say xor. 51 00:03:04,810 --> 00:03:07,240 In fact, if someone calls you in the middle of the night and 52 00:03:07,240 --> 00:03:11,370 asks, what is the counterexample for this crazy problem I'm having? 53 00:03:11,370 --> 00:03:13,280 You say, xor. 54 00:03:14,920 --> 00:03:20,510 So what happens with this xor when we just look at the root of the decision tree? 55 00:03:20,510 --> 00:03:28,030 Well, we have two rows with output true, and two rows with output false. 56 00:03:28,030 --> 00:03:33,230 And so what is the error of this decision tree if we just stop at the top level? 57 00:03:33,230 --> 00:03:37,540 Well, you just choose one of these, because they're both the same; let's say we choose 58 00:03:37,540 --> 00:03:42,350 true, because we're optimists, we like to think positively. 59 00:03:42,350 --> 00:03:46,990 And so we're going to say that the output here is +1. 60 00:03:46,990 --> 00:03:52,150 And so that means that we're making two mistakes. 61 00:03:53,940 --> 00:03:58,519 Two mistakes, so two out of four datapoints, so 62 00:03:58,519 --> 00:04:02,440 our error is 0.5, which is bad. 63 00:04:02,440 --> 00:04:07,496 That means that you are as good as random, with error 0.5.
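As a quick sanity check on that arithmetic, here's a tiny Python sketch (the four-row table and the helper name are my own illustration, not the course's code) computing the majority-vote error at the root of the tree:

```python
# xor truth table: ((x1, x2), y). Illustrative sketch, not course code.
data = [((False, False), False),
        ((False, True),  True),
        ((True,  False), True),
        ((True,  True),  False)]

def majority_error(rows):
    """Error of a leaf that predicts the majority label (ties broken as true)."""
    labels = [y for _, y in rows]
    return 1 - max(labels.count(True), labels.count(False)) / len(labels)

# Two of the four rows are true, so predicting +1 everywhere
# misclassifies the other two: error 0.5, as good as random.
print(majority_error(data))  # 0.5
```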
64 00:04:07,496 --> 00:04:10,656 Okay, let's see what happens if we split on x[1]. 65 00:04:10,656 --> 00:04:16,370 Well, if we split on x[1], we are actually 66 00:04:16,370 --> 00:04:21,110 splitting the table like this, and you see that, because xor is so beautiful, 67 00:04:21,110 --> 00:04:24,790 again, for each one of these subsequent nodes, we're in the same situation. 68 00:04:24,790 --> 00:04:31,170 So, for example, for x[1] equals true, you have one y true and one y false. 69 00:04:31,170 --> 00:04:35,920 So let's suppose that you predict plus one for this side, and also plus one for 70 00:04:35,920 --> 00:04:41,540 the side where x[1] is false, because it's all even stevens. 71 00:04:41,540 --> 00:04:42,325 So what's the error? 72 00:04:42,325 --> 00:04:47,920 Well, we made mistakes, mistakes were made. 73 00:04:47,920 --> 00:04:52,698 Mistakes were made here and here, so we made 1 + 1 74 00:04:52,698 --> 00:04:57,487 mistakes, which is 2 out of 4, so it's 0.5. 75 00:04:57,487 --> 00:05:01,160 So it turns out that splitting on x[1] does not help. 76 00:05:02,710 --> 00:05:08,100 Okay, and see what happens when you try to split on x[2]? 77 00:05:08,100 --> 00:05:10,960 So if I try to split on x[2], I end up with 78 00:05:10,960 --> 00:05:15,480 the same situation, except I change this label here from x[1] to x[2]. 79 00:05:15,480 --> 00:05:20,810 But again, we'll be in exactly the same situation, 80 00:05:20,810 --> 00:05:26,350 where we're going to output +1 on each one of these outcomes, 81 00:05:26,350 --> 00:05:28,760 and make the same number of mistakes. 82 00:05:28,760 --> 00:05:35,727 So one mistake for each, which gives me a total error of 0.5. 83 00:05:35,727 --> 00:05:39,278 And so, all in all, 84 00:05:39,278 --> 00:05:43,940 neither splitting on x[1] nor on x[2] 85 00:05:45,040 --> 00:05:47,590 helps me reduce the error. 86 00:05:47,590 --> 00:05:50,720 So both splits give you the same error.
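The arithmetic above can be sketched in a few lines of Python (the four-row table and the helper names are my own, not the course's code): the majority-vote error at the root, and after splitting on either feature alone, all come out to 0.5.

```python
# xor truth table: ((x1, x2), y). Illustrative sketch, not course code.
data = [((False, False), False),
        ((False, True),  True),
        ((True,  False), True),
        ((True,  True),  False)]

def majority_error(rows):
    """Error of a leaf predicting the majority label (ties broken as true)."""
    labels = [y for _, y in rows]
    return 1 - max(labels.count(True), labels.count(False)) / len(labels)

def split_error(rows, feature):
    """Total error after splitting on one feature, majority vote per branch."""
    true_side  = [r for r in rows if r[0][feature]]
    false_side = [r for r in rows if not r[0][feature]]
    mistakes = (majority_error(true_side) * len(true_side)
                + majority_error(false_side) * len(false_side))
    return mistakes / len(rows)

# Neither a split on x[1] nor on x[2] improves on the root's error.
print(majority_error(data), split_error(data, 0), split_error(data, 1))  # 0.5 0.5 0.5
```

So early stopping condition two would halt at the root here, which is exactly the trap the lecture is describing.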
87 00:05:52,570 --> 00:05:54,710 So the question is, should I stop now? 88 00:05:54,710 --> 00:05:59,510 If neither feature improves the error, and you're using early stopping condition 89 00:05:59,510 --> 00:06:04,460 two, then you just say no, don't do it, don't take either of the splits. 90 00:06:04,460 --> 00:06:07,730 You just stop splitting the decision tree. 91 00:06:07,730 --> 00:06:13,950 At the very, very root you output true, make two mistakes, 92 00:06:13,950 --> 00:06:19,115 and say that the error that you get in the end is 0.5. 93 00:06:19,115 --> 00:06:25,000 So, [LAUGH] this is why xor is a counterexample to everything. 94 00:06:25,000 --> 00:06:28,422 What would happen if you didn't take early stopping condition two and 95 00:06:28,422 --> 00:06:29,470 you just kept going? 96 00:06:31,699 --> 00:06:36,580 If you just kept going, you'd get this tree that you see here. 97 00:06:36,580 --> 00:06:41,680 And if you look, for each one of these leaves, there are zero mistakes. 98 00:06:41,680 --> 00:06:47,680 So the leaf on the left outputs false, or minus one, 99 00:06:47,680 --> 00:06:51,200 and makes zero mistakes; the next one makes zero mistakes. 100 00:06:51,200 --> 00:06:54,510 The third one makes zero mistakes, and the last one makes zero mistakes. 101 00:06:54,510 --> 00:06:56,100 So, if you kept going, 102 00:06:56,100 --> 00:07:00,979 if you didn't use early stopping condition two, 103 00:07:00,979 --> 00:07:04,480 you would get a training error of zero. 104 00:07:04,480 --> 00:07:08,908 And so, in this case, there is a huge gap, 105 00:07:08,908 --> 00:07:13,582 a huge gap, between a training error of 0.5, 106 00:07:13,582 --> 00:07:19,363 which is as good as random, and a training error of zero, 107 00:07:19,363 --> 00:07:22,930 which is perfect training error. 108 00:07:24,540 --> 00:07:27,520 So in this case, early stopping condition two 109 00:07:28,740 --> 00:07:33,610 created this arbitrarily large gap between the best you could have done.
110 00:07:33,610 --> 00:07:34,880 And what you end up outputting. 111 00:07:37,170 --> 00:07:39,590 So we saw early stopping condition two: 112 00:07:39,590 --> 00:07:43,980 stop recursing if the training error does not decrease enough. 113 00:07:43,980 --> 00:07:50,090 It seems like a reasonable heuristic, but it has some dangerous pitfalls, 114 00:07:50,090 --> 00:07:53,720 because it's too short-sighted: it doesn't explore branches far enough. 115 00:07:53,720 --> 00:07:56,540 So it's generally a good principle, 116 00:07:56,540 --> 00:08:00,560 but if you apply it top down you might get yourself into trouble. 117 00:08:00,560 --> 00:08:02,040 So what can you do about this? 118 00:08:02,040 --> 00:08:06,360 This is where pruning comes in: it uses something like this criterion, but 119 00:08:06,360 --> 00:08:07,515 applies it bottom up. 120 00:08:07,515 --> 00:08:11,709 [MUSIC]
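To make the bottom-up idea concrete, here is a minimal sketch of reduced-error-style pruning (the tree representation, data, and function names are my own illustration, not the course's implementation): grow the tree fully, then walk it from the leaves up, collapsing any subtree whose replacement by a majority-vote leaf does not increase the error.

```python
# A node is either a bool (leaf prediction) or a dict with a 'feature'
# index and a 'branches' map from feature value to child node.
# Illustrative sketch only, not the course's implementation.

def classify(node, x):
    while isinstance(node, dict):
        node = node['branches'][x[node['feature']]]
    return node

def majority(rows):
    labels = [y for _, y in rows]
    return labels.count(True) >= labels.count(False)

def error(node, rows):
    return sum(classify(node, x) != y for x, y in rows) / len(rows)

def prune(node, rows):
    """Bottom up: prune the children first, then collapse this node into
    a majority-vote leaf if doing so does not increase the error."""
    if not isinstance(node, dict):
        return node
    for value, child in node['branches'].items():
        subset = [r for r in rows if r[0][node['feature']] == value]
        if subset:
            node['branches'][value] = prune(child, subset)
    leaf = majority(rows)
    if error(leaf, rows) <= error(node, rows):
        return leaf          # the subtree wasn't helping; chop it off
    return node

# The full xor tree survives pruning, because collapsing any subtree
# would raise the error from 0 back toward 0.5.
data = [((False, False), False), ((False, True), True),
        ((True, False), True),  ((True, True), False)]
xor_tree = {'feature': 0, 'branches': {
    False: {'feature': 1, 'branches': {False: False, True: True}},
    True:  {'feature': 1, 'branches': {False: True, True: False}}}}
pruned = prune(xor_tree, data)
print(error(pruned, data))  # 0.0 — no useful split was removed

# By contrast, a split whose branches all predict the same thing
# collapses to a single leaf.
useless = {'feature': 0, 'branches': {False: True, True: True}}
print(prune(useless, data))  # True
```

Because the decision is made from the leaves upward, the useless-looking split at the root of the xor tree is kept: by the time we consider collapsing it, its subtrees have already proven their worth, which is exactly what top-down early stopping cannot see.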