[MUSIC]

With all that in mind, let's just dig in and observe how overfitting comes about in the context of decision trees. When we start learning a decision tree, we learn a decision stump, which is a very simple boundary in the data; a simple example would be something like a vertical or a horizontal line. In this case, the decision is with respect to x(1): whether it's less than -0.07 or greater than -0.07. If it's less than -0.07, we predict -1, which corresponds to the left side of the decision boundary, but if it's greater than -0.07, we predict +1, and we fall on the right side of the decision boundary.

So that's a very simple decision boundary, from just a depth-1 decision tree, or decision stump. But as we increase the depth, the situation becomes more and more complex: the decision boundaries become more and more complex. We can see that as we increase the depth, we increase the complexity of the decision boundaries, until we end up with a super crazy one where the depth is 10. That depth-10 decision boundary has 0.00 training error. So the training error goes down from 0.22, which is relatively high, for the decision stump, through 0.13 and 0.10 for the depth-2 and depth-3 decision trees, which looked okay, down to this crazy depth-10 decision tree with zero training error. And this should be a big warning sign for you. Here we should add a sad face.

We've seen that with decision trees, as we increase the depth, the training error goes down until we get to a point where the training error can reach zero, and very often it does, as in this case with the decision tree of depth 10. So we could say, the decision tree of depth 10, that's great, it has zero training error, so it's a perfect decision tree. But in reality, it's not a perfect decision tree. As we know, even though the training error is zero, the true error can shoot up, and so this can be a highly overfitting decision tree. So it's good to take a step back and really understand a little better why the training error of decision trees tends to go down so quickly with depth. Let's take this simple example where we're just learning a decision stump.
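The lecture's plots aren't reproduced here, but the pattern is easy to recreate. As a rough sketch, assuming scikit-learn and synthetic two-feature data rather than the course's own tooling and dataset, you can watch the training error of a decision tree fall toward zero as its maximum depth grows:

```python
# Rough sketch (not the course's code): training error of decision trees of
# increasing depth on synthetic 2-feature data. The error falls toward zero
# as the tree gets deeper, mirroring the 0.22 -> 0.13 -> 0.10 -> 0.00 pattern
# described for depths 1, 2, 3, and 10 (exact numbers will differ here).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=0.5, random_state=0)

for depth in [1, 2, 3, 10]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    train_error = 1 - tree.score(X, y)
    print(f"max_depth={depth:2d}  training error={train_error:.2f}")
```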
We have 40 data points, just like we've been using: 22 of them were safe loans and 18 were risky. And we chose first to split on credit. Now the question is, why did we choose to split on credit first? Why was that the first feature we chose? The reason is that splitting on credit improved the training error the most, from 0.45 to 0.20. So that was a good first split to make.

Now, if we go back and review the algorithm for choosing the best split, the best feature to split on, we try every single possible feature and pick the one that decreases the training error the most (a small sketch of this greedy choice follows below). So at every step of the way, we ask which features decrease the training error, we keep adding features that decrease it, and eventually we drive the training error to zero. Unless, of course, we get to some point where we can't drive the training error down any further, because we've run out of features to split on and we have positive points on top of negative points; but that's a side note. Most important is to remember that the training error tends to go down, down, down, down. And that going down as we increase the depth is what leads to this low training error, which often leads to these very complex trees, which are very prone to overfitting.

And here's a real-world loan dataset where we actually observe that big, bad overfitting problem. If we take the depth of the tree and push it all the way to depth 18, we'll see that the training error has gone down a lot: the blue line gets you down to about 8% training error, which is extremely low. However, if you look at the validation set error, that was not so good. It's around 39%, and so there's a big gap between the two, which we'll characterize as a form of overfitting.
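To make the greedy choice just described concrete, here is a minimal sketch in plain Python: among the candidate features, pick the one whose split gives the lowest weighted training error. The feature names and the tiny dataset are illustrative stand-ins, not the course's actual loan data.

```python
# Sketch of the greedy split choice described in the lecture: try every
# feature and pick the one that decreases the training error the most.
# Feature names and data below are illustrative, not the course's dataset.
def node_error(labels):
    """Classification error if we predict the majority class at this node."""
    if not labels:
        return 0.0
    majority = max(set(labels), key=labels.count)
    return sum(1 for y in labels if y != majority) / len(labels)

def best_split(data, labels, features):
    """Return the feature whose split yields the lowest weighted training error."""
    best_feature, best_err = None, float("inf")
    n = len(labels)
    for feature in features:
        values = set(row[feature] for row in data)
        # Weighted error across the children created by splitting on `feature`.
        err = 0.0
        for v in values:
            child = [y for row, y in zip(data, labels) if row[feature] == v]
            err += len(child) / n * node_error(child)
        if err < best_err:
            best_feature, best_err = feature, err
    return best_feature, best_err

data = [{"credit": "excellent", "term": "3yr"}, {"credit": "excellent", "term": "5yr"},
        {"credit": "excellent", "term": "3yr"}, {"credit": "poor", "term": "5yr"},
        {"credit": "poor", "term": "3yr"},      {"credit": "poor", "term": "5yr"}]
labels = ["safe", "safe", "safe", "risky", "risky", "safe"]

# Splitting on credit lowers the error more than splitting on term,
# so credit is chosen first: prints ('credit', 0.166...).
print(best_split(data, labels, ["credit", "term"]))
```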
If somehow you were able to pick the best depth for a decision tree, which in this case was depth 7, then you'd notice that the validation error is just under 35%. In other words, if I could pick the right depth, I would get a validation error just under 35%. But if I let the tree grow until it reaches very low training error, I get a validation error of 39%. So going all the way: bad idea. Gotta stop a little earlier.

[MUSIC]
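A minimal sketch of that kind of depth selection, again assuming scikit-learn and a synthetic dataset in place of the loan data, could look like this: train a tree at each depth and keep the depth with the lowest validation error rather than the lowest training error.

```python
# Sketch: pick the tree depth by validation error, not training error.
# Synthetic data stands in for the lecture's loan dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=1)

depths = range(1, 19)   # depths 1 through 18, echoing the lecture's plot
valid_errors = []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_train, y_train)
    valid_errors.append(1 - tree.score(X_valid, y_valid))

best_depth = depths[int(np.argmin(valid_errors))]
print(f"best depth by validation error: {best_depth}")
```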