Next, we're going to explore a different approach for finding simpler decision trees. We're going to grow them all the way, and then we're going to try pruning. The way we're going to prune is kind of like a stopping criterion: we prune the pieces that don't improve our error enough.

So, we have two ways to try to find simpler trees. One way was to stop recursing earlier, and the second one was to expand and then prune. Let's explore the second one, because it's a great complement to the first. Our goal is going to be to take the complex tree that we learned by going too far, and somehow chop some stuff off, remove it, and simplify it, in order to come up with what will be a better tree in terms of true error.

So, here's a kind of motivational picture. You might end up with a complex tree if you don't stop early, and you can imagine trying to come up with some fancy way to stop earlier. But instead of doing that, you just go a little too far and then we're going to backtrack. So we're going to start from this complex tree and we're going to simplify it until we get a reasonable tree that has maybe higher training error, but has better true error.

The fundamental question, though, is what makes a tree simpler? Clearly, if you have a super deep tree on the left versus a shallow tree on the right, and you're looking at depth, then here's a technical term that I like to use: no-brainer. It is a no-brainer that the tree on the right is simpler than the tree on the left. So, is depth the measure we want to use to quantify what the complexity of these trees is? It turns out that depth is not a very good metric, and the core problem is that some features might have more values than others. And so, some features might lead to more splits than others. So here, both trees have the same depth, and the question is: which one is simpler?

We will see this problem all the time. Intuitively, the more leaves you have, the more complex the tree is. So we're going to introduce this function, L(T), which just measures the number of leaves the tree has.
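As a rough illustration of what L(T) computes, here is a minimal Python sketch. The dictionary-based tree, where each node has a 'children' list that is empty at a leaf, is an assumed representation for illustration only, not the course's actual data structure.

```python
def num_leaves(tree):
    """L(T): the number of leaf nodes in the (sub)tree.

    Assumes each node is a dict with a 'children' list,
    which is empty when the node is a leaf.
    """
    if not tree['children']:   # no children -> this node is a leaf
        return 1
    return sum(num_leaves(child) for child in tree['children'])
```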
And so the tree on the left here has five leaves, and the tree on the right has two leaves, so we're going to say the tree on the right is simpler. And now we're getting somewhere interesting: now we can start talking about the complexity of decision trees. So for example, this super complex one on the left had one, two, three, four, five, six leaves, while the one that's too simple on the right had only one leaf, and somehow I want to find a solution in between. A solution in between will allow us to go deeper in some branches and shallower in others, while trying to keep the number of leaves as small as possible.

And so, just like in other areas of machine learning, we're going to introduce some kind of trade-off between these two concepts. The two concepts we want to trade off, the two quantities, are how well my tree fits the data and what the complexity of the tree is. So the total cost is going to be a measure of fit plus a measure of complexity. We saw this in regression. We saw this in logistic regression. We're seeing this again in decision trees. Here, the measure of fit is just the classification error on the training data, so training error, and the measure of complexity is going to be the number of leaves, and we're going to balance the two.

So more specifically, we're going to use Error(T) to denote the error, the classification error on the training data, and L(T) to denote the number of leaves. And as we will see, we'll have a magic parameter to trade off between those two quantities. In particular, we're going to use lambda to denote that magic parameter, which we can fit using a validation set or with cross validation. When lambda equals 0, we're back to the standard decision tree learning process. When lambda is infinite, we have infinite penalty, which means we're going to learn a tree that has no decisions in it, so just the root. So we're going to learn something like this: a root and then a class. And in this case, we're just going to say that y hat for all data points is the majority class. In other words, when lambda is equal to infinity, we're just going to output the majority class.
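To make the trade-off concrete, here is a hedged Python sketch of the total cost, Total cost(T) = Error(T) + lambda * L(T), building on the num_leaves sketch above. The 'split' function stored at each decision node, used to route an example to a child, is an assumption for illustration, not the course's API.

```python
def classify(tree, x):
    """Route one example x down the assumed dict-based tree to a prediction."""
    while tree['children']:
        branch = tree['split'](x)        # assumed: each decision node stores a
        tree = tree['children'][branch]  # function mapping x to a child index
    return tree['prediction']

def classification_error(tree, data, labels):
    """Error(T): fraction of training points the tree misclassifies."""
    if not labels:
        return 0.0
    mistakes = sum(classify(tree, x) != y for x, y in zip(data, labels))
    return mistakes / len(labels)

def total_cost(tree, data, labels, lam):
    """Total cost = Error(T) + lambda * L(T).

    lam = 0 recovers plain training-error minimization; a very large lam
    drives the tree all the way down to a single majority-class leaf.
    """
    return classification_error(tree, data, labels) + lam * num_leaves(tree)
```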
When lambda is equal to 0, we get a crazy deep decision tree. When lambda is in between, that's the interesting case: this is what's going to balance the fit and the complexity of the tree. More specifically, we're going to go from that complex tree to a simple tree by using a pruning approach, where the pruning is based on this function, which balances Error(T) plus lambda times the number of leaves in T.
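Putting these pieces together, here is a minimal sketch of what a bottom-up pruning pass driven by this cost could look like, reusing the num_leaves and total_cost sketches above. The routing via the assumed 'split' function, the majority_class helper, and the rule of pruning whenever collapsing a subtree does not increase the cost are all illustrative assumptions; the course's actual pruning procedure may differ in its details.

```python
from collections import Counter

def majority_class(labels):
    """Most common label among the points reaching a node."""
    return Counter(labels).most_common(1)[0][0]

def prune(tree, data, labels, lam):
    """Bottom-up pruning sketch: collapse a subtree into a single
    majority-class leaf whenever doing so does not increase
    Error(T) + lam * L(T) on the data reaching this node.
    """
    if not labels or not tree['children']:
        return tree                                  # nothing to prune here
    # Prune each child first, passing down only the points that reach it.
    for i, child in enumerate(tree['children']):
        reaches = [tree['split'](x) == i for x in data]
        sub_data = [x for x, r in zip(data, reaches) if r]
        sub_labels = [y for y, r in zip(labels, reaches) if r]
        tree['children'][i] = prune(child, sub_data, sub_labels, lam)
    # Candidate: replace this whole subtree with one majority-class leaf.
    leaf = {'children': [], 'prediction': majority_class(labels)}
    if total_cost(leaf, data, labels, lam) <= total_cost(tree, data, labels, lam):
        return leaf
    return tree
```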