So now we've discussed the idea of building a decision stump from data, and a little bit of what that data set looks like. Let's discuss now how to pick the right feature to split on when we're building a decision stump.

So we're trying to learn a decision stump from data, and in our example we chose to split on credit. But the question is, what is the best feature to split on, and how do we measure that? We split on credit, but we could have split on something else, say the term of the loan: is this a three-year loan or a five-year loan? And the question is, which of the two is better? What is a better split? That's what we're going to ask next.

Intuitively, a better split is one that gives you the lowest classification error, and that's exactly what we'll explore in the algorithm. We'd like to figure out whether it's better to split on credit or to split on term. The way we're going to do that is by measuring the number of mistakes each of the decision stumps makes and picking the one that makes the fewest mistakes, so it has the lowest error. Just remember that the error is the number of mistakes a classifier makes divided by the total number of data points.

Let's start with the root node, which is what we get if we make no splits at all, and measure the error in that case. As a reminder, we're going to predict y hat to be the majority class associated with a particular node. In our case, the class with the most data points at the root is the safe class, and we're going to compute the classification error of making that prediction. For this classification error, we're simply predicting that all the data points are safe. That gives 22 correct predictions, because 22 loans were safe, and 18 mistakes. So the classification error here is 18 / (22 + 18), which is 18 over 40, or 0.45. For a binary classification problem, an error of 0.45 is really bad. So not splitting on anything gives you a pretty bad result.
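To make the arithmetic concrete, here is a minimal Python sketch using the counts from the example above (22 safe loans, 18 risky loans): it picks the majority class at the root and computes the classification error of predicting that class for every data point.

```python
# Classification error at the root node: predict the majority class for all points.
n_safe, n_risky = 22, 18           # label counts in the training data (from the example)
n_total = n_safe + n_risky         # 40 data points

majority_class = "safe" if n_safe >= n_risky else "risky"
n_mistakes = min(n_safe, n_risky)  # every point in the minority class is misclassified

error = n_mistakes / n_total
print(majority_class, error)       # safe 0.45
```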
So the question here is, how good is the decision stump that splits on credit? How does it compare to not splitting on anything, which had a classification error of 0.45? Is this one better?

Let's look at the decision stump. For data points that have excellent credit, we're going to predict that they're safe. For those that have fair credit, we're also going to predict that they're safe. And for those that have poor credit, we're going to predict that they're risky. This is our prediction, and again it's the majority value of the data in each of these nodes. If you look at how many mistakes we make, you'll see that for the data with excellent credit we make zero mistakes, because everything was safe there. For the data with fair credit we make four mistakes, because there were four risky loans with fair credit. And for the data with poor credit we again make four mistakes, because there were four safe loans with poor credit. So let's compute our overall error. We make 4 + 4 mistakes, so that's 8 out of 40 data points, which is an error of 0.2. That's smaller than the 0.45 we had before. We've gone down from 0.45 to 0.2, so splitting on credit seems like a pretty good idea.

Now let's see what happens when we split on the term of the loan. If the term is three years, there are 16 safe loans and 4 risky ones, so in this branch we're making four mistakes. For five years we predict risky, but there were six safe loans, so we're making six mistakes. If you look at the overall error here, it's (4 + 6) / 40. That's 10 divided by 40, which is 0.25.

So overall, if we look at our data: not splitting on anything, the root node, has 0.45 error; splitting on credit has 0.2 error; and splitting on term has 0.25 error. We can go back and ask, what is the best choice? Should we split on credit, or should we split on term? The answer now becomes obvious. Splitting on credit gives the lower classification error, so this is what our greedy algorithm will do first.
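The same counting argument applies to any candidate split: each branch predicts its majority class, so it contributes its minority count to the total number of mistakes. A small sketch of that calculation follows. The term-split counts (16 safe / 4 risky for three years, 6 safe / 14 risky for five years) come straight from the example; for the credit split, only the per-branch mistake counts (0, 4, 4) are given in the lecture, so the safe/risky breakdowns used below are illustrative assumptions chosen to be consistent with them.

```python
def split_error(branches):
    """branches: list of (n_safe, n_risky) counts, one pair per branch of the split.
    Each branch predicts its majority class, so it contributes min(safe, risky) mistakes."""
    mistakes = sum(min(s, r) for s, r in branches)
    total = sum(s + r for s, r in branches)
    return mistakes / total

# Term split (3 years, 5 years), taken directly from the example.
print(split_error([(16, 4), (6, 14)]))         # 0.25

# Credit split (excellent, fair, poor). The safe counts for excellent and fair
# are assumed for illustration; only the mistake counts (0, 4, 4) are stated.
print(split_error([(9, 0), (9, 4), (4, 14)]))  # 0.2
```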
Credit is the first feature to split on; it's the winner of our selection process.

So in general, the decision tree splitting process says: given the subset of data at a node M, which so far has been the root node, try out every feature x_i, which in our case means credit, term, and income. For each of these features, we consider splitting the data according to its possible values, and we compute the classification error of the resulting split, just like we did manually above. Then we pick the feature with the lowest classification error, which in our case was credit. (A small code sketch of this selection step follows below.)

So if we go back to our decision tree learning algorithm, the first challenge we had, figuring out what feature to split on, can now be addressed using this feature split selection algorithm that minimizes the classification error. Next, we'll explore the other parts of this decision tree learning algorithm.
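Here is a hedged Python sketch of that feature split selection step: given the data at a node, try every feature, split on its values, and keep the feature whose split has the lowest classification error. The function and variable names (best_split_feature, data as a list of dicts with a "label" key) are illustrative choices, not the course's actual code.

```python
from collections import Counter

def classification_error(labels):
    """Error of predicting the majority label for this group of points."""
    if not labels:
        return 0.0
    majority_count = Counter(labels).most_common(1)[0][1]
    return (len(labels) - majority_count) / len(labels)

def best_split_feature(data, features, label_key="label"):
    """Greedy step: pick the feature whose split has the lowest classification error.

    data: list of dicts, e.g. {"credit": "fair", "term": "3 years", "label": "safe"}.
    """
    best_feature, best_error = None, float("inf")
    for feature in features:
        # Group the node's data points by the feature's values.
        groups = {}
        for point in data:
            groups.setdefault(point[feature], []).append(point[label_key])
        # Error of the split = total mistakes across all branches / total points.
        mistakes = sum(len(lbls) * classification_error(lbls) for lbls in groups.values())
        error = mistakes / len(data)
        if error < best_error:
            best_feature, best_error = feature, error
    return best_feature, best_error
```

Called with the loan data represented this way and features ["credit", "term", "income"], this would pick credit with an error of 0.2, matching the choice made in the lecture.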