1
00:00:00,000 --> 00:00:04,463
[MUSIC]

2
00:00:04,463 --> 00:00:08,316
So far in this module, we've
discussed learning decision trees, but

3
00:00:08,316 --> 00:00:12,360
we only used what's called
categorical data inputs or features.

4
00:00:12,360 --> 00:00:19,390
So, we looked at credit, which could be poor,
fair, or excellent.

5
00:00:19,390 --> 00:00:23,210
However, if you look at income,
that's what's called a real-valued feature.

6
00:00:23,210 --> 00:00:25,215
It has continuous possible values.

7
00:00:25,215 --> 00:00:29,351
So 105,000, 73,000, 69,000, and so on.

8
00:00:29,351 --> 00:00:32,860
So the question is how do you build
a decision tree with this kind of input?

9
00:00:34,400 --> 00:00:37,440
One natural approach is
to just treat income or

10
00:00:37,440 --> 00:00:41,130
the continuous-valued feature
as if it were categorical data.

11
00:00:41,130 --> 00:00:46,280
So let's take that root node with 40
datapoints and just split on income.

12
00:00:46,280 --> 00:00:47,700
And see what happens.

13
00:00:47,700 --> 00:00:51,395
Well, there is one datapoint
with income of $30,000.

14
00:00:51,395 --> 00:00:54,693
There's one datapoint with income of $31,400.

15
00:00:54,693 --> 00:00:57,760
There's one datapoint with
income $39,500 and so on.

16
00:00:57,760 --> 00:01:01,390
And it turns out that the nodes
that we get out of them

17
00:01:01,390 --> 00:01:06,520
basically only have one datapoint in them.

18
00:01:06,520 --> 00:01:09,180
And this can be really, really bad.

19
00:01:09,180 --> 00:01:12,960
When you have very few data points in
an intermediate node of the decision tree,

20
00:01:12,960 --> 00:01:14,960
you're very prone to overfitting.

21
00:01:14,960 --> 00:01:17,410
Very prone to make
predictions you cannot trust.

22
00:01:17,410 --> 00:01:22,730
So, for example, if you look
here you'd predict in this case

23
00:01:22,730 --> 00:01:27,850
that if your income is $30,000,
this is definitely a risky loan,

24
00:01:27,850 --> 00:01:31,920
but if your income is $31,400,
it's definitely a safe loan;

25
00:01:31,920 --> 00:01:36,400
however, if your income is now $39,500,
you're back to risky.

26
00:01:36,400 --> 00:01:43,440
So [LAUGH] it's risky, safe, risky,
which doesn't make any sense.

27
00:01:43,440 --> 00:01:45,040
Do you trust it?

28
00:01:45,040 --> 00:01:46,080
I wouldn't.

29
00:01:46,080 --> 00:01:50,920
And so the question is, how do we
deal with these real-valued features?

30
00:01:52,070 --> 00:01:55,730
As a very natural alternative,
we can work with threshold splits.

31
00:01:55,730 --> 00:02:00,430
And these are simply picking
a threshold on the value

32
00:02:00,430 --> 00:02:03,466
of that continuous-valued feature,
so let's say 60,000.

33
00:02:03,466 --> 00:02:07,490
And on the left side of that split
we'll put all the data points that have income

34
00:02:07,490 --> 00:02:09,700
lower than $60,000, and on the right,

35
00:02:09,700 --> 00:02:13,050
we'll put all the data points that have
incomes higher than or equal to $60,000.

36
00:02:13,050 --> 00:02:17,988
And as we can see, we have a subset of
the data here, income higher than $60,000.

37
00:02:17,988 --> 00:02:23,220
And for
those we have many data points there.

38
00:02:23,220 --> 00:02:28,160
So, there's a lot less risk of
overfitting, and we see that 14 of

39
00:02:28,160 --> 00:02:33,070
them are safe loans, so
we'd probably predict safe there.

40
00:02:33,070 --> 00:02:39,150
While 13 are risky under $60,000, so
maybe you'll predict those as risky.

41
00:02:39,150 --> 00:02:42,520
So this is a very natural kind of
split that we might want to do

42
00:02:42,520 --> 00:02:43,590
with continuous-valued data.

43
00:02:45,140 --> 00:02:47,790
Let's now take a moment to visualize
what happens when we do this kind of

44
00:02:47,790 --> 00:02:49,050
threshold split.

45
00:02:49,050 --> 00:02:54,490
So for example, I've laid out my
income data on this line here that

46
00:02:54,490 --> 00:02:59,500
ranges from 10,000 to 120,000, and
if we pick a threshold split of 60,000 and

47
00:02:59,500 --> 00:03:03,800
we say everything on the left of the split
has income less than $60,000, which we're

48
00:03:03,800 --> 00:03:05,920
going to predict to be risky loans.

49
00:03:05,920 --> 00:03:10,050
Everything to the right has income higher
than $60,000, and we're going to predict those

50
00:03:10,050 --> 00:03:11,050
as being safe loans.

51
00:03:12,960 --> 00:03:16,190
Now let's suppose that we have
two continuous-valued features.

52
00:03:16,190 --> 00:03:21,070
We have income on the y-axis and
we have age on the x-axis, and

53
00:03:21,070 --> 00:03:22,110
let's see what happens here.

54
00:03:22,110 --> 00:03:26,540
And you'll see there are some positive and
negative examples laid out in 2D.

55
00:03:26,540 --> 00:03:29,760
Another thing that's interesting
is that you see that

56
00:03:29,760 --> 00:03:32,500
older people with higher incomes
tend to be safe loans, but

57
00:03:32,500 --> 00:03:35,590
also younger people that may have
lower incomes, those might also be

58
00:03:35,590 --> 00:03:39,430
safe loans because those people may
make money over time, let's say.

59
00:03:39,430 --> 00:03:44,980
So we might look at this data and
decide to split on age first.

60
00:03:44,980 --> 00:03:49,460
And if we split on age, let's say
age equals 38, we'll see that for

61
00:03:49,460 --> 00:03:51,590
the folks that are younger than 38,

62
00:03:51,590 --> 00:03:55,590
on average, more of them have risky loans,
so you might predict risky.

63
00:03:56,610 --> 00:04:00,010
But for
the folks that have age greater than 38,

64
00:04:00,010 --> 00:04:01,800
we have more safe loans than risky.

65
00:04:01,800 --> 00:04:02,920
So we might predict safe.

66
00:04:04,230 --> 00:04:06,860
Now to the next split
in our decision tree.

67
00:04:06,860 --> 00:04:11,050
For the folks that have
age greater than 38, we

68
00:04:11,050 --> 00:04:16,250
might split on income and ask whether
the income is greater than $60,000 or not.

69
00:04:16,250 --> 00:04:19,440
And if it is, we put a split there.

70
00:04:19,440 --> 00:04:24,440
And we'll see that the points
with income below $60,000,

71
00:04:24,440 --> 00:04:28,550
even at the higher age, might be negative,
so might be predicted negative.

72
00:04:30,590 --> 00:04:34,400
So let's take a moment to visualize
the decision tree we've learned so far.

73
00:04:34,400 --> 00:04:42,095
So we start from the root node over
here and we made our first split.

74
00:04:42,095 --> 00:04:46,959
And for our first split,
we decided to split on age.

75
00:04:46,959 --> 00:04:53,185
And the two possibilities
we looked at were,

76
00:04:53,185 --> 00:04:57,700
is the age smaller

77
00:04:59,000 --> 00:05:05,350
than 38 or is the age greater than or
equal to 38.

78
00:05:05,350 --> 00:05:09,070
So that was our first threshold split.

79
00:05:09,070 --> 00:05:13,400
And for those with age smaller than 38,
let's say that we stopped right here,

80
00:05:13,400 --> 00:05:17,170
we'd see that there's five risky and
three safe.

81
00:05:17,170 --> 00:05:18,550
So we'd predict risky.

82
00:05:20,370 --> 00:05:21,690
So that might be our leaf here.

83
00:05:22,890 --> 00:05:30,030
And for age greater than 38 we took
another split, which was on income.

84
00:05:30,030 --> 00:05:35,520
And we just asked ourselves,
is the income, is it

85
00:05:35,520 --> 00:05:42,430
less than 60,000 or
is it greater than or equal to 60,000?

86
00:05:42,430 --> 00:05:46,170
Now for
the ones that have income greater than or

87
00:05:46,170 --> 00:05:51,770
equal to $60,000 and age greater than 38,
we predicted those were safe loans.

88
00:05:52,790 --> 00:05:55,890
While the ones that had
age greater than 38 and

89
00:05:55,890 --> 00:06:00,130
income less than $60,000,
we predicted those to be risky loans.

90
00:06:01,420 --> 00:06:05,429
And this is an example of
a tree where we're making

91
00:06:05,429 --> 00:06:09,527
these binary splits on the data for
the continuous variables.

92
00:06:09,527 --> 00:06:13,625
[MUSIC]