[MUSIC] At the core of the decision tree algorithm is choosing the feature to split on, and that's the place we're going to modify it to also choose what to do with missing data.

Arguably, the most important question in decision tree learning is figuring out what feature to split on. So now, when we look at what feature to split on, we're not just going to decide what feature to split on, we're also going to decide which branch to take when we see a missing value. Conceptually, it's just a very simple modification of the algorithm we described in the previous module: we're going to add an extra search to figure out where to put that missing data.

Just like with greedy learning of decision trees, we can focus on learning a decision stump to understand what's going on here. So, say we chose to split on credit. Credit can be excellent, fair, or poor, and the branch here on the right is the one where all data points have credit equal to poor. Now, we might decide that the missing data is going to go into that branch. So we're going to take all the data where credit was missing, together with the data where credit equals poor, and feed it into that node. That's what happens with categorical data.

When we have numeric data, we use the threshold splits we discussed in the previous module. In this case, the threshold is whether income is less than $50,000 or greater than or equal to $50,000, and we might decide that missing data, that is, data where the income is unknown, gets fed into the same node as the data where income was less than $50,000. Those are the decisions we're going to try to make inside of our learning algorithm.

Now, we have to decide where this missing data is going to go. So let's say we're considering splitting on credit. If we split on credit, what should we do with the missing data? Should we put it with the people with excellent credit, with the people with fair credit, or with the people with poor credit? We don't know in advance what's best, but we have a principle for doing this. The principle says: choose the branch that leads to the lowest classification error, just like we did before.
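To make this principle concrete, here is a minimal sketch in Python (not the lecture's own code; the function names and the None-for-missing convention are assumptions for illustration). For a categorical split, it tries routing the missing-value rows down each branch in turn and keeps the assignment with the lowest classification error.

```python
from collections import Counter

def majority_mistakes(labels):
    # Number of mistakes made if we predict the majority class
    # for this group of labels.
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def best_branch_for_missing(data, labels, feature):
    # `data` is a list of dicts; a missing value is stored as None.
    # Returns (branch_value, error): the branch the missing rows
    # should join, and the resulting classification error.
    observed = [(row[feature], y) for row, y in zip(data, labels)
                if row[feature] is not None]
    missing_labels = [y for row, y in zip(data, labels)
                      if row[feature] is None]
    branches = sorted({value for value, _ in observed})

    best = None
    for candidate in branches:
        mistakes = 0
        for branch in branches:
            branch_labels = [y for v, y in observed if v == branch]
            if branch == candidate:
                branch_labels += missing_labels  # missing rows join here
            mistakes += majority_mistakes(branch_labels)
        error = mistakes / len(labels)
        if best is None or error < best[1]:
            best = (candidate, error)
    return best
```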
Let's say we're considering splitting on credit, and we'll consider assigning the missing data to the branch where the credit was excellent. Let's see what happens. In this example, we have three data points with missing data, these three here, and we're considering assigning those data points to the branch where the credit was excellent.

So, we have two kinds of data. We have the observed data: in the excellent branch, we had 9 safe loans and 0 risky; in the fair branch, we had 8 safe loans and 4 risky; and in the poor branch, we had 4 safe loans and 12 risky. Now, we have the three extra data points, and as you can see, among these three extra data points there are two risky and one safe. So, we're thinking about adding these two risky and one safe into the excellent branch. What are the grand totals here? The grand totals are 10 safe and 2 risky inside the excellent branch, and then the same as before for the other two branches, because we're only adding data to the excellent branch.

So what's the overall error of the resulting decision tree? Well, for the excellent branch, we're going to predict safe, so we're going to make 2 mistakes. For the fair branch, we're going to predict safe, so we're going to make 4 mistakes. And for the poor branch, we're going to predict risky, so we're going to make 4 mistakes. So overall, we're going to make 10 mistakes out of 40 total data points, so the error here is 0.25, one quarter.

Now we do the same for every single option, every single branch. So, we tried putting the unknown data with the excellent branch and got error 0.25. We tried putting it with the fair branch, and we get a 0.25 error as well. But if we try putting it with the last branch here, the poor branch, we get an error of 0.225. That's a lower error, so this is the winner. The best thing to do is to assign the unknowns to the poor branch, to credit equals poor, which is exactly what we did in the earlier tree.

This is the modification of the tree splitting algorithm that takes missing data into account. It gets a little complicated, but it's basically the simple idea we just described. We're leaving it here for reference, in case you choose to implement it.
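To check the arithmetic in this example, here is a small hypothetical snippet using the (safe, risky) counts from the slide: 9/0, 8/4, and 4/12 observed, plus 1 safe and 2 risky loans with missing credit.

```python
# Observed (safe, risky) counts per branch, plus the missing rows.
observed = {'excellent': (9, 0), 'fair': (8, 4), 'poor': (4, 12)}
missing_safe, missing_risky = 1, 2
total = sum(s + r for s, r in observed.values()) + missing_safe + missing_risky  # 40

for candidate in observed:
    mistakes = 0
    for branch, (safe, risky) in observed.items():
        if branch == candidate:
            safe, risky = safe + missing_safe, risky + missing_risky
        mistakes += min(safe, risky)  # majority-class prediction errors
    print(candidate, mistakes / total)
# excellent 0.25, fair 0.25, poor 0.225 -> route unknowns to 'poor'
```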
But overall, the fundamental concept is that you can embed the handling of missing data inside the learning algorithm to get better overall performance. [MUSIC]
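Putting the pieces together, here is a sketch (again with hypothetical names, reusing best_branch_for_missing from the earlier snippet) of the modified greedy split selection: search jointly over the candidate features and over which branch should receive the missing values.

```python
def best_split_with_missing(data, labels, features):
    # For each candidate feature, also search over which branch
    # the missing values join; keep the (feature, branch) pair
    # with the lowest classification error.
    best = None  # (feature, branch_for_missing, error)
    for feature in features:
        branch, error = best_branch_for_missing(data, labels, feature)
        if best is None or error < best[2]:
            best = (feature, branch, error)
    return best

# Hypothetical usage on the loan example:
# feature, branch, err = best_split_with_missing(loans, labels,
#                                                ['credit', 'term', 'income'])
```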