[MUSIC] So far in this module, we've discussed learning decision trees, but we've only used what are called categorical inputs or features. So we looked at credit, which could be poor, fair, or excellent. However, if you look at income, that's what's called a real-valued feature: it has continuous possible values, like $105,000, $73,000, $69,000, and so on. So the question is, how do you build a decision tree with this kind of input?

One natural approach is to just treat income, or any continuous-valued feature, as if it were categorical data. So let's take that root node with 40 data points, just split on income, and see what happens. Well, there's one data point with income of $30,000, one data point with income of $31,400, one data point with income of $39,500, and so on. And it turns out that the nodes we get basically only have one data point each. And this can be really, really bad. When you have very few data points in an intermediate node of the decision tree, you're very prone to overfitting, very prone to making predictions you cannot trust. So, for example, if you look here, you'd predict that if your income is $30,000, this is definitely a risky loan, but if your income is $31,400, it's definitely a safe loan; however, if your income is $39,500, you're back to risky. So [LAUGH] it's risky, safe, risky, which doesn't make any sense. Do you trust it? I wouldn't. And so the question is, how do we deal with these real-valued features?

A very natural alternative is to use threshold splits. These simply pick a threshold on the value of the continuous-valued feature, let's say $60,000. On the left side of that split, we'll put all the data points with income lower than $60,000, and on the right, all the data points with income higher than or equal to $60,000. And as we can see, the subset of the data with income higher than or equal to $60,000 contains many data points, so there's a lot less risk of overfitting. We see that 14 of them are safe loans, so we'd probably predict safe there, while of those with income below $60,000, 13 are risky, so maybe we'd predict those as risky. So this is a very natural kind of split that we might want to make with continuous-valued data.

Let's now take a moment to visualize what happens when we do this kind of threshold split. For example, I've laid out my income data on a line here that ranges from $10,000 to $120,000. If we pick a threshold split of $60,000, everything to the left of the split has income less than $60,000, and we're going to predict those to be risky loans; everything to the right has income higher than $60,000, and we're going to predict those as safe loans.

Now let's suppose that we have two continuous-valued features: income on the y-axis and age on the x-axis. Let's see what happens here. You'll see there are some positive and negative examples laid out in 2D. Another interesting thing you'll see is that older people with higher incomes tend to be safe loans, but younger people, who may have lower incomes, might also be safe loans, because those people may make more money over time, let's say. So we might look at this data and decide to split on age first. And if we split on age, let's say at age equals 38, we'll see that for the folks who are younger than 38, on average more of them have risky loans, so you might predict risky. But for the folks with age greater than 38, we have more safe loans than risky, so we might predict safe.
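To make this concrete, here's a minimal sketch in Python of how a threshold could be chosen (the function name, data layout, and toy numbers below are my own for illustration, not from the lecture): sort the feature values, take the midpoint between each pair of consecutive distinct values as a candidate threshold, and keep the candidate whose two sides make the fewest majority-class mistakes, i.e., the lowest classification error.

```python
import numpy as np

def best_threshold_split(values, labels):
    # Hypothetical helper: scan midpoints between consecutive sorted
    # values and return the threshold with the lowest classification error.
    order = np.argsort(values)
    v, y = values[order], labels[order]
    best_t, best_err = None, np.inf
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue  # identical values leave no room for a threshold
        t = (v[i] + v[i - 1]) / 2.0  # candidate threshold: the midpoint
        left, right = y[v < t], y[v >= t]
        # Each side predicts its majority class, so its error is the
        # size of the minority class on that side.
        err = (min(np.sum(left == +1), np.sum(left == -1)) +
               min(np.sum(right == +1), np.sum(right == -1)))
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# Toy data: incomes with labels +1 = safe, -1 = risky.
incomes = np.array([30000, 31400, 39500, 61000, 73000, 105000])
labels = np.array([-1, -1, -1, +1, +1, +1])
print(best_threshold_split(incomes, labels))  # threshold 50250.0, error 0
```

Note that only midpoints between consecutive sorted values need to be checked: any threshold falling between the same two data points produces exactly the same split, so these candidates cover every distinct split the feature allows.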
Now, on to the next split in our decision tree. For the folks with age greater than 38, we might choose to split on income, asking whether the income is greater than $60,000 or not. If we put a split there, we'll see that the points with income below $60,000, even at the higher ages, are mostly negative, so they might be predicted negative.

So let's take a moment to visualize the decision tree we've learned so far. We start from the root node over here, and we made our first split. For our first split, we decided to split on age, and the two possibilities we looked at were: is the age smaller than 38, or is the age greater than or equal to 38? That was our first threshold split. For those with age smaller than 38, let's say that we stop right there: we'd see that there are five risky and three safe, so we'd predict risky. That might be our leaf here. For age greater than or equal to 38, we took another split, which was on income, and we just asked ourselves: is the income less than $60,000, or is it greater than or equal to $60,000? For the ones with age greater than or equal to 38 and income greater than or equal to $60,000, we predicted safe loans, while for the ones with age greater than or equal to 38 and income less than $60,000, we predicted risky loans. And this is an example of a tree where we're making binary splits on the data for the continuous variables.
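Written out as code, this learned tree is nothing more than nested threshold comparisons. Here's an illustrative sketch in Python using the two thresholds from the example (the predict function itself is mine, not something defined in the lecture):

```python
# Illustrative only: the two-level tree from this example, written as
# nested threshold splits (age at 38, income at $60,000).
def predict(age, income):
    if age < 38:
        return "risky"   # leaf: 5 risky vs. 3 safe, so majority is risky
    elif income < 60000:
        return "risky"   # age >= 38 but income < $60,000
    else:
        return "safe"    # age >= 38 and income >= $60,000

print(predict(45, 72000))  # safe
print(predict(45, 40000))  # risky
print(predict(25, 90000))  # risky
```

[MUSIC]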