[MUSIC] Often, the best way to prevent overfitting in decision trees is to use pruning, where you grow the tree to be much bigger than you need and then chop off the parts that turn out to be less important. Pruning is very important if you're implementing a real-world decision tree algorithm. However, the details can get a little tedious. It's not too mathematical, just a lot of little things to get right, a lot of Ts to cross and Is to dot. So we've made this section optional. You're welcome to watch it; there's not a lot of math, just a lot of detail. I'd especially recommend paying attention if you're implementing your own decision tree learner from scratch. If not, just know what pruning is, which we just talked about, and know that most good implementations out there will use some kind of pruning.

We discussed the recursion where we build the decision tree from the top down, and we discussed two types of stopping conditions: the stopping conditions we should always apply, from the previous module, and the three early stopping conditions from this module. Now let's explore some of the challenges that come about with these early stopping conditions. Early stopping condition one was: pick a max depth, and stop recursing once you reach that max depth. The question is, how do you actually pick the max depth? We said you should use cross validation or a validation set, but there are challenges: you have to set data aside for that, you have to rerun your algorithm many times, and you might not always end up with the answer you'd like. Plus, you want some parts of the tree to grow more than others, so it becomes complicated to set this parameter.

Now, early stopping condition two is the most dangerous one. It says: stop if the training error stops decreasing, that is, if the classification error on the training data stops going down and flattens out. However, it turns out that in practice the curve might not be as beautiful as what I drew. It might not go down smoothly; it might jump around and do some weird stuff. Let's see why that can happen.

As we discussed, early stopping condition two, stop when the error doesn't decrease, can be dangerous. To show how it's dangerous, we need a counterexample, and we'll use the counterexample for everything, which, if you took Course 1, you know is xor. So here's our data, and it corresponds to the xor of the first input and the second input: when both are false, the output is false; when one is false and the other is true, the output is true, which happens in the two middle rows; and when both are true, the output is false. If you ever need a counterexample for anything, just say xor. In fact, if someone calls you in the middle of the night and asks, what is the counterexample for this crazy problem I'm having? You say, xor.

So what happens with xor when we just look at the root of the decision tree? Well, we have two rows with output true and two rows with output false. So what is the error of this decision tree if we stop at the top level? You just choose one of the two classes, because they're tied; let's say we choose true, because we're optimists, we like to think positively. So we're going to say that the output here is +1, which means we're making two mistakes. Two mistakes out of four data points, so our error is 0.5, which is bad.
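To make that arithmetic concrete, here is a minimal Python sketch, not code from the course: it builds the four-row xor dataset and computes the classification error of a majority-class prediction at the root. The row encoding and the `majority_error` helper are illustrative names I'm assuming, not anything from the course materials.

```python
# Minimal sketch (illustrative only): the xor dataset, with True/False
# outputs encoded as +1/-1, and the error of predicting the majority
# class at the root node.
data = [
    {"x1": False, "x2": False, "y": -1},
    {"x1": False, "x2": True,  "y": +1},
    {"x1": True,  "x2": False, "y": +1},
    {"x1": True,  "x2": True,  "y": -1},
]

def majority_error(rows):
    """Classification error if we predict the majority class for these rows."""
    positives = sum(1 for r in rows if r["y"] == +1)
    negatives = len(rows) - positives
    # The majority prediction gets the minority count wrong.
    return min(positives, negatives) / len(rows)

print(majority_error(data))  # 0.5: two mistakes out of four data points
```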
That means you're as good as random: an error of 0.5. Okay, let's see what happens if we split on x[1]. If we split on x[1], you're actually splitting the table like this, and you see that, because xor is so beautiful, each of the resulting nodes is in the same situation. For example, in the branch where x[1] is true, you have one y true and one y false. So let's suppose you predict +1 for this side, and also +1 for the side where x[1] is false, because it's a tie either way. So what's the error? Well, mistakes were made: one here and one here, so we made 1 + 1 = 2 mistakes out of 4, which is 0.5. So it turns out that splitting on x[1] does not help.

Okay, and what happens when you try to split on x[2]? If I split on x[2], I end up with the same situation, except I change this label here from x[1] to x[2]. Again, we're in exactly the same place: we output +1 on each of these outcomes and make the same number of mistakes, one on each side, for a total error of 0.5. So, all in all, neither splitting on x[1] nor splitting on x[2] helps me reduce the error. Both splits give you the same error, so the question is, should I split now? If neither feature improves the error and you're using early stopping condition two, then you say no, don't take either split. You just stop splitting the decision tree, output true at the very root, make two mistakes, and the error you end up with is 0.5. So, [LAUGH] this is why xor is a counterexample to everything.

What would happen if you didn't use early stopping condition two and you just kept going? If you kept going, you'd get the tree shown here, and each of its leaves makes zero mistakes. The leftmost leaf outputs false, or -1, and makes zero mistakes; the next one makes zero mistakes; the third makes zero mistakes; and the last one makes zero mistakes. So if you kept going, if you didn't use early stopping condition two, you would get a training error of zero. In this case, there's a huge gap between a training error of 0.5, which is as good as random, and a training error of zero, which is perfect. Early stopping condition two created this arbitrarily large gap between the best you could have done and what you end up outputting.

So we saw early stopping condition two: stop recursing if the training error does not decrease enough. It seems like a reasonable heuristic, but it has some dangerous pitfalls, because it's too short-sighted and doesn't explore branches far enough. It's generally a good principle, but if you apply it top down, you might get yourself into trouble. So what can you do about this? This is where pruning comes in: it uses a similar criterion, but applies it bottom up. [MUSIC]
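To check the rest of the numbers from this example, here is a continuation of the earlier sketch; it reuses the `data` list and `majority_error` helper defined there, and `error_after_split` is again just an assumed, illustrative helper. It shows that a single split on either feature leaves the training error at 0.5, which is why early stopping condition two gives up at the root, while splitting on both features drives the training error to zero.

```python
def error_after_split(rows, feature):
    """Training error if we split once on `feature` and predict the
    majority class in each of the two branches."""
    left = [r for r in rows if not r[feature]]
    right = [r for r in rows if r[feature]]
    mistakes = (majority_error(left) * len(left) +
                majority_error(right) * len(right))
    return mistakes / len(rows)

print(error_after_split(data, "x1"))  # 0.5: no better than the root
print(error_after_split(data, "x2"))  # 0.5: no better either
# Early stopping condition two halts here, leaving training error at 0.5.

# Growing the full tree (split on x1, then on x2) separates the rows perfectly.
leaves = {}
for r in data:
    leaves.setdefault((r["x1"], r["x2"]), []).append(r)
full_tree_mistakes = sum(majority_error(rows) * len(rows) for rows in leaves.values())
print(full_tree_mistakes / len(data))  # 0.0: zero mistakes at the leaves
```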