[MUSIC] Next, we're going to explore a different approach for finding simpler decision trees. We're going to grow them all the way and then we're going to try pruning. The way we're going to prune is a bit like the stopping criteria: we prune things that don't improve our error enough. So, we have two ways to try to find simpler trees. One way was to stop recursing earlier, and the second one was to expand and then prune. Let's explore the second one, because it's a great complement to the first. Our goal is going to be to take the complex tree that we learned by going too far, somehow chop some stuff off and remove it, and simplify it in order to come up with what will be a better tree in terms of true error.

So, here's a kind of motivational picture. You might end up with a complex tree if you don't stop early, and you can imagine trying to come up with some fancy way to stop earlier. But instead of doing that, we just go a little too far and then we backtrack. So we're going to start from this complex tree and we're going to simplify it until we get a reasonable tree that has maybe higher training error, but better true error.

The fundamental question, though, is what makes a tree simpler? Clearly, if you have a super deep tree on the left versus a shallow tree on the right, then in terms of depth it is, to use a technical term that I like, a no-brainer that the tree on the right is simpler than the tree on the left. So, is depth the measure we want to use to quantify the complexity of these trees? It turns out that depth is not a very good metric. The core problem is that some features might have more values than others, and so some features might lead to more splits than others. So here, both trees have the same depth, and the question is which one is simpler? We see this problem all the time. Intuitively, the more leaves you have, the more complex the tree is. So we're going to introduce this function, L(T), which just measures the number of leaves the tree has. The tree on the left here has five leaves, the tree on the right has two leaves, and so we're going to say the tree on the right is simpler.

Now we're getting somewhere interesting: we can start talking about the complexity of decision trees. For example, the super complex tree on the left had six leaves (one, two, three, four, five, six), while the one that's too simple on the right had just one leaf, and somehow I want to find a solution in between. A solution in between allows the tree to go deeper in some branches and stay shallower in others, while trying to keep the number of leaves as small as possible. Just like in other areas of machine learning, we're going to introduce some kind of trade-off between two quantities: how well my tree fits the data and what the complexity of the tree is. So the total cost is going to be a measure of fit plus a measure of complexity. We saw this in regression, we saw this in logistic regression, and we're seeing it again in decision trees. Here, the measure of fit is just the classification error on the training data, the training error, and the measure of complexity is going to be the number of leaves, and we're going to balance the two.
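To make the leaf-counting idea concrete, here is a minimal Python sketch of computing L(T), assuming a simple recursive node representation; the Node class and its field names are hypothetical illustrations, not code from the course.

```python
# A minimal sketch of the leaf count L(T), assuming a simple recursive
# tree representation. The Node class and its fields are hypothetical,
# not code from the course.

class Node:
    def __init__(self, prediction=None, children=None):
        self.prediction = prediction      # class label stored at a leaf
        self.children = children or []    # no children means this node is a leaf

def num_leaves(node):
    """Return L(T), the number of leaves in the subtree rooted at node."""
    if not node.children:
        return 1
    return sum(num_leaves(child) for child in node.children)

# Example: a root that splits once into two leaves has L(T) = 2.
tree = Node(children=[Node(prediction=+1), Node(prediction=-1)])
print(num_leaves(tree))  # -> 2
```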
So more specifically, we're going to use Error(T) to denote the classification error on the training data and L(T) to denote the number of leaves. As we will see, we'll have a magic parameter to trade off between those two quantities. In particular, we're going to use lambda to denote that magic parameter, which we can then fit on a validation set or with cross validation. When lambda equals 0, we're back to the standard decision tree learning process. When lambda is infinite, we have an infinite penalty on the number of leaves, which means we're going to learn a tree that has no decisions in it, just the root. So we're going to learn something like this: a root and then a class. In this case, we just say that y hat for all data points is the majority class. In other words, when lambda is equal to infinity, we just output the majority class; when lambda is equal to 0, we get a crazy deep decision tree; and when lambda is in between, that's the interesting regime, where we balance the fit and the complexity of the tree. More specifically, we're going to go from that complex tree to a simple tree by using a pruning approach, where the pruning is based on this function, which balances Error(T) plus lambda times the number of leaves, L(T). [MUSIC]
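As a rough sketch of how that total cost could be computed, the snippet below combines the training error with the leaf count, reusing num_leaves from the sketch above; the helper names classification_error and total_cost are illustrative, not from the course.

```python
# Sketch of the pruning score Total cost(T) = Error(T) + lambda * L(T),
# reusing num_leaves from the sketch above. Helper names are illustrative.

def classification_error(predictions, labels):
    """Error(T): the fraction of training points the tree gets wrong."""
    mistakes = sum(1 for y_hat, y in zip(predictions, labels) if y_hat != y)
    return mistakes / len(labels)

def total_cost(tree, predictions, labels, lam):
    """Error(T) + lam * L(T): the quantity pruning tries to keep small."""
    return classification_error(predictions, labels) + lam * num_leaves(tree)

# lam = 0          -> only training error matters: the deep tree wins.
# lam very large   -> the leaf count dominates: a single-leaf (majority class) tree wins.
# lam in between   -> trades off training error against the number of leaves.
```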