[MUSIC] With all that in mind, let's dig in and observe how overfitting comes about in the context of decision trees. When we start learning a decision tree, we learn a decision stump, which gives a very simple boundary between the data, for example a single vertical or horizontal line. In this case, the decision is with respect to x(1): is it less than -0.07 or greater than -0.07? If it's less than -0.07, we predict -1, which corresponds to this side of the decision boundary, but if it's greater than -0.07, we predict +1 and fall on the right side of the decision boundary. So that's a very simple decision boundary from just a depth-1 decision tree, or decision stump. But as we increase the depth, the decision boundaries become more and more complex, until we end up with a super crazy one here where the depth is 10. And that depth-10 decision boundary has 0.00 training error. So the training error goes down from 0.22, which is relatively high for the decision stump, through 0.13 and 0.10 for the depth-2 and depth-3 decision trees, which looked okay, all the way to this crazy decision tree of depth 10, which has zero training error. That should be a big warning sign for you, and now we should be making a sad face. We've seen that with decision trees, as we increase the depth, the training error goes down until it can reach zero, and in this case, with the decision tree of depth 10, it actually does. So we could say the decision tree of depth 10 is great: it has zero training error, so it's a perfect decision tree. But in reality, it's not a perfect decision tree. As we know, even though the training error is zero, the true error can shoot up, so this can be a highly overfit decision tree.
It's good to take a step back and understand a little better why the training error of decision trees tends to go down so quickly with depth. Let's take the simple example where we're just learning a decision stump. We have 40 data points, just like we've been using: 22 of them were safe loans, 18 were risky. And we chose first to split on credit. Now, why did we choose to split on credit first? Why was that the first feature we chose? The reason is that it improved the training error the most, taking it from 0.45 down to 0.20, so that was a good first split to make. If we go back and review the algorithm for choosing the best feature to split on, we try every single possible feature and pick the one that decreases the training error the most. So at every step of the way, we're adding the feature that decreases the training error, and eventually we'll drive the training error to zero. Unless, of course, we get to a point where we can't drive the training error down any further because we've run out of features to split on and we have positive points sitting on top of negative points, but that's a side note. The most important thing to remember is that the training error keeps going down, down, down, down.
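To make that greedy rule concrete, here is a minimal Python sketch of choosing the first split: for each candidate feature, build a one-level stump that predicts the majority class in each branch, measure its classification error, and keep the feature with the lowest error. The tiny loan-style records and the feature names below are hypothetical, made up for illustration; only the idea of splitting on credit comes from the lecture, and the printed error values are not the 0.45 and 0.20 quoted above.

```python
from collections import Counter

def stump_error(data, feature, label='safe'):
    """Classification error of a one-level split (decision stump) on `feature`."""
    mistakes = 0
    for value in set(row[feature] for row in data):
        branch = [row[label] for row in data if row[feature] == value]
        _, majority_count = Counter(branch).most_common(1)[0]
        mistakes += len(branch) - majority_count  # points the majority vote gets wrong
    return mistakes / len(data)

# Hypothetical loan records (not the lecture's 40-point dataset): each has
# three categorical features and a label of +1 (safe) or -1 (risky).
loans = [
    {'credit': 'excellent', 'term': '3yr', 'income': 'high', 'safe': +1},
    {'credit': 'excellent', 'term': '5yr', 'income': 'low',  'safe': +1},
    {'credit': 'fair',      'term': '3yr', 'income': 'high', 'safe': +1},
    {'credit': 'fair',      'term': '5yr', 'income': 'low',  'safe': -1},
    {'credit': 'poor',      'term': '3yr', 'income': 'high', 'safe': -1},
    {'credit': 'poor',      'term': '5yr', 'income': 'low',  'safe': -1},
]

# Greedy rule: evaluate every feature and keep the one with the lowest error.
errors = {f: stump_error(loans, f) for f in ('credit', 'term', 'income')}
print(errors)                                         # credit ~0.17, term ~0.33, income ~0.33
print('first split:', min(errors, key=errors.get))    # -> 'credit'
```

Repeating this rule inside each branch is what keeps pushing the training error down as the tree gets deeper.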
And that steady decrease as we increase the depth is what leads to very low training error, which often leads to very complex trees that are very prone to overfitting. Here's a real-world loan dataset where we actually observe that big, bad overfitting problem. If we take the depth of the tree and push it all the way to depth 18, we see that the training error, the blue line, has gone down a lot, to about 8%, which is extremely low. However, if you look at the validation set error, that was not so good: it's around 39%, so there's a big gap between the two, which we characterize as a form of overfitting. If somehow you were able to pick the best depth for the decision tree, which in this case is depth 7, you'd notice that the validation error is just under 35%. In other words, if I could pick the right depth, I would get just under 35% validation error, but if I let the tree grow until the training error is very low, I get a validation error of 39%. So going all the way is a bad idea; gotta stop a little earlier. [MUSIC]
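As a rough sketch of that depth-versus-error comparison, the following Python sweeps the maximum depth, tracks training and validation error, and picks the depth with the lowest validation error. It uses scikit-learn and a synthetic dataset as assumptions; the lecture's actual loan data and tree implementation are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data, split into training and validation sets.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3,
                                                      random_state=0)

results = []
for depth in range(1, 19):                    # depths 1 through 18, as in the lecture's plot
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)
    valid_err = 1 - tree.score(X_valid, y_valid)
    results.append((depth, train_err, valid_err))
    print(f"depth={depth:2d}  train_err={train_err:.3f}  valid_err={valid_err:.3f}")

# Training error keeps falling with depth; validation error bottoms out at some
# intermediate depth and then climbs back up -- that intermediate depth is the one to keep.
best_depth = min(results, key=lambda r: r[2])[0]
print("best depth by validation error:", best_depth)
```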