[MUSIC] Next, we're going to explore a different approach for finding simpler decision trees. We're going to grow them all the way and then we're going to try pruning. The way we're going to prune is a bit like the stopping criteria: we prune things that don't improve our error enough. So, we have two ways to try to find simpler trees. One way was to stop recursing earlier, and the second one was to expand and then prune. Let's explore the second one, because it's a great complement to the first. Our goal is going to be to take the complex tree that we learned by going too far, somehow chop some stuff off and remove it, and simplify it in order to come up with what will be a better tree in terms of true error.

So, here's a kind of motivational picture. You might end up with a complex tree if you don't stop early, and you can imagine trying to come up with some fancy way to stop earlier. But instead of doing that, we just go a little too far and then we backtrack. So we're going to start from this complex tree and we're going to simplify it until we get a reasonable tree that has maybe higher training error, but better true error.

The fundamental question, though, is what makes a tree simpler? Clearly, if you have a super deep tree on the left versus a shallow tree on the right, then in terms of depth it is, to use a technical term that I like, a no-brainer that the tree on the right is simpler than the tree on the left. So, is depth the measure we want to use to quantify the complexity of these trees? It turns out that depth is not a very good metric. The core problem is that some features might have more values than others, and so some features might lead to more splits than others. So here, both trees have the same depth, and the question is which one is simpler? We see this problem all the time. Intuitively, the more leaves you have, the more complex the tree is. So we're going to introduce this function, L(T), which just measures the number of leaves the tree has. The tree on the left here has five leaves, the tree on the right has two leaves, and so we're going to say the tree on the right is simpler.

Now we're getting somewhere interesting: we can start talking about the complexity of decision trees. For example, the super complex tree on the left had six leaves (one, two, three, four, five, six), while the one that's too simple on the right had just one leaf, and somehow I want to find a solution in between. A solution in between allows the tree to go deeper in some branches and stay shallower in others, while trying to keep the number of leaves as small as possible. Just like in other areas of machine learning, we're going to introduce some kind of trade-off between two quantities: how well my tree fits the data and what the complexity of the tree is. So the total cost is going to be a measure of fit plus a measure of complexity. We saw this in regression, we saw this in logistic regression, and we're seeing it again in decision trees. Here, the measure of fit is just the classification error on the training data, the training error, and the measure of complexity is going to be the number of leaves, and we're going to balance the two.
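To make the leaf-counting idea concrete, here is a minimal Python sketch of computing L(T), assuming a simple recursive node representation; the Node class and its field names are hypothetical illustrations, not code from the course.

```python
# A minimal sketch of the leaf count L(T), assuming a simple recursive
# tree representation. The Node class and its fields are hypothetical,
# not code from the course.

class Node:
    def __init__(self, prediction=None, children=None):
        self.prediction = prediction      # class label stored at a leaf
        self.children = children or []    # no children means this node is a leaf

def num_leaves(node):
    """Return L(T), the number of leaves in the subtree rooted at node."""
    if not node.children:
        return 1
    return sum(num_leaves(child) for child in node.children)

# Example: a root that splits once into two leaves has L(T) = 2.
tree = Node(children=[Node(prediction=+1), Node(prediction=-1)])
print(num_leaves(tree))  # -> 2
```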
So more specifically, we're going to use Error(T) to denote the classification error on the training data and L(T) to denote the number of leaves. As we will see, we'll have a magic parameter to trade off between those two quantities. In particular, we're going to use lambda to denote that magic parameter, which we can then fit on a validation set or with cross validation. When lambda equals 0, we're back to the standard decision tree learning process. When lambda is infinite, we have an infinite penalty on the number of leaves, which means we're going to learn a tree that has no decisions in it, just the root. So we're going to learn something like this: a root and then a class. In this case, we just say that y hat for all data points is the majority class. In other words, when lambda is equal to infinity, we just output the majority class; when lambda is equal to 0, we get a crazy deep decision tree; and when lambda is in between, that's the interesting regime, where we balance the fit and the complexity of the tree. More specifically, we're going to go from that complex tree to a simple tree by using a pruning approach, where the pruning is based on this function, which balances Error(T) plus lambda times the number of leaves, L(T). [MUSIC]
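As a rough sketch of how that total cost could be computed, the snippet below combines the training error with the leaf count, reusing num_leaves from the sketch above; the helper names classification_error and total_cost are illustrative, not from the course.

```python
# Sketch of the pruning score Total cost(T) = Error(T) + lambda * L(T),
# reusing num_leaves from the sketch above. Helper names are illustrative.

def classification_error(predictions, labels):
    """Error(T): the fraction of training points the tree gets wrong."""
    mistakes = sum(1 for y_hat, y in zip(predictions, labels) if y_hat != y)
    return mistakes / len(labels)

def total_cost(tree, predictions, labels, lam):
    """Error(T) + lam * L(T): the quantity pruning tries to keep small."""
    return classification_error(predictions, labels) + lam * num_leaves(tree)

# lam = 0          -> only training error matters: the deep tree wins.
# lam very large   -> the leaf count dominates: a single-leaf (majority class) tree wins.
# lam in between   -> trades off training error against the number of leaves.
```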