[MUSIC] Often, the best way to prevent overfitting in decision trees is to use pruning, where you grow the tree to be much bigger than you need and then chop off the parts that turn out to be less important. Pruning is very important if you're implementing a real-world decision tree algorithm. However, the details can get a little tedious. It's not too mathematical, just a lot of little things to get right, a lot of Ts to cross and Is to dot. So we've made this section optional. You're welcome to watch it; there's not a lot of math, just a lot of detail. I'd especially recommend paying attention if you're implementing your own decision tree learner from scratch. If not, just know what pruning is, which we just talked about, and know that most good implementations out there will use some kind of pruning.

We discussed the recursion where we build the decision tree from the top down, and we discussed two types of stopping conditions: the stopping conditions we should always apply, from the previous module, and the three early stopping conditions from this module. Now let's explore some of the challenges that come about with these early stopping conditions. Early stopping condition one was: pick a max depth, and stop recursing once you reach that max depth. The question is, how do you actually pick the max depth? We said you should use cross validation or a validation set, but there are challenges: you have to set data aside for that, you have to rerun your algorithm many times, and you might not always end up with the answer you'd like. Plus, you want some parts of the tree to grow more than others, so it becomes complicated to set this parameter.

Now, early stopping condition two is the most dangerous one. It says: stop if the training error stops decreasing, that is, if the classification error on the training data stops going down and flattens out. However, it turns out that in practice the curve might not be as beautiful as what I drew. It might not go down smoothly; it might jump around and do some weird stuff. Let's see why that can happen.

As we discussed, early stopping condition two, stop when the error doesn't decrease, can be dangerous. To show how it's dangerous, we need a counterexample, and we'll use the counterexample for everything, which, if you took Course 1, you know is xor. So here's our data, and it corresponds to the xor of the first input and the second input: when both are false, the output is false; when one is false and the other is true, the output is true, which happens in the two middle rows; and when both are true, the output is false. If you ever need a counterexample for anything, just say xor. In fact, if someone calls you in the middle of the night and asks, what is the counterexample for this crazy problem I'm having? You say, xor.

So what happens with xor when we just look at the root of the decision tree? Well, we have two rows with output true and two rows with output false. So what is the error of this decision tree if we stop at the top level? You just choose one of the two classes, because they're tied; let's say we choose true, because we're optimists, we like to think positively. So we're going to say that the output here is +1, which means we're making two mistakes. Two mistakes out of four data points, so our error is 0.5, which is bad.
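To make that arithmetic concrete, here is a minimal Python sketch, not code from the course: it builds the four-row xor dataset and computes the classification error of a majority-class prediction at the root. The row encoding and the `majority_error` helper are illustrative names I'm assuming, not anything from the course materials.

```python
# Minimal sketch (illustrative only): the xor dataset, with True/False
# outputs encoded as +1/-1, and the error of predicting the majority
# class at the root node.
data = [
    {"x1": False, "x2": False, "y": -1},
    {"x1": False, "x2": True,  "y": +1},
    {"x1": True,  "x2": False, "y": +1},
    {"x1": True,  "x2": True,  "y": -1},
]

def majority_error(rows):
    """Classification error if we predict the majority class for these rows."""
    positives = sum(1 for r in rows if r["y"] == +1)
    negatives = len(rows) - positives
    # The majority prediction gets the minority count wrong.
    return min(positives, negatives) / len(rows)

print(majority_error(data))  # 0.5: two mistakes out of four data points
```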
That means you're as good as random: an error of 0.5. Okay, let's see what happens if we split on x[1]. If we split on x[1], you're actually splitting the table like this, and you see that, because xor is so beautiful, each of the resulting nodes is in the same situation. For example, in the branch where x[1] is true, you have one y true and one y false. So let's suppose you predict +1 for this side, and also +1 for the side where x[1] is false, because it's a tie either way. So what's the error? Well, mistakes were made: one here and one here, so we made 1 + 1 = 2 mistakes out of 4, which is 0.5. So it turns out that splitting on x[1] does not help.

Okay, and what happens when you try to split on x[2]? If I split on x[2], I end up with the same situation, except I change this label here from x[1] to x[2]. Again, we're in exactly the same place: we output +1 on each of these outcomes and make the same number of mistakes, one on each side, for a total error of 0.5. So, all in all, neither splitting on x[1] nor splitting on x[2] helps me reduce the error. Both splits give you the same error, so the question is, should I split now? If neither feature improves the error and you're using early stopping condition two, then you say no, don't take either split. You just stop splitting the decision tree, output true at the very root, make two mistakes, and the error you end up with is 0.5. So, [LAUGH] this is why xor is a counterexample to everything.

What would happen if you didn't use early stopping condition two and you just kept going? If you kept going, you'd get the tree shown here, and each of its leaves makes zero mistakes. The leftmost leaf outputs false, or -1, and makes zero mistakes; the next one makes zero mistakes; the third makes zero mistakes; and the last one makes zero mistakes. So if you kept going, if you didn't use early stopping condition two, you would get a training error of zero. In this case, there's a huge gap between a training error of 0.5, which is as good as random, and a training error of zero, which is perfect. Early stopping condition two created this arbitrarily large gap between the best you could have done and what you end up outputting.

So we saw early stopping condition two: stop recursing if the training error does not decrease enough. It seems like a reasonable heuristic, but it has some dangerous pitfalls, because it's too short-sighted and doesn't explore branches far enough. It's generally a good principle, but if you apply it top down, you might get yourself into trouble. So what can you do about this? This is where pruning comes in: it uses a similar criterion, but applies it bottom up. [MUSIC]
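To check the rest of the numbers from this example, here is a continuation of the earlier sketch; it reuses the `data` list and `majority_error` helper defined there, and `error_after_split` is again just an assumed, illustrative helper. It shows that a single split on either feature leaves the training error at 0.5, which is why early stopping condition two gives up at the root, while splitting on both features drives the training error to zero.

```python
def error_after_split(rows, feature):
    """Training error if we split once on `feature` and predict the
    majority class in each of the two branches."""
    left = [r for r in rows if not r[feature]]
    right = [r for r in rows if r[feature]]
    mistakes = (majority_error(left) * len(left) +
                majority_error(right) * len(right))
    return mistakes / len(rows)

print(error_after_split(data, "x1"))  # 0.5: no better than the root
print(error_after_split(data, "x2"))  # 0.5: no better either
# Early stopping condition two halts here, leaving training error at 0.5.

# Growing the full tree (split on x1, then on x2) separates the rows perfectly.
leaves = {}
for r in data:
    leaves.setdefault((r["x1"], r["x2"]), []).append(r)
full_tree_mistakes = sum(majority_error(rows) * len(rows) for rows in leaves.values())
print(full_tree_mistakes / len(data))  # 0.0: zero mistakes at the leaves
```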