[MUSIC] At the core of the decision tree algorithm is choosing the feature to split on, and that's the place where we're going to modify it to also choose what to do with that missing data. Arguably the most important question in decision tree learning is figuring out what feature to split on. So now, when we look at what feature to split on, we're not just going to decide what feature to split on, we're also going to decide which branch to take when we see a missing value. Conceptually, it's just a very simple modification of the algorithm we described in the previous module; we're just going to add this extra search to figure out where to send that missing data.

Just like with greedy learning of decision trees, we can focus on learning a decision stump to understand what's going on here. So suppose we chose to split on credit. Credit can be excellent, fair, or poor, and the branch here on the right is the one where all data points have credit equal to poor. Now, we might decide that the missing data is going to go into that branch. So we're going to take all the data where credit is missing, together with the data where credit equals poor, and feed it into that node. That's what happens with categorical data. When we have numeric data, we use those threshold splits that we discussed in the previous module. In this case, we might threshold on income less than $50,000 versus greater than $50,000, and we might decide that data where the income is unknown gets fed into the same node as the data points whose income was less than $50,000. Those are the decisions that we're going to try to make inside of our learning algorithm.

Now, we have to decide where this missing data is going to go. Let's say we're considering splitting on credit. If we split on credit, what should we do with the missing data? Should we put it with the people with excellent credit, with the people with fair credit, or with the people with poor credit? I don't know what's best, but we have a principle for doing this. The principle says: choose the branch that leads to the lowest classification error, just like we did before.

Let's say we're considering splitting on credit, and we'll consider assigning the missing data to the branch where credit is excellent. Let's see what happens. In this example, we have three data points with missing data, these three here, and we're considering assigning those data points to the branch where credit is excellent. So we have two kinds of data. For the observed data, in the excellent branch we had nine safe loans and zero risky; in the fair branch, we had eight safe loans and four risky; and in the poor branch, we had four safe loans and twelve risky. Now we have the three extra data points, and in these three extra data points, as you can see, there are two risky and one safe. So we're thinking about adding these two risky and one safe to the excellent branch. What are the grand totals here? The grand totals are ten safe and two risky inside the excellent branch, and the same totals as before for the other two branches, because we're not adding any data to them. So what's the overall error of the resulting decision tree? Well, for the excellent branch, we're going to predict safe, so we're going to make two mistakes. For the fair branch, we're going to predict safe, so we're going to make four mistakes. And for the poor branch, we're going to predict risky, and we're going to make four mistakes. (A small sketch of this tally is shown below.)
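Just to make that tally concrete, here is a minimal sketch in Python of the error computation we just walked through. This is not the course's actual code: the counts come from the example above, and the helper name split_error is purely illustrative.

```python
# Minimal sketch of the "where do missing values go?" error computation.
# observed: branch -> [safe, risky] counts for points where credit is observed.
observed = {
    "excellent": [9, 0],
    "fair":      [8, 4],
    "poor":      [4, 12],
}
missing = [1, 2]  # [safe, risky] counts among the three points with missing credit


def split_error(observed, missing, branch_for_missing):
    """Classification error of the stump if missing values are sent to one branch."""
    mistakes, total = 0, 0
    for branch, (safe, risky) in observed.items():
        if branch == branch_for_missing:
            # Add the missing data points to this branch only.
            safe, risky = safe + missing[0], risky + missing[1]
        # The branch predicts the majority class, so the minority count is the mistakes.
        mistakes += min(safe, risky)
        total += safe + risky
    return mistakes / total


print(split_error(observed, missing, "excellent"))  # 10 mistakes / 40 points = 0.25
```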
So overall, we're going to make ten mistakes, and our total was forty data points, so the total error here is 10/40 = 0.25, one quarter. Now we do the same for every single option, for every single branch. So we tried putting the unknown data with the excellent branch and got an error of 0.25. We tried putting it with the fair branch, and we get a 0.25 error as well. But if we try putting it with the last branch here, the poor branch, we get an error of 0.225. That's a lower error, so this is the winner. So the best thing to do would be to assign the unknowns to the poor branch, to credit equals poor, which is exactly what we did in the earlier tree.

Here is the modification of the tree-splitting algorithm to take missing data into account. It gets a little complicated, but it's basically the simple idea we just described. We're leaving it here for reference, if you choose to implement it. But overall, the fundamental concept is that you can embed the handling of missing data inside of the learning algorithm to get better overall performance. [MUSIC]
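And here is the rest of that little sketch: try every branch as the destination for the missing data, compute the error for each choice, and keep the lowest. Again, this builds on the hypothetical split_error helper above and is only an illustration, not the reference implementation mentioned in the lecture.

```python
# Evaluate every possible branch for the missing data and keep the best one.
for branch in observed:
    print(branch, split_error(observed, missing, branch))
# excellent 0.25, fair 0.25, poor 0.225

best_branch = min(observed, key=lambda b: split_error(observed, missing, b))
print("assign missing values to:", best_branch)  # -> poor
```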