So far in this specialization, data has always looked pretty beautiful. Sometimes we created simple features, like taking raw text and turning it into counts of words or TF-IDF; sometimes we created more advanced features like polynomials, sines and cosines, and so on. We did feature transformations and feature engineering, but we always observed all of our data: for every data point, we observed the value of every feature. Now, that is rarely true in the real world. Real-world data tends to be pretty messy, and is often fraught with missing data and unobserved values. This is a significant issue we should always be on the lookout for. In today's module, we're going to talk about some of the basic concepts and ideas of what you can do to address missing data in a learning problem.

Approaches to dealing with missing data are better understood in the context of a particular learning algorithm. So for this module, we're going to pick decision trees as a way to better see the impact of missing data, and some of the key approaches to dealing with it. Again, we're going to be dealing with loan data. So an input xi comes in, with the term of the loan, your credit history, and so on, and we push it through this crazy decision tree to make a decision: whether your loan is safe or your loan is risky. That's the output y hat i that we're trying to decide here. As we've discussed thus far, we've assumed that all the data was fully observed, so nothing was missing. For every row of the data, for every feature, we observed, for example, whether the credit was excellent, fair, or poor; whether the term of the loan was three years or five years; whether the income was high or low. And we observed the output, of course: safe or risky. Now, in reality, you may have missing data.
So missing data, for example in this highlighted row, might mean: I know that for this particular loan application the credit was poor and the income was high, and it turned out to be a risky loan, but nobody entered whether the loan was a three-year loan or a five-year loan. And that may be true for multiple data points. The question is, what can we do about this? What impact does it have on our learning algorithm? Missing data can impact a learning algorithm in the training phase, because I don't know how to train a model when I have these question marks where we don't know what the values are. And it can have an impact at prediction time. Let's say I build a great decision tree and put it out in the wild at a bank. Somebody enters an application, but we don't know a particular entry. What prediction do we make?

So let's be more specific. Let's say that we have this tree that I learned from data, and I have a particular input where the credit was poor and the term was five years, but the income was a question mark: I don't know the income of this person. So I try to go down the decision tree. I hit credit first, and credit was poor. Then I hit income, and that was a question mark. The income was unknown, so what do we do next?

We're in a learning problem where we have some training data, we extract some features, and we feed them to a machine learning model, which then uses a quality metric to learn a decision tree T(x). But we're in a setting where some of the data might be missing at training time, and some of the data might be missing at prediction time. What do we do? What we're going to do is modify the machine learning model, the decision tree model, a little bit, to be able to deal with this kind of missing data. Let's see how.
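To make the prediction-time problem concrete, here is a minimal Python sketch (not the course's actual implementation) of walking a decision tree like the one in the lecture. The dict-based tree structure, the `predict` function, and the exact feature values are assumptions for illustration; the point is simply that a naive traversal gets stuck when the feature it needs to split on is missing.

```python
def predict(node, x):
    """Walk a decision tree stored as nested dicts.

    Returns 'unknown' when the feature needed at a split is missing,
    which is exactly the stuck state described in the lecture.
    """
    if not isinstance(node, dict):       # reached a leaf: 'safe' or 'risky'
        return node
    value = x.get(node["feature"])       # look up the splitting feature
    if value is None:                    # missing data: traversal is stuck
        return "unknown"
    return predict(node["children"][value], x)

# Hypothetical tree matching the lecture example: split on credit first,
# and for poor credit, split on income.
tree = {"feature": "credit",
        "children": {"excellent": "safe",
                     "fair": "safe",
                     "poor": {"feature": "income",
                              "children": {"high": "safe",
                                           "low": "risky"}}}}

print(predict(tree, {"credit": "poor", "income": "high"}))   # → safe
print(predict(tree, {"credit": "poor", "term": "5 years"}))  # income missing → unknown
```

With credit poor and income observed, the traversal reaches a leaf; with income missing, it has no branch to follow, which is the gap the rest of this module's techniques are meant to fill.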