[MUSIC] So far in this specialization, data has always looked pretty beautiful. We sometimes did look at little features in the data, like taking raw text and turning it into counts of words, the TF-IDF. Sometimes we created more advanced features like polynomials, sines and cosines, and so on. We did feature transformations, feature engineering, but we always observed all of our data, so for every feature we always observed every possible value for every data point.

Now, that is rarely true in the real world. Real-world data tends to be pretty messy, and often is fraught with missing data and unobserved values. And this is a significant issue we should always be on the lookout for. In today's module, we're going to talk about some of the basic concepts and ideas of what you can do to try to address missing data in a learning problem.

Approaches to dealing with missing data are better understood in the context of a particular learning algorithm. So for this module, we're going to pick decision trees as a way to better see the impact of missing data, and some of the key approaches to dealing with it. Again, we're going to be dealing with loan data.
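The two text featurizations mentioned above, word counts and TF-IDF, can be sketched in a few lines. This is a minimal illustration on a made-up toy corpus, using the plain log-IDF weighting; real libraries (and the earlier courses in the specialization) may use a slightly different smoothing or normalization.

```python
import math

# Hypothetical toy corpus; the documents are illustrative only.
corpus = [
    "the loan was safe",
    "the loan was risky",
    "risky loan",
]

def word_counts(doc):
    """Turn raw text into counts of words (bag-of-words)."""
    counts = {}
    for word in doc.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def tf_idf(doc, corpus):
    """Weight each term's count by log(N / doc-frequency)."""
    counts = word_counts(doc)
    n_docs = len(corpus)
    scores = {}
    for word, tf in counts.items():
        df = sum(1 for d in corpus if word in d.split())
        scores[word] = tf * math.log(n_docs / df)
    return scores

scores = tf_idf(corpus[0], corpus)
# "loan" appears in every document, so its TF-IDF score is exactly 0;
# rarer words like "safe" get a larger weight than common ones like "the".
```

The point of the IDF factor is visible even on this tiny corpus: words that appear everywhere carry no signal and are zeroed out, while rare words are up-weighted.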
So the input x_i is coming in, for example the term of the loan, your credit history, and so on. We're going to push it through this crazy decision tree to make a decision: whether your loan is safe or your loan is risky. That is going to be the output y hat i that we're trying to decide here.

As we discussed thus far, we've assumed that all the data was fully observed, so nothing was missing. So for every row of the data, for every feature, we observed, for example, whether the credit was excellent, fair, or poor; whether the term of the loan was three years or five years; whether the income was high or low. And we always observed the output, of course, say safe or risky.

Now, in reality you may have missing data. So missing data, for example in this highlighted row, might say: I know that for this particular loan application the credit was poor, the income was high, and it turned out to be a risky loan, but nobody entered whether the loan was a three-year loan or a five-year loan. And that may be true for multiple data points. And the question is, what can we do about this? What impact does it have on our learning algorithm? What happens? Well, missing data can impact a learning algorithm in the training phase, because I don't know how to train a model when you have these question marks where we don't know what the values are.
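The highlighted row above can be represented directly: a sketch of the toy loan table, with `None` standing in for the question mark, plus a small helper that counts how many entries are missing per feature. All rows and values here are illustrative, mirroring the lecture's example.

```python
# Hypothetical toy loan table; None marks a missing entry (here,
# nobody recorded the term for the poor-credit application).
loans = [
    {"credit": "excellent", "term": "3 years", "income": "high", "y": "safe"},
    {"credit": "fair",      "term": "5 years", "income": "low",  "y": "risky"},
    {"credit": "poor",      "term": None,      "income": "high", "y": "risky"},
]

def missing_report(rows):
    """Count missing (None) values per feature across the dataset."""
    report = {}
    for row in rows:
        for feature, value in row.items():
            if value is None:
                report[feature] = report.get(feature, 0) + 1
    return report

report = missing_report(loans)  # {"term": 1}
```

A pass like this is often the first step in practice: before deciding how to handle missing data, measure how much of it there is and which features it affects.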
And it can have an impact at prediction time. Let's say I build a great decision tree and put it out in the wild at a bank. Somebody puts an application in there, but we don't know a particular entry. What predictions do we make?

So let's be more specific. Let's say that we have this tree that I learned from data, and I have a particular input where the credit was poor and the term was five years, but the income was a question mark; I don't know the income of this person. So I try to go down the decision tree. I hit credit first; credit was poor. I hit income; that was a question mark. It was unknown, so what do we do next?

So we're in a learning problem where you have some training data, we extract some features, feed a machine learning model, which then uses a quality metric to learn a decision tree T(x). But we're in a setting where some of the data might be missing at training time, and some of the data might be missing at prediction time. What do we do? What we're going to do is modify the machine learning model a little bit, the decision tree model a little bit, to be able to deal with this kind of missing data. Let's see how. [MUSIC]
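To make the prediction-time problem concrete, here is a sketch of one common way to modify the tree: give each internal node a default branch that missing values follow. This is only one of the possible strategies (the module goes on to cover the actual approaches), and the tree, feature names, and default choices below are all made up for illustration.

```python
# A decision tree as nested dicts. Each internal node tests one feature
# and, in this sketch, carries a "default" branch that inputs with a
# missing (None) value follow. Leaves are class labels.
tree = {
    "feature": "credit",
    "branches": {
        "excellent": "safe",
        "fair": {
            "feature": "term",
            "branches": {"3 years": "safe", "5 years": "risky"},
            "default": "5 years",
        },
        "poor": {
            "feature": "income",
            "branches": {"high": "safe", "low": "risky"},
            "default": "low",
        },
    },
    "default": "fair",
}

def predict(node, x):
    """Walk the tree; route missing features down the node's default branch."""
    while isinstance(node, dict):
        value = x.get(node["feature"])
        if value is None:                 # missing at prediction time
            value = node["default"]       # follow the default branch
        node = node["branches"][value]
    return node

# The example from the lecture: poor credit, five-year term, unknown income.
label = predict(tree, {"credit": "poor", "term": "5 years", "income": None})
```

Without the `default` entries, the traversal above would have no edge to follow when it reaches the income node, which is exactly the "what do we do next?" problem in the lecture; the default branch is one simple modification that lets the tree always produce an answer.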