1 00:00:00,042 --> 00:00:04,689 [MUSIC] 2 00:00:04,689 --> 00:00:08,275 In my opinion, the best thing you can do to deal with missing data is 3 00:00:08,275 --> 00:00:12,644 to make your machine learning algorithm robust to missing data. 4 00:00:13,715 --> 00:00:16,315 In other words, make sure that the algorithm can adapt to missing data. 5 00:00:16,315 --> 00:00:19,745 And that's exactly what we're going to do in this particular module. 6 00:00:19,745 --> 00:00:23,195 And we're going to do it in the context of decision trees, 7 00:00:23,195 --> 00:00:26,685 because with a simple modification, decision trees can handle missing data. 8 00:00:27,685 --> 00:00:30,445 So, how are we going to deal with missing data? 9 00:00:30,445 --> 00:00:33,180 Let us try to understand a little better 10 00:00:33,180 --> 00:00:35,330 what happens in the context of decision trees. 11 00:00:35,330 --> 00:00:40,840 So I have this input xi where the credit was poor and income was unknown. 12 00:00:40,840 --> 00:00:45,920 I go down the decision tree, I hit credit poor, so 13 00:00:45,920 --> 00:00:51,270 I take the branch here, poor. 14 00:00:51,270 --> 00:00:54,075 And then I hit income, but income was unknown. 15 00:00:54,075 --> 00:00:56,250 Income was a question mark. 16 00:00:56,250 --> 00:00:57,050 So what do I do then? 17 00:00:57,050 --> 00:01:02,030 So what we're going to do is assign a branch to follow 18 00:01:02,030 --> 00:01:05,180 when the data is unknown, when the value is unknown. 19 00:01:05,180 --> 00:01:08,870 And in this case, I'm going to associate all unknown data, for 20 00:01:08,870 --> 00:01:13,130 example, with the branch where income was low. 21 00:01:13,130 --> 00:01:19,130 So if your income was unknown, take the low branch, and 22 00:01:19,130 --> 00:01:22,890 from there I'm just going to predict that this loan was risky.
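To make the idea of "a branch to follow when the value is unknown" concrete, here is a minimal Python sketch of a single decision node. The dictionary layout, the `choose_branch` helper, and the use of `None` to stand for a question mark are all assumptions made for illustration; only the rule that unknown income follows the low branch comes from the lecture.

```python
# Hypothetical representation of one decision node: the feature it tests,
# its named branches, and the branch that absorbs unknown values.
income_node = {
    "feature": "income",
    "branches": ["high", "low"],
    "unknown_branch": "low",  # from the lecture: unknown income follows "low"
}


def choose_branch(node, x):
    """Return the name of the branch that input x follows at this node."""
    value = x.get(node["feature"])  # None stands for "?" (assumed encoding)
    if value is None:
        return node["unknown_branch"]
    return value


# The input from the lecture: credit is poor, income is unknown.
x_i = {"credit": "poor", "income": None}
print(choose_branch(income_node, x_i))  # low
```

A known value simply follows its own branch, so the only change to ordinary traversal is the one extra check for a missing value.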
23 00:01:22,890 --> 00:01:28,280 So in other words, in our decision tree we're going to make explicit decisions 24 00:01:28,280 --> 00:01:32,770 as to where question marks, that is, missing or unknown values, will go. 25 00:01:34,560 --> 00:01:38,930 We've introduced a decision about what happens when we hit an unknown value 26 00:01:38,930 --> 00:01:40,810 at this point of the decision tree. 27 00:01:40,810 --> 00:01:43,940 But we may have unknown values at any point of the decision tree, so we're going to 28 00:01:43,940 --> 00:01:48,990 make decisions everywhere as to what should happen when we see unknown values. 29 00:01:48,990 --> 00:01:53,960 So for every decision node, I choose one of its branches to absorb 30 00:01:54,980 --> 00:01:59,730 the unknown values. And notice that those are going to be specific 31 00:01:59,730 --> 00:02:03,060 decisions we're going to make, associated with every diamond node, or 32 00:02:03,060 --> 00:02:08,240 every decision node, in the decision tree. So for credit, we're going to have 33 00:02:08,240 --> 00:02:13,335 the unknowns, or question marks, going to the fair branch. When 34 00:02:13,335 --> 00:02:17,600 credit is poor 35 00:02:17,600 --> 00:02:23,300 and we look at income, the unknowns are going to go down the low branch. 36 00:02:23,300 --> 00:02:27,980 When credit equals poor and income is high, we look at term and, 37 00:02:27,980 --> 00:02:31,830 in this case, we're going to put the unknowns, 38 00:02:31,830 --> 00:02:37,430 or the question marks, in the same branch as term equals five years. 39 00:02:37,430 --> 00:02:40,720 However, take note of this: 40 00:02:40,720 --> 00:02:43,530 we might choose to say that when credit is fair and 41 00:02:43,530 --> 00:02:49,860 we look at term, the unknowns in this case will go to the term equals three years branch.
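The per-node choices just described can be written down as a small tree in Python. This is only a sketch under stated assumptions: the `Node` class, the `predict` function, and the `None` encoding for question marks are hypothetical, and the leaf labels (for example, that a three-year term leads to a risky prediction) are assumed for illustration. The `unknown_branch` settings follow the lecture, including the two term nodes that send unknowns to different branches.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class Node:
    feature: str
    children: Dict[str, Union["Node", str]]  # child node or leaf label
    unknown_branch: str                      # branch that absorbs "?" values


# Leaf labels are assumptions; the unknown_branch choices follow the lecture.
term_after_fair = Node(
    "term", {"3 years": "risky", "5 years": "safe"}, unknown_branch="3 years"
)
term_after_high = Node(
    "term", {"3 years": "risky", "5 years": "safe"}, unknown_branch="5 years"
)
income = Node(
    "income", {"high": term_after_high, "low": "risky"}, unknown_branch="low"
)
tree = Node(
    "credit",
    {"excellent": "safe", "fair": term_after_fair, "poor": income},
    unknown_branch="fair",
)


def predict(node: Union[Node, str], x: Dict[str, Optional[str]]) -> str:
    """Walk down the tree; at each node an unknown value takes that node's
    designated branch."""
    while isinstance(node, Node):
        value = x.get(node.feature)  # None stands for "?"
        if value is None:
            value = node.unknown_branch
        node = node.children[value]
    return node


# Poor credit, unknown income: follows the low branch.
print(predict(tree, {"credit": "poor", "income": None, "term": "5 years"}))  # risky
# Fair credit, unknown term: this term node sends unknowns to "3 years".
print(predict(tree, {"credit": "fair", "income": "high", "term": None}))  # risky
```

Storing `unknown_branch` on each node, rather than once per feature, is what lets the same feature treat unknowns differently in different parts of the tree.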
42 00:02:49,860 --> 00:02:55,080 So the decision that we make about where unknowns go 43 00:02:55,080 --> 00:02:58,720 does not need to be the same in different parts of the decision tree. 44 00:02:59,820 --> 00:03:01,871 So we make different decisions here 45 00:03:04,529 --> 00:03:09,130 about the question marks, or unknowns. 46 00:03:10,300 --> 00:03:13,043 And that's the beauty of the approach that we're describing, 47 00:03:13,043 --> 00:03:17,158 because we're going to optimize: just like we optimize the decision tree itself, 48 00:03:17,158 --> 00:03:19,443 we're going to optimize where the unknowns will go, 49 00:03:19,443 --> 00:03:21,558 in order to minimize the classification error. 50 00:03:21,558 --> 00:03:27,690 Now, if we've learned a tree like this and we're given an input, 51 00:03:27,690 --> 00:03:30,930 for example, where the credit is unknown and the income was high, 52 00:03:30,930 --> 00:03:34,700 and the term was 5 years, we can go down this tree and traverse. 53 00:03:34,700 --> 00:03:38,060 Well, credit was unknown, so we take the fair branch; the term was 5 years, so 54 00:03:38,060 --> 00:03:39,940 I'm going to predict it's a safe loan. 55 00:03:42,180 --> 00:03:45,720 While if I have another input where the credit is poor, 56 00:03:45,720 --> 00:03:50,980 the income is high, and the term is a question mark, we take a different 57 00:03:50,980 --> 00:03:55,510 branch of the decision tree: we go down credit poor, income high, and 58 00:03:57,190 --> 00:04:02,410 the term here, unknown, goes down the same branch as the five-year loan, and 59 00:04:02,410 --> 00:04:05,680 again we'll predict safe. 60 00:04:05,680 --> 00:04:09,990 In general, approaches that explicitly handle missing data 61 00:04:09,990 --> 00:04:13,010 in the algorithm itself can be quite helpful. 62 00:04:13,010 --> 00:04:18,150 They can help address missing data at training time and prediction time. 63 00:04:18,150 --> 00:04:20,990 And they can make more accurate predictions in general.
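The idea of choosing where the unknowns go so as to minimize classification error can be sketched for a single split. What follows is a hedged illustration, not necessarily the exact procedure used in the course: the function names are hypothetical, the toy data is made up, each branch is assumed to predict its majority label, and `None` is assumed to mark an unknown value. For each candidate branch, the sketch sends all unknown rows down that branch, counts the mistakes, and keeps the candidate with the lowest error.

```python
from collections import Counter


def errors_with_majority(labels):
    """Mistakes made when predicting the majority label for these points."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]


def best_unknown_branch(rows, feature, target="label"):
    """Try sending all unknown values down each branch; keep the best one."""
    known = [r for r in rows if r[feature] is not None]
    unknown_labels = [r[target] for r in rows if r[feature] is None]
    branches = sorted({r[feature] for r in known})
    best_branch, best_error = None, None
    for candidate in branches:
        mistakes = 0
        for b in branches:
            labels = [r[target] for r in known if r[feature] == b]
            if b == candidate:
                labels = labels + unknown_labels  # unknowns join this branch
            mistakes += errors_with_majority(labels)
        error = mistakes / len(rows)
        if best_error is None or error < best_error:
            best_branch, best_error = candidate, error
    return best_branch, best_error


# Made-up toy data: 4 high-income, 4 low-income, 2 unknown-income loans.
rows = (
    [{"income": "high", "label": "safe"}] * 3
    + [{"income": "high", "label": "risky"}] * 1
    + [{"income": "low", "label": "risky"}] * 3
    + [{"income": "low", "label": "safe"}] * 1
    + [{"income": None, "label": "risky"}] * 2
)
print(best_unknown_branch(rows, "income"))  # ('low', 0.2)
```

On this toy data, sending the unknowns down the high branch gives 4 mistakes out of 10, while sending them down the low branch gives 2 out of 10, so the low branch is chosen. Running the same selection separately at every node is what allows different nodes to end up with different unknown branches.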
64 00:04:20,990 --> 00:04:23,760 And we just talked about an example in decision trees. 65 00:04:23,760 --> 00:04:27,930 The downside, though, is that it requires you to modify the actual learning algorithm to 66 00:04:27,930 --> 00:04:29,640 deal with missing data. 67 00:04:29,640 --> 00:04:32,190 In the case of decision trees, I'm going to describe 68 00:04:32,190 --> 00:04:35,270 a very simple modification that can make this happen, 69 00:04:35,270 --> 00:04:37,580 but this can be very complex for other algorithms. 70 00:04:37,580 --> 00:04:39,760 And even for decision trees, if you're going to make it really, really good, 71 00:04:39,760 --> 00:04:42,530 it might be more complex than what we're talking about next. 72 00:04:42,530 --> 00:04:44,830 But the idea here is fundamentally important. 73 00:04:44,830 --> 00:04:45,330 [MUSIC]