[MUSIC] At the core of the decision tree algorithm is choosing the feature to split on, and that's the place we're going to modify it to also choose what to do with missing data.

Arguably, the most important question in decision tree learning is figuring out what feature to split on. So now, when we look at what feature to split on, we're not just going to decide what feature to split on, we're also going to decide which branch to take when we see a missing value. Conceptually, it's just a very simple modification of the algorithm we described in the previous module: we're going to add an extra search to figure out where to put that missing data.

Just like with greedy learning of decision trees, we can focus on learning a decision stump to understand what's going on here. So, say we chose to split on credit. Credit can be excellent, fair, or poor, and the branch here on the right is the one where all data points have credit equal to poor. Now, we might decide that the missing data is going to go into that branch. So we're going to take all the data where credit was missing, together with the data where credit equals poor, and feed it into that node. That's what happens with categorical data.

When we have numeric data, we use the threshold splits we discussed in the previous module. In this case, the threshold is whether income is less than $50,000 or greater than or equal to $50,000, and we might decide that missing data, that is, data where the income is unknown, gets fed into the same node as the data where income was less than $50,000. Those are the decisions we're going to try to make inside of our learning algorithm.

Now, we have to decide where this missing data is going to go. So let's say we're considering splitting on credit. If we split on credit, what should we do with the missing data? Should we put it with the people with excellent credit, with the people with fair credit, or with the people with poor credit? We don't know in advance what's best, but we have a principle for doing this. The principle says: choose the branch that leads to the lowest classification error, just like we did before.
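To make this principle concrete, here is a minimal sketch in Python (not the lecture's own code; the function names and the None-for-missing convention are assumptions for illustration). For a categorical split, it tries routing the missing-value rows down each branch in turn and keeps the assignment with the lowest classification error.

```python
from collections import Counter

def majority_mistakes(labels):
    # Number of mistakes made if we predict the majority class
    # for this group of labels.
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def best_branch_for_missing(data, labels, feature):
    # `data` is a list of dicts; a missing value is stored as None.
    # Returns (branch_value, error): the branch the missing rows
    # should join, and the resulting classification error.
    observed = [(row[feature], y) for row, y in zip(data, labels)
                if row[feature] is not None]
    missing_labels = [y for row, y in zip(data, labels)
                      if row[feature] is None]
    branches = sorted({value for value, _ in observed})

    best = None
    for candidate in branches:
        mistakes = 0
        for branch in branches:
            branch_labels = [y for v, y in observed if v == branch]
            if branch == candidate:
                branch_labels += missing_labels  # missing rows join here
            mistakes += majority_mistakes(branch_labels)
        error = mistakes / len(labels)
        if best is None or error < best[1]:
            best = (candidate, error)
    return best
```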
Let's say we're considering splitting on credit, and we'll consider assigning the missing data to the branch where the credit was excellent. Let's see what happens. In this example, we have three data points with missing data, these three here, and we're considering assigning those data points to the branch where the credit was excellent.

So, we have two kinds of data. We have the observed data: in the excellent branch, we had 9 safe loans and 0 risky; in the fair branch, we had 8 safe loans and 4 risky; and in the poor branch, we had 4 safe loans and 12 risky. Now, we have the three extra data points, and as you can see, among these three extra data points there are two risky and one safe. So, we're thinking about adding these two risky and one safe into the excellent branch. What are the grand totals here? The grand totals are 10 safe and 2 risky inside the excellent branch, and then the same as before for the other two branches, because we're only adding data to the excellent branch.

So what's the overall error of the resulting decision tree? Well, for the excellent branch, we're going to predict safe, so we're going to make 2 mistakes. For the fair branch, we're going to predict safe, so we're going to make 4 mistakes. And for the poor branch, we're going to predict risky, so we're going to make 4 mistakes. So overall, we're going to make 10 mistakes out of 40 total data points, so the error here is 0.25, one quarter.

Now we do the same for every single option, every single branch. So, we tried putting the unknown data with the excellent branch and got error 0.25. We tried putting it with the fair branch, and we get a 0.25 error as well. But if we try putting it with the last branch here, the poor branch, we get an error of 0.225. That's a lower error, so this is the winner. The best thing to do is to assign the unknowns to the poor branch, to credit equals poor, which is exactly what we did in the earlier tree.

This is the modification of the tree splitting algorithm that takes missing data into account. It gets a little complicated, but it's basically the simple idea we just described. We're leaving it here for reference, in case you choose to implement it.
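To check the arithmetic in this example, here is a small hypothetical snippet using the (safe, risky) counts from the slide: 9/0, 8/4, and 4/12 observed, plus 1 safe and 2 risky loans with missing credit.

```python
# Observed (safe, risky) counts per branch, plus the missing rows.
observed = {'excellent': (9, 0), 'fair': (8, 4), 'poor': (4, 12)}
missing_safe, missing_risky = 1, 2
total = sum(s + r for s, r in observed.values()) + missing_safe + missing_risky  # 40

for candidate in observed:
    mistakes = 0
    for branch, (safe, risky) in observed.items():
        if branch == candidate:
            safe, risky = safe + missing_safe, risky + missing_risky
        mistakes += min(safe, risky)  # majority-class prediction errors
    print(candidate, mistakes / total)
# excellent 0.25, fair 0.25, poor 0.225 -> route unknowns to 'poor'
```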
But overall, the fundamental concept is that you can embed the handling of missing data inside the learning algorithm to get better overall performance. [MUSIC]
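Putting the pieces together, here is a sketch (again with hypothetical names, reusing best_branch_for_missing from the earlier snippet) of the modified greedy split selection: search jointly over the candidate features and over which branch should receive the missing values.

```python
def best_split_with_missing(data, labels, features):
    # For each candidate feature, also search over which branch
    # the missing values join; keep the (feature, branch) pair
    # with the lowest classification error.
    best = None  # (feature, branch_for_missing, error)
    for feature in features:
        branch, error = best_branch_for_missing(data, labels, feature)
        if best is None or error < best[2]:
            best = (feature, branch, error)
    return best

# Hypothetical usage on the loan example:
# feature, branch, err = best_split_with_missing(loans, labels,
#                                                ['credit', 'term', 'income'])
```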