[MUSIC]

We've now outlined the greedy algorithm for learning a decision tree. The first thing we're going to explore is this idea of picking what feature to split on next. In our example we split on credit first, but we could have split on a different feature, so how do we decide what to do? It turns out that this feature selection problem, this problem of learning what feature to split on, can be viewed as the problem of learning what's called a decision stump, which is one level of the decision tree. For those not familiar with it, a tree is kind of this really big thing, but if you cut it, you're only left with the little bit at the bottom. And that thing is called the stump. So it's a really, really, really short piece of a tree.

So how do you learn a decision stump, or a 1-level decision tree, from data? We're given a data set like this, just like we had before, and our goal here is to learn a 1-level decision tree. So we're given the top node, or the root node, which contains all the data: some of the data points are safe, and some of the data points are risky. There are 40 examples in our case, and it turns out that 22 of those are safe loans and 18 are risky loans. That's what our data set looks like.

Now, I have a histogram, but as we build these figures they can get really big and complicated, so I'm going to compress them a little bit. Instead of showing the histogram along with the numbers 22 and 18, I'm just going to show the numbers 22 and 18 to simplify the visualization. So now when you see that root node, you should interpret it as: we have 40 data points, 22 are green, which are safe loans, and 18 are orange, which are risky loans.

And starting from there, how do we go and build that decision stump? In our case, we had all the data, we split on credit, and we decided that some subset of the data had excellent credit, some had fair, and some had poor. So we assign each one of those subsets to a subsequent node. In our new visualization notation, we have the original root node with all the data: 22 safe and 18 risky. For excellent credit, we have the subset of the data where 9 are safe and 0 are risky, so 9 in green, 0 in orange. For fair credit, we have 9 safe and 4 risky. And for poor credit, we have 4 safe and 14 risky. So that's what the data looks like at the next level, after we've done the split. These nodes here in the middle we call intermediate nodes.

Now, for each intermediate node, we can try to make a prediction in the decision stump. So, for example, for poor credit we see that the majority of the data in there is risky, so we predict that to be a risky loan. For fair credit, we see that the majority, 9 versus 4, are safe loans, so we predict that to be a safe loan. And for excellent credit, we predict that to be a safe loan, because it's 9 versus 0: nine safe loans in there. So for each node, we look at the majority value to make a prediction.

And you've now learned your first decision stump. It's a pretty simple one, but to get better predictions and more accuracy, we're going to explore that more and split further. But before we split further, we're going to discuss why we picked credit to do the first split, as opposed to, say, the term of the loan or income.

[MUSIC]
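The lecture itself shows no code, but a minimal Python sketch of this majority-vote procedure may help make it concrete. The function name `learn_decision_stump` and the dictionary-based data layout are illustrative assumptions, not from the course; the toy data below just reproduces the 40-loan counts from the example (excellent: 9 safe / 0 risky, fair: 9 safe / 4 risky, poor: 4 safe / 14 risky).

```python
from collections import Counter, defaultdict

def learn_decision_stump(data, feature, label_key="label"):
    """Learn a 1-level decision tree (decision stump) on one feature:
    partition the data by the feature's value, then predict the
    majority class label within each partition."""
    partitions = defaultdict(list)
    for row in data:
        partitions[row[feature]].append(row[label_key])
    # Majority vote in each branch gives that branch's prediction.
    return {value: Counter(labels).most_common(1)[0][0]
            for value, labels in partitions.items()}

# Hypothetical toy data matching the lecture's counts.
data = (
    [{"credit": "excellent", "label": "safe"}] * 9
    + [{"credit": "fair", "label": "safe"}] * 9
    + [{"credit": "fair", "label": "risky"}] * 4
    + [{"credit": "poor", "label": "safe"}] * 4
    + [{"credit": "poor", "label": "risky"}] * 14
)

stump = learn_decision_stump(data, feature="credit")
print(stump)  # {'excellent': 'safe', 'fair': 'safe', 'poor': 'risky'}
```

To classify a new loan, you would look up its credit value in the returned dictionary; a credit value never seen in training would need a fallback, such as predicting the majority class of the whole root node.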