So now we've discussed the idea of building a decision stump from data, and a little bit of what that data set looks like. Let's discuss now how to pick the right feature to split on when we're building a decision stump.

So we're trying to learn a decision stump from data, and in our example we chose to split on credit. But the question is, what is the best feature to split on, and how do we measure that? We split on credit, but we could have split on something else, say the term of the loan: is this a three-year loan or a five-year loan? And the question is, which of the two is better? What is a better split? That's what we're going to ask next.

Intuitively, a better split is one that gives you the lowest classification error, and that's exactly what we'll explore in the algorithm. We'd like to figure out whether it's better to split on credit or to split on term. The way we're going to do that is by measuring the number of mistakes each of the decision stumps makes and picking the one that makes the fewest mistakes, so it has the lowest error. Just remember that the error is the number of mistakes a classifier makes divided by the total number of data points.

Let's start with the root node, which is what we get if we make no splits at all, and measure the error in that case. As a reminder, we're going to predict y hat to be the majority class associated with a particular node. In our case, the class with the most data points at the root is the safe class, and we're going to compute the classification error of making that prediction. For this classification error, we're simply predicting that all the data points are safe. That gives 22 correct predictions, because 22 loans were safe, and 18 mistakes. So the classification error here is 18 / (22 + 18), which is 18 over 40, or 0.45. For a binary classification problem, an error of 0.45 is really bad. So not splitting on anything gives you a pretty bad result.
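To make the arithmetic concrete, here is a minimal Python sketch using the counts from the example above (22 safe loans, 18 risky loans): it picks the majority class at the root and computes the classification error of predicting that class for every data point.

```python
# Classification error at the root node: predict the majority class for all points.
n_safe, n_risky = 22, 18           # label counts in the training data (from the example)
n_total = n_safe + n_risky         # 40 data points

majority_class = "safe" if n_safe >= n_risky else "risky"
n_mistakes = min(n_safe, n_risky)  # every point in the minority class is misclassified

error = n_mistakes / n_total
print(majority_class, error)       # safe 0.45
```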
So the question here is, how good is the decision stump that splits on credit? How does it compare to not splitting on anything, which had a classification error of 0.45? Is this one better?

Let's look at the decision stump. For data points that have excellent credit, we're going to predict that they're safe. For those that have fair credit, we're also going to predict that they're safe. And for those that have poor credit, we're going to predict that they're risky. This is our prediction, and again it's the majority value of the data in each of these nodes. If you look at how many mistakes we make, you'll see that for the data with excellent credit we make zero mistakes, because everything was safe there. For the data with fair credit we make four mistakes, because there were four risky loans with fair credit. And for the data with poor credit we again make four mistakes, because there were four safe loans with poor credit. So let's compute our overall error. We make 4 + 4 mistakes, so that's 8 out of 40 data points, which is an error of 0.2. That's smaller than the 0.45 we had before. We've gone down from 0.45 to 0.2, so splitting on credit seems like a pretty good idea.

Now let's see what happens when we split on the term of the loan. If the term is three years, there are 16 safe loans and 4 risky ones, so in this branch we're making four mistakes. For five years we predict risky, but there were six safe loans, so we're making six mistakes. If you look at the overall error here, it's (4 + 6) / 40. That's 10 divided by 40, which is 0.25.

So overall, if we look at our data: not splitting on anything, the root node, has 0.45 error; splitting on credit has 0.2 error; and splitting on term has 0.25 error. We can go back and ask, what is the best choice? Should we split on credit, or should we split on term? The answer now becomes obvious. Splitting on credit gives the lower classification error, so this is what our greedy algorithm will do first.
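The same counting argument applies to any candidate split: each branch predicts its majority class, so it contributes its minority count to the total number of mistakes. A small sketch of that calculation follows. The term-split counts (16 safe / 4 risky for three years, 6 safe / 14 risky for five years) come straight from the example; for the credit split, only the per-branch mistake counts (0, 4, 4) are given in the lecture, so the safe/risky breakdowns used below are illustrative assumptions chosen to be consistent with them.

```python
def split_error(branches):
    """branches: list of (n_safe, n_risky) counts, one pair per branch of the split.
    Each branch predicts its majority class, so it contributes min(safe, risky) mistakes."""
    mistakes = sum(min(s, r) for s, r in branches)
    total = sum(s + r for s, r in branches)
    return mistakes / total

# Term split (3 years, 5 years), taken directly from the example.
print(split_error([(16, 4), (6, 14)]))         # 0.25

# Credit split (excellent, fair, poor). The safe counts for excellent and fair
# are assumed for illustration; only the mistake counts (0, 4, 4) are stated.
print(split_error([(9, 0), (9, 4), (4, 14)]))  # 0.2
```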
Credit is the first feature to split on; it's the winner of our selection process.

So in general, the decision tree splitting process says: given the subset of data at a node M, which so far has been the root node, try out every feature x_i, which in our case means credit, term, and income. For each of these features, we consider splitting the data according to its possible values, and we compute the classification error of the resulting split, just like we did manually above. Then we pick the feature with the lowest classification error, which in our case was credit. (A small code sketch of this selection step follows below.)

So if we go back to our decision tree learning algorithm, the first challenge we had, figuring out what feature to split on, can now be addressed using this feature split selection algorithm that minimizes the classification error. Next, we'll explore the other parts of this decision tree learning algorithm.
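Here is a hedged Python sketch of that feature split selection step: given the data at a node, try every feature, split on its values, and keep the feature whose split has the lowest classification error. The function and variable names (best_split_feature, data as a list of dicts with a "label" key) are illustrative choices, not the course's actual code.

```python
from collections import Counter

def classification_error(labels):
    """Error of predicting the majority label for this group of points."""
    if not labels:
        return 0.0
    majority_count = Counter(labels).most_common(1)[0][1]
    return (len(labels) - majority_count) / len(labels)

def best_split_feature(data, features, label_key="label"):
    """Greedy step: pick the feature whose split has the lowest classification error.

    data: list of dicts, e.g. {"credit": "fair", "term": "3 years", "label": "safe"}.
    """
    best_feature, best_error = None, float("inf")
    for feature in features:
        # Group the node's data points by the feature's values.
        groups = {}
        for point in data:
            groups.setdefault(point[feature], []).append(point[label_key])
        # Error of the split = total mistakes across all branches / total points.
        mistakes = sum(len(lbls) * classification_error(lbls) for lbls in groups.values())
        error = mistakes / len(data)
        if error < best_error:
            best_feature, best_error = feature, error
    return best_feature, best_error
```

Called with the loan data represented this way and features ["credit", "term", "income"], this would pick credit with an error of 0.2, matching the choice made in the lecture.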