Well, let's start with the idea of early stopping, where we keep going down the tree but stop before it gets too complex. So how does that work? Simply recall that as we add more depth to the tree, we're increasing the complexity of the model. And so, if we want to avoid these super complex models, we can just avoid very deep trees.

So the first early stopping condition is absolutely simple: just limit the depth of the tree. Now, we know that the training error goes down as you increase the depth of the tree, but the true error goes down and then goes back up. So maybe there's some magical max depth where, before that point, the trees are simple and achieve good true error, but after it the trees get too complex and the true error keeps going up. And so this max depth is going to be a magic parameter we provide, where we stop going down the recursion once we hit that max_depth. And just like any other magic parameter in machine learning, you have to figure out how to pick it. The answer is a validation set or cross-validation; never use your test set to pick your magic parameters. So that's what you do with max_depth.

Limiting the depth is just one approach for limiting the complexity of the decision tree. However, if you limit the depth, you're cutting every branch at the same point, but some branches might want to go further than others. Let's take a look at an approach where some branches can go down further than others. So, for example, we can stop recursing if the classification error doesn't get better.

So we split on credit, and now we're looking at what happens if we recurse on poor. Of the five safe loans with credit poor and the 16 risky loans with credit poor, if we just stopped there and made no splits, we'd be making five mistakes. The error would be 5 out of 21, which is 0.24. Now, if I were to try to split on the other features, term and income, suppose that neither of them helped, that neither of them decreased the training error. So the training error would still be 0.24 for both of them. What should I do?
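To make that number concrete, here is a minimal sketch in Python, assuming a binary labeling with +1 for safe loans and -1 for risky loans, of how the classification error of a node's majority-class prediction could be computed; the helper name node_error is just illustrative. It reproduces the 5-out-of-21 calculation for the credit-equals-poor node.

```python
def node_error(labels):
    """Classification error of predicting the majority class at a node.

    labels is a list of +1 (safe) / -1 (risky) values; the majority-class
    prediction mislabels exactly the minority class.
    """
    n_safe = sum(1 for y in labels if y == +1)
    n_risky = len(labels) - n_safe
    return min(n_safe, n_risky) / len(labels)

# The credit = poor node from the example: 5 safe loans and 16 risky loans.
poor_node = [+1] * 5 + [-1] * 16
print(node_error(poor_node))  # 5/21, about 0.24
```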
Now, I could try to split on one of them and keep going, but that would make the tree more and more complex. So a second early stopping condition we might use is to say: stop there. If the training error is not decreasing, don't make the tree more complex. Use Occam's razor, stop the process right there, and just say that everything under credit equals poor is a risky loan. And then just focus on what to do with the node where credit is fair.

So the second type of early stopping criterion we're exploring is: keep recursing, but stop if the training error does not decrease; don't make the tree more complex. Now, there are a few practical caveats about this. First, typically you don't just stop when the error doesn't decrease at all. You set a magic parameter and say: stop if the error doesn't decrease by more than epsilon. Let's say that if it doesn't decrease by more than 1%, we just stop. There are a few pitfalls in practice with this rule, and we'll talk about those a little bit in the context of pruning. So this is a rule that you should be super careful about, but it's very useful in practice. Something to definitely keep in mind.

The third type of early stopping rule is extremely important, and you should always include it. It just says: if there are only a few data points left in a node, don't split any further, because then you're really risking overfitting. So an example would be the case where credit was equal to fair. There was only one safe loan there and two risky loans, so a total of only three data points falls inside that node. In that case, if we try to split further, we're going to find lots of patterns in the data that don't make much sense, because there's very little data to support those patterns. With three data points, I don't know if I really trust a decision tree algorithm to make tons of decisions, so I'll just stop recursing right there. In general, we should always set a parameter, say Nmin, which says that if a node has Nmin data points or fewer, then just stop recursing right there; call that early stopping condition three. This one you should always implement, no matter what. For example, here I'm suggesting setting Nmin to 10.
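To gather the three checks in one place, here is a minimal sketch of how they might be combined into a single helper called at each node before splitting. The function name should_stop_early and the parameters max_depth, min_error_reduction, and min_node_size are illustrative, not from any particular library, and the default thresholds are exactly the kind of magic parameters you would tune with a validation set.

```python
def should_stop_early(n_points, depth, current_error, best_split_error,
                      max_depth=6, min_error_reduction=0.01, min_node_size=10):
    """Return True if any of the three early stopping conditions fires.

    n_points         -- number of data points in the current node
    depth            -- depth of the current node in the tree
    current_error    -- classification error of the majority-class prediction
    best_split_error -- error of the best candidate split at this node
    """
    # Condition 1: limit the depth of the tree.
    if depth >= max_depth:
        return True
    # Condition 2: stop if the best split does not decrease the error
    # by more than epsilon (min_error_reduction).
    if current_error - best_split_error <= min_error_reduction:
        return True
    # Condition 3: stop if only a few data points are left in the node.
    if n_points <= min_node_size:
        return True
    return False
```

For the credit-equals-poor node above, should_stop_early(21, 1, 5/21, 5/21) returns True through condition 2, since neither term nor income reduces the error.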
10 is quite a nice number for Nmin if you have a relatively small dataset. If you have a huge dataset, you might set it to 100. But somewhere between 10 and 100 is a good value for Nmin.

So let's summarize the early stopping conditions that we described in this approach to finding decision trees that are not too complex. There are three. The first one is limiting the depth of the tree. That's a relatively good condition to set, and you should pick it pretty carefully using cross-validation. The second one is to stop recursing if the classification error does not decrease enough, either it doesn't decrease at all or it doesn't decrease by more than epsilon. That one is a little risky but can be very useful. The third one, extremely useful, and we should always do it, is to stop recursing if there are not that many data points left inside a node. So these are some of the key early stopping conditions a decision tree algorithm should be thinking about.

If we go back to the decision tree learning algorithm that we discussed in the previous module: we start with the empty tree, we select a feature to split on, then we stop if there is nothing left to do and just make the prediction, which is the majority class; otherwise, we recurse and go back to step two. We considered two stopping conditions in the previous module, and to those we are adding the three early stopping conditions that we just described. And so small modifications to the code we've already built give you this extra factor to help prevent some overfitting, as in the sketch below.
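Putting it all together, here is a minimal sketch of how the greedy recursive tree builder from the previous module might be extended with the three early stopping conditions. It assumes categorical features stored in dictionaries and +1/-1 labels; the function names and the dictionary representation of the tree are illustrative, not the course's actual implementation.

```python
def majority_class(labels):
    # Predict +1 (safe) if safe loans are at least half the node, else -1 (risky).
    n_safe = sum(1 for y in labels if y == +1)
    return +1 if n_safe >= len(labels) - n_safe else -1

def node_error(labels):
    # Classification error of the majority-class prediction (see the earlier sketch).
    n_safe = sum(1 for y in labels if y == +1)
    return min(n_safe, len(labels) - n_safe) / len(labels)

def build_tree(data, labels, features, depth=0,
               max_depth=6, min_error_reduction=0.01, min_node_size=10):
    prediction = majority_class(labels)
    current_error = node_error(labels)

    # Original stopping conditions: the node is pure, or no features remain.
    if current_error == 0.0 or not features:
        return {"is_leaf": True, "prediction": prediction}

    # Early stopping condition 1: limit the depth of the tree.
    # Early stopping condition 3: too few data points left in the node.
    if depth >= max_depth or len(labels) <= min_node_size:
        return {"is_leaf": True, "prediction": prediction}

    # Weighted classification error of splitting on one categorical feature.
    def split_error(feature):
        mistakes = 0
        for value in set(x[feature] for x in data):
            branch = [y for x, y in zip(data, labels) if x[feature] == value]
            n_safe = sum(1 for y in branch if y == +1)
            mistakes += min(n_safe, len(branch) - n_safe)
        return mistakes / len(labels)

    best_feature = min(features, key=split_error)

    # Early stopping condition 2: error does not decrease by more than epsilon.
    if current_error - split_error(best_feature) <= min_error_reduction:
        return {"is_leaf": True, "prediction": prediction}

    # Otherwise recurse on each branch of the chosen split.
    children = {}
    remaining = [f for f in features if f != best_feature]
    for value in set(x[best_feature] for x in data):
        idx = [i for i, x in enumerate(data) if x[best_feature] == value]
        children[value] = build_tree(
            [data[i] for i in idx], [labels[i] for i in idx], remaining,
            depth + 1, max_depth, min_error_reduction, min_node_size)
    return {"is_leaf": False, "split_feature": best_feature, "children": children}
```

A call such as build_tree(loans, labels, ["credit", "term", "income"]) would then grow the tree only while the splits keep paying off under these three conditions.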