1 00:00:00,000 --> 00:00:04,754 [MUSIC] 2 00:00:04,754 --> 00:00:10,350 Often, the best way to prevent overfitting in decision trees is to use pruning, 3 00:00:10,350 --> 00:00:13,260 where you grow the tree to be much bigger than you need, 4 00:00:13,260 --> 00:00:17,140 and then you chop off the parts that turn out to be less important. 5 00:00:18,270 --> 00:00:21,640 Now, pruning is very important if you're implementing 6 00:00:21,640 --> 00:00:25,110 a real-world decision tree algorithm. 7 00:00:25,110 --> 00:00:28,530 However, the details can get a little tedious. 8 00:00:28,530 --> 00:00:32,770 Not too mathematical, just a lot of little things to get right, 9 00:00:32,770 --> 00:00:35,570 a lot of Ts to cross and Is to dot. 10 00:00:35,570 --> 00:00:38,260 And so we've made this section optional. 11 00:00:38,260 --> 00:00:40,450 You're welcome to watch it. 12 00:00:40,450 --> 00:00:43,790 Not a lot of math, just a lot of detail. 13 00:00:43,790 --> 00:00:47,080 And so I would recommend you pay especially close attention 14 00:00:47,080 --> 00:00:50,330 if you're implementing your own decision tree learning from scratch. 15 00:00:50,330 --> 00:00:53,750 But if not, just know what pruning is; we just talked about it. 16 00:00:53,750 --> 00:00:56,000 And know that most implementations, 17 00:00:56,000 --> 00:00:59,240 most great implementations out there, will use some kind of pruning. 18 00:01:00,520 --> 00:01:03,920 We discussed the recursion where we build the decision tree from 19 00:01:03,920 --> 00:01:09,110 the top down, and we discussed two types of stopping conditions: 20 00:01:09,110 --> 00:01:11,520 the stopping conditions that we should always use, 21 00:01:11,520 --> 00:01:15,380 from the previous module, and the three early stopping conditions from this module.
22 00:01:15,380 --> 00:01:19,370 Now let's explore some of the challenges that come about 23 00:01:19,370 --> 00:01:22,230 with these early stopping conditions. 24 00:01:22,230 --> 00:01:25,500 Early stopping condition one was, let's pick a max depth, 25 00:01:25,500 --> 00:01:28,430 and stop recursing once you reach that max depth. 26 00:01:28,430 --> 00:01:32,760 The question is, how do you really pick the max depth? 27 00:01:33,810 --> 00:01:37,320 We said that you should use cross validation or a validation set, but 28 00:01:37,320 --> 00:01:41,760 there can be challenges with setting data aside for that: you have to rerun your 29 00:01:41,760 --> 00:01:47,910 algorithm many times, and you might not always reach the answer you'd like. 30 00:01:47,910 --> 00:01:51,390 And plus, you might want some parts of the tree to grow more than others. 31 00:01:51,390 --> 00:01:54,640 It becomes really quite complicated to set this parameter. 32 00:01:54,640 --> 00:01:57,940 Now, early stopping condition two is the most dangerous one. 33 00:01:57,940 --> 00:02:04,880 Early stopping condition two says stop if the training error stops decreasing. 34 00:02:04,880 --> 00:02:09,680 So if the classification error for the training data stops going down, if it 35 00:02:09,680 --> 00:02:10,600 flattens out. 36 00:02:10,600 --> 00:02:13,150 However, it turns out that in practice 37 00:02:13,150 --> 00:02:15,420 the curve might not be as beautiful as what I drew. 38 00:02:15,420 --> 00:02:18,250 It might not be smoothly going down; it might jump and 39 00:02:18,250 --> 00:02:20,100 it might do some weird stuff. 40 00:02:20,100 --> 00:02:21,770 Let's see why that can come about. 41 00:02:23,110 --> 00:02:26,300 As we discussed, early stopping condition two, 42 00:02:26,300 --> 00:02:28,710 stop when error doesn't decrease, can be dangerous.
43 00:02:28,710 --> 00:02:32,470 And to show how it's dangerous, we need a counterexample, and we'll use the 44 00:02:32,470 --> 00:02:38,680 counterexample for everything, which, if you took course one, you know is xor. 45 00:02:39,720 --> 00:02:42,720 So, here's our data, and 46 00:02:42,720 --> 00:02:48,640 this data corresponds to the xor of the first input and the second input. 47 00:02:48,640 --> 00:02:52,040 So, when they're both false the output is false, and 48 00:02:52,040 --> 00:02:55,860 when one is false and the other is true the output is true, which happens in the two 49 00:02:55,860 --> 00:03:00,910 middle rows here, and then when they're both true the output is false. 50 00:03:00,910 --> 00:03:04,810 If you ever need a counterexample for anything, just say xor. 51 00:03:04,810 --> 00:03:07,240 In fact, if someone calls you in the middle of the night and 52 00:03:07,240 --> 00:03:11,370 asks, what is the counterexample for this crazy problem I'm having? 53 00:03:11,370 --> 00:03:13,280 You say, xor. 54 00:03:14,920 --> 00:03:20,510 So what happens with this xor when we just look at the root of the decision tree? 55 00:03:20,510 --> 00:03:28,030 Well, we have two rows with output true, and two rows with output false. 56 00:03:28,030 --> 00:03:33,230 And so what is the error of this decision tree if we just stop at the top level? 57 00:03:33,230 --> 00:03:37,540 Well, you just choose one of these, because they're both the same; let's say we choose 58 00:03:37,540 --> 00:03:42,350 true, because we're optimists, we like to think positively. 59 00:03:42,350 --> 00:03:46,990 And so we're going to say that the output here is +1. 60 00:03:46,990 --> 00:03:52,150 And so that means that we're making two mistakes. 61 00:03:53,940 --> 00:03:58,519 Two mistakes, so two out of four datapoints, so 62 00:03:58,519 --> 00:04:02,440 our error is 0.5, which is bad. 63 00:04:02,440 --> 00:04:07,496 That means that you are as good as random, with error 0.5.
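As a quick sanity check on that arithmetic, here's a tiny Python sketch (the four-row table and the helper name are my own illustration, not the course's code) computing the majority-vote error at the root of the tree:

```python
# xor truth table: ((x1, x2), y). Illustrative sketch, not course code.
data = [((False, False), False),
        ((False, True),  True),
        ((True,  False), True),
        ((True,  True),  False)]

def majority_error(rows):
    """Error of a leaf that predicts the majority label (ties broken as true)."""
    labels = [y for _, y in rows]
    return 1 - max(labels.count(True), labels.count(False)) / len(labels)

# Two of the four rows are true, so predicting +1 everywhere
# misclassifies the other two: error 0.5, as good as random.
print(majority_error(data))  # 0.5
```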
64 00:04:07,496 --> 00:04:10,656 Okay, let's see what happens if we split on x[1]. 65 00:04:10,656 --> 00:04:16,370 Well, if we split on x[1], we are actually 66 00:04:16,370 --> 00:04:21,110 splitting the table like this, and you see that, because xor is so beautiful, 67 00:04:21,110 --> 00:04:24,790 again, for each one of these subsequent nodes, we're in the same situation. 68 00:04:24,790 --> 00:04:31,170 So, for example, for x[1] equals true, you have one y true and one y false. 69 00:04:31,170 --> 00:04:35,920 So let's suppose that you predict plus one for this side, and also plus one for 70 00:04:35,920 --> 00:04:41,540 the side where x[1] is false, because it's all even stevens. 71 00:04:41,540 --> 00:04:42,325 So what's the error? 72 00:04:42,325 --> 00:04:47,920 Well, we made mistakes, mistakes were made. 73 00:04:47,920 --> 00:04:52,698 Mistakes were made here and here, so we made 1 + 1 74 00:04:52,698 --> 00:04:57,487 mistakes, which is 2 out of 4, so it's 0.5. 75 00:04:57,487 --> 00:05:01,160 So it turns out that splitting on x[1] does not help. 76 00:05:02,710 --> 00:05:08,100 Okay, and see what happens when you try to split on x[2]? 77 00:05:08,100 --> 00:05:10,960 So if I try to split on x[2], I end up with 78 00:05:10,960 --> 00:05:15,480 the same situation, except I change this label here from x[1] to x[2]. 79 00:05:15,480 --> 00:05:20,810 But again, we'll be in exactly the same situation, 80 00:05:20,810 --> 00:05:26,350 where we're going to output +1 on each one of these outcomes, 81 00:05:26,350 --> 00:05:28,760 and make the same number of mistakes. 82 00:05:28,760 --> 00:05:35,727 So one mistake for each, which gives me a total error of 0.5. 83 00:05:35,727 --> 00:05:39,278 And so, all in all, 84 00:05:39,278 --> 00:05:43,940 neither splitting on x[1] nor on x[2] 85 00:05:45,040 --> 00:05:47,590 helps me reduce the error. 86 00:05:47,590 --> 00:05:50,720 So both splits give you the same error.
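The arithmetic above can be sketched in a few lines of Python (the four-row table and the helper names are my own, not the course's code): the majority-vote error at the root, and after splitting on either feature alone, all come out to 0.5.

```python
# xor truth table: ((x1, x2), y). Illustrative sketch, not course code.
data = [((False, False), False),
        ((False, True),  True),
        ((True,  False), True),
        ((True,  True),  False)]

def majority_error(rows):
    """Error of a leaf predicting the majority label (ties broken as true)."""
    labels = [y for _, y in rows]
    return 1 - max(labels.count(True), labels.count(False)) / len(labels)

def split_error(rows, feature):
    """Total error after splitting on one feature, majority vote per branch."""
    true_side  = [r for r in rows if r[0][feature]]
    false_side = [r for r in rows if not r[0][feature]]
    mistakes = (majority_error(true_side) * len(true_side)
                + majority_error(false_side) * len(false_side))
    return mistakes / len(rows)

# Neither a split on x[1] nor on x[2] improves on the root's error.
print(majority_error(data), split_error(data, 0), split_error(data, 1))  # 0.5 0.5 0.5
```

So early stopping condition two would halt at the root here, which is exactly the trap the lecture is describing.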
87 00:05:52,570 --> 00:05:54,710 So the question is, should I stop now? 88 00:05:54,710 --> 00:05:59,510 If neither feature improves the error, and you're using early stopping condition 89 00:05:59,510 --> 00:06:04,460 two, then you just say no, don't do it, don't take either of the splits. 90 00:06:04,460 --> 00:06:07,730 You just stop splitting the decision tree. 91 00:06:07,730 --> 00:06:13,950 At the very, very root you output true, make two mistakes, 92 00:06:13,950 --> 00:06:19,115 and say that the error that you get in the end is 0.5. 93 00:06:19,115 --> 00:06:25,000 So, [LAUGH] this is why xor is a counterexample to everything. 94 00:06:25,000 --> 00:06:28,422 What would happen if you didn't take early stopping condition two and 95 00:06:28,422 --> 00:06:29,470 you just kept going? 96 00:06:31,699 --> 00:06:36,580 If you just kept going, you'd get this tree that you see here. 97 00:06:36,580 --> 00:06:41,680 And if you look, for each one of these leaves, there are zero mistakes. 98 00:06:41,680 --> 00:06:47,680 So the leaf on the left outputs false, or minus one, 99 00:06:47,680 --> 00:06:51,200 and makes zero mistakes; the next one makes zero mistakes. 100 00:06:51,200 --> 00:06:54,510 The third one makes zero mistakes, and the last one makes zero mistakes. 101 00:06:54,510 --> 00:06:56,100 So, if you kept going, 102 00:06:56,100 --> 00:07:00,979 if you didn't use early stopping condition two, 103 00:07:00,979 --> 00:07:04,480 you would get a training error of zero. 104 00:07:04,480 --> 00:07:08,908 And so, in this case, there is a huge gap, 105 00:07:08,908 --> 00:07:13,582 a huge gap, between a training error of 0.5, 106 00:07:13,582 --> 00:07:19,363 which is as good as random, and a training error of zero, 107 00:07:19,363 --> 00:07:22,930 which is perfect training error. 108 00:07:24,540 --> 00:07:27,520 So in this case, early stopping condition two 109 00:07:28,740 --> 00:07:33,610 created this arbitrarily large gap between the best you could have done.
110 00:07:33,610 --> 00:07:34,880 And what you end up outputting. 111 00:07:37,170 --> 00:07:39,590 So we saw early stopping condition two: 112 00:07:39,590 --> 00:07:43,980 stop recursing if the training error does not decrease enough. 113 00:07:43,980 --> 00:07:50,090 It seems like a reasonable heuristic, but it has some dangerous pitfalls, 114 00:07:50,090 --> 00:07:53,720 because it's too short-sighted: it doesn't explore branches far enough. 115 00:07:53,720 --> 00:07:56,540 So it's generally a good principle, 116 00:07:56,540 --> 00:08:00,560 but if you apply it top down you might get yourself into trouble. 117 00:08:00,560 --> 00:08:02,040 So what can you do about this? 118 00:08:02,040 --> 00:08:06,360 This is where pruning comes in: it uses something like this criterion, but 119 00:08:06,360 --> 00:08:07,515 applies it bottom up. 120 00:08:07,515 --> 00:08:11,709 [MUSIC]
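To make the bottom-up idea concrete, here is a minimal sketch of reduced-error-style pruning (the tree representation, data, and function names are my own illustration, not the course's implementation): grow the tree fully, then walk it from the leaves up, collapsing any subtree whose replacement by a majority-vote leaf does not increase the error.

```python
# A node is either a bool (leaf prediction) or a dict with a 'feature'
# index and a 'branches' map from feature value to child node.
# Illustrative sketch only, not the course's implementation.

def classify(node, x):
    while isinstance(node, dict):
        node = node['branches'][x[node['feature']]]
    return node

def majority(rows):
    labels = [y for _, y in rows]
    return labels.count(True) >= labels.count(False)

def error(node, rows):
    return sum(classify(node, x) != y for x, y in rows) / len(rows)

def prune(node, rows):
    """Bottom up: prune the children first, then collapse this node into
    a majority-vote leaf if doing so does not increase the error."""
    if not isinstance(node, dict):
        return node
    for value, child in node['branches'].items():
        subset = [r for r in rows if r[0][node['feature']] == value]
        if subset:
            node['branches'][value] = prune(child, subset)
    leaf = majority(rows)
    if error(leaf, rows) <= error(node, rows):
        return leaf          # the subtree wasn't helping; chop it off
    return node

# The full xor tree survives pruning, because collapsing any subtree
# would raise the error from 0 back toward 0.5.
data = [((False, False), False), ((False, True), True),
        ((True, False), True),  ((True, True), False)]
xor_tree = {'feature': 0, 'branches': {
    False: {'feature': 1, 'branches': {False: False, True: True}},
    True:  {'feature': 1, 'branches': {False: True, True: False}}}}
pruned = prune(xor_tree, data)
print(error(pruned, data))  # 0.0 — no useful split was removed

# By contrast, a split whose branches all predict the same thing
# collapses to a single leaf.
useless = {'feature': 0, 'branches': {False: True, True: True}}
print(prune(useless, data))  # True
```

Because the decision is made from the leaves upward, the useless-looking split at the root of the xor tree is kept: by the time we consider collapsing it, its subtrees have already proven their worth, which is exactly what top-down early stopping cannot see.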