[MUSIC] So what does it mean to avoid overfitting in decision trees? What does it mean to stop a little earlier, or to simplify the types of trees that we learn, so that we don't have these crazy overfitting problems? What we'll use is a type of regularization that biases us towards simpler trees, and that relates philosophically to a concept called Occam's razor. So let's talk about that principle. It's interesting that this idea of finding simpler trees, simpler explanations for data, is a concept that goes back a long, long time. Many people call it Occam's razor after William of Occam, who was talking about this concept around the 13th century, although it goes back to Pythagoras and Aristotle, you know, 2,000 years or so before that. But basically the concept has been: of the possible hypotheses, of the possible explanations, and likewise of the possible trees, you should always pick the one that explains the data with as simple an explanation as possible. So, for example, let's say you go to the doctor and say, I have these two symptoms, S1 and S2. Let's say I have a terrible headache and a terrible sore throat. There are two options the doctor might see.
One option: a brain tumor explains the headache, and a throat infection explains the sore throat. Or the doctor could say, okay, there's another disease, let's say the common cold, that explains both the headache and the sore throat. And in that case it's probably better, probably more likely, that it's just one disease that explains both symptoms, than having two separate diseases, one explaining symptom one and the other explaining symptom two. Now, that's the general idea. The way Occam put it was: among competing hypotheses, the one with the fewest assumptions should always be selected. And fewest assumptions means simplest in our case. In the context of decision trees, the Occam's razor principle says: if you have two trees that have similar classification error, pick the simpler one. So let's go through the example below. We have the training error for four different trees, and then we have the validation error for the same four trees. And we observe a couple of different things. We observe that two trees have the same validation error, and that one tree has a very low training error but a much higher validation error, so that last tree is clearly overfit. The super complex tree is definitely bad.
But interestingly, what do you do about the other two trees, the moderately complex one and the more complex one? They get the same validation error, but one has higher training error. You can say, well, you know what, the simpler tree is probably better; it's less likely to overfit. So I'm probably going to pick the moderately complex tree that gives me reasonable training error and low validation error. So that's an intuitively, conceptually appealing idea: try to pick simpler trees that explain the data relatively well and do well on the validation set. But to get this right, we have to go and dig in, and define what a simpler tree is. So the question here is: what makes a tree simpler? We have this really complex, high-depth, complicated tree on the left, or the shallower, two-level decision tree on the right. And clearly, the one on the right is simpler. So if they both have similar performance, let's pick the simpler tree. But in general, let's try to bias our algorithms to find the simpler trees rather than the more complex ones. We're going to see next some modifications of the decision tree algorithm that bias us towards simpler trees. In other words, let's try to avoid complex trees and focus on simpler trees. And the question is, how do we do that?
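The selection rule the lecture describes can be sketched in a few lines of Python. The tree names, error numbers, and the depth-as-complexity measure below are all illustrative assumptions, loosely echoing the four trees the lecture compares; `occams_razor_pick` is a hypothetical helper, not part of any library:

```python
# Hypothetical candidates: (name, training error, validation error, depth).
# Numbers are made up for illustration, mimicking the lecture's four trees.
candidates = [
    ("shallow",       0.25, 0.22,  2),
    ("moderate",      0.12, 0.15,  5),
    ("complex",       0.10, 0.15, 10),
    ("super complex", 0.01, 0.30, 25),
]

def occams_razor_pick(trees, tol=0.01):
    """Among trees whose validation error is within `tol` of the best,
    return the simplest one (here: smallest depth)."""
    best_val = min(t[2] for t in trees)
    near_best = [t for t in trees if t[2] <= best_val + tol]
    return min(near_best, key=lambda t: t[3])

# "moderate" and "complex" tie on validation error (0.15),
# so the tie is broken in favor of the shallower tree.
print(occams_razor_pick(candidates)[0])  # → moderate
```

Note that the rule never consults training error for the final pick: among trees that generalize equally well, only simplicity decides.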
There are two main approaches out there for doing that. One is what's called early stopping: instead of making your tree deeper and deeper and deeper, you somehow stop earlier to avoid overfitting and stop at a simpler tree. The more advanced approach, which tends to work much better, is called pruning, where you build a deeper tree and then chop off some leaves at the bottom, some decisions that were not that important, and come up with a simpler tree that way. We're going to talk about both of these approaches in today's module. [MUSIC]
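To make the early stopping idea concrete, here is a toy sketch, assuming 1-D inputs and a crude midpoint split; `build_tree` and `tree_depth` are hypothetical helpers written for this illustration, not the algorithm the course develops:

```python
def majority(labels):
    """Most common label in a list (arbitrary winner on ties)."""
    return max(set(labels), key=labels.count)

def build_tree(points, labels, depth=0, max_depth=2):
    """Recursively split 1-D data at the median point.
    Early stopping: once depth reaches max_depth, return a leaf
    instead of splitting further, even if the node is impure."""
    if depth >= max_depth or len(set(labels)) == 1:
        return ("leaf", majority(labels))
    threshold = sorted(points)[len(points) // 2]  # crude split point
    left = [(p, l) for p, l in zip(points, labels) if p < threshold]
    right = [(p, l) for p, l in zip(points, labels) if p >= threshold]
    if not left or not right:  # no useful split possible: stop here too
        return ("leaf", majority(labels))
    return ("node", threshold,
            build_tree([p for p, _ in left], [l for _, l in left], depth + 1, max_depth),
            build_tree([p for p, _ in right], [l for _, l in right], depth + 1, max_depth))

def tree_depth(tree):
    if tree[0] == "leaf":
        return 0
    return 1 + max(tree_depth(tree[2]), tree_depth(tree[3]))

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 1, 0, 1]  # noisy labels a deep tree would memorize
shallow = build_tree(xs, ys, max_depth=2)   # early stopping at depth 2
deep = build_tree(xs, ys, max_depth=10)     # effectively no stopping
print(tree_depth(shallow), tree_depth(deep))
```

With the depth cap, the tree stops at depth 2 and tolerates some impure leaves; without it, the recursion keeps splitting until every leaf is pure, chasing the noise in the labels. Pruning would instead build the deep tree first and then collapse subtrees that don't help validation error back into leaves.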