1 00:00:00,000 --> 00:00:04,463 [MUSIC] 2 00:00:04,463 --> 00:00:08,316 So far in this module, we've discussed learning decision trees, but 3 00:00:08,316 --> 00:00:12,360 we only used what are called categorical inputs or features. 4 00:00:12,360 --> 00:00:19,390 So, for example, credit could be poor, fair, or excellent. 5 00:00:19,390 --> 00:00:23,210 However, if you look at income, that's what's called a real-valued feature. 6 00:00:23,210 --> 00:00:25,215 It takes continuous values. 7 00:00:25,215 --> 00:00:29,351 So $105,000, $73,000, $69,000, and so on. 8 00:00:29,351 --> 00:00:32,860 So the question is, how do you build a decision tree with this kind of input? 9 00:00:34,400 --> 00:00:37,440 One natural approach is to just treat income, or 10 00:00:37,440 --> 00:00:41,130 any continuous-valued feature, as if it were categorical data. 11 00:00:41,130 --> 00:00:46,280 So let's take that root node with 40 data points and just split on income, 12 00:00:46,280 --> 00:00:47,700 and see what happens. 13 00:00:47,700 --> 00:00:51,395 Well, there is one data point with income of $30,000. 14 00:00:51,395 --> 00:00:54,693 There's one data point with income of $31,400. 15 00:00:54,693 --> 00:00:57,760 There's one data point with income of $39,500, and so on. 16 00:00:57,760 --> 00:01:01,390 And it turns out that the nodes that we get out of this split 17 00:01:01,390 --> 00:01:06,520 basically only have one data point in them. 18 00:01:06,520 --> 00:01:09,180 And this can be really, really bad. 19 00:01:09,180 --> 00:01:12,960 When you have very few data points in an intermediate node of the decision tree, 20 00:01:12,960 --> 00:01:14,960 you're very prone to overfitting, 21 00:01:14,960 --> 00:01:17,410 very prone to making predictions you cannot trust.
22 00:01:17,410 --> 00:01:22,730 So, for example, if you look here, you'd predict in this case 23 00:01:22,730 --> 00:01:27,850 that if your income is $30,000, this is definitely a risky loan, 24 00:01:27,850 --> 00:01:31,920 but if your income is $31,400, it's definitely a safe loan. 25 00:01:31,920 --> 00:01:36,400 However, if your income is now $39,500, you're back to risky. 26 00:01:36,400 --> 00:01:43,440 So [LAUGH] it's risky, safe, risky, which doesn't make any sense. 27 00:01:43,440 --> 00:01:45,040 Do you trust it? 28 00:01:45,040 --> 00:01:46,080 I wouldn't. 29 00:01:46,080 --> 00:01:50,920 And so the question is, how do we deal with these real-valued features? 30 00:01:52,070 --> 00:01:55,730 As a very natural alternative, we can work with threshold splits. 31 00:01:55,730 --> 00:02:00,430 These simply pick a threshold on the value 32 00:02:00,430 --> 00:02:03,466 of that continuous-valued feature, so let's say $60,000. 33 00:02:03,466 --> 00:02:07,490 On the left side of that split, we'll put all the data points that have income 34 00:02:07,490 --> 00:02:09,700 lower than $60,000, and on the right, 35 00:02:09,700 --> 00:02:13,050 we'll put all the data points that have income higher than or equal to $60,000. 36 00:02:13,050 --> 00:02:17,988 And as we can see, we have a subset of the data here, with income of at least $60,000. 37 00:02:17,988 --> 00:02:23,220 And for those, we have many data points there. 38 00:02:23,220 --> 00:02:28,160 So there's a lot less risk of overfitting, and we see that 14 of 39 00:02:28,160 --> 00:02:33,070 them are safe loans, so we'd probably predict safe there, 40 00:02:33,070 --> 00:02:39,150 while 13 are risky under $60,000, so maybe you'd predict those as risky. 41 00:02:39,150 --> 00:02:42,520 So this is a very natural kind of split that we might want to do 42 00:02:42,520 --> 00:02:43,590 with continuous-valued data.
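The threshold split described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `threshold_split` and `majority_label` are hypothetical, and the six data points are a toy set. The first three labels follow the risky, safe, risky example from the lecture; the labels for the three higher incomes are made up.

```python
from collections import Counter


def threshold_split(points, feature, threshold):
    """Return (left, right): feature < threshold vs. feature >= threshold."""
    left = [p for p in points if p[feature] < threshold]
    right = [p for p in points if p[feature] >= threshold]
    return left, right


def majority_label(points):
    """Predict the most common label among the points in a node."""
    return Counter(p["label"] for p in points).most_common(1)[0][0]


# Toy data: labels for the last three incomes are assumptions, not from the lecture.
loans = [
    {"income": 30_000, "label": "risky"},
    {"income": 31_400, "label": "safe"},
    {"income": 39_500, "label": "risky"},
    {"income": 69_000, "label": "safe"},
    {"income": 73_000, "label": "safe"},
    {"income": 105_000, "label": "safe"},
]

left, right = threshold_split(loans, "income", 60_000)
print(len(left), majority_label(left))
print(len(right), majority_label(right))
```

Each side of the split now holds several data points, so the majority vote in a node is less sensitive to any single loan than it is with one node per income value.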
43 00:02:45,140 --> 00:02:47,790 Let's now take a moment to visualize what happens when we do this kind of 44 00:02:47,790 --> 00:02:49,050 threshold split. 45 00:02:49,050 --> 00:02:54,490 So for example, I've laid out my income data along this line here, which 46 00:02:54,490 --> 00:02:59,500 ranges from $10,000 to $120,000. If we pick a threshold split of $60,000, 47 00:02:59,500 --> 00:03:03,800 everything to the left of the split has income less than $60,000, and we're 48 00:03:03,800 --> 00:03:05,920 going to predict those to be risky loans. 49 00:03:05,920 --> 00:03:10,050 Everything to the right has income of at least $60,000, and we're going to predict those 50 00:03:10,050 --> 00:03:11,050 as being safe loans. 51 00:03:12,960 --> 00:03:16,190 Now let's suppose that we have two continuous-valued features. 52 00:03:16,190 --> 00:03:21,070 We have income on the y-axis and we have age on the x-axis, and 53 00:03:21,070 --> 00:03:22,110 let's see what happens here. 54 00:03:22,110 --> 00:03:26,540 And you'll see there are some positive and negative examples laid out in 2D. 55 00:03:26,540 --> 00:03:29,760 Another thing that's interesting is that you see that 56 00:03:29,760 --> 00:03:32,500 older people with higher incomes tend to be safe loans, but 57 00:03:32,500 --> 00:03:35,590 also younger people that may have lower incomes, those might also be 58 00:03:35,590 --> 00:03:39,430 safe loans because those people may make more money over time, let's say. 59 00:03:39,430 --> 00:03:44,980 So we might look at this data and decide to split on age first. 60 00:03:44,980 --> 00:03:49,460 And if we split on age, let's say at age 38, we'll see that for 61 00:03:49,460 --> 00:03:51,590 the folks that are younger than 38, 62 00:03:51,590 --> 00:03:55,590 more of them have risky loans, so you might predict risky. 63 00:03:56,610 --> 00:04:00,010 But for the folks that have age of at least 38, 64 00:04:00,010 --> 00:04:01,800 we have more safe loans than risky.
65 00:04:01,800 --> 00:04:02,920 So we might predict safe. 66 00:04:04,230 --> 00:04:06,860 Now to the next split in our decision tree. 67 00:04:06,860 --> 00:04:11,050 For the folks that have age of at least 38, we 68 00:04:11,050 --> 00:04:16,250 might split on income and ask whether the income is at least $60,000 or not. 69 00:04:16,250 --> 00:04:19,440 And so we put a split there. 70 00:04:19,440 --> 00:04:24,440 And we'll see that the points with income below $60,000, 71 00:04:24,440 --> 00:04:28,550 even at the higher age, might be negative, so might be predicted negative. 72 00:04:30,590 --> 00:04:34,400 So let's take a moment to visualize the decision tree we've learned so far. 73 00:04:34,400 --> 00:04:42,095 So we start from the root node over here and we make our first split. 74 00:04:42,095 --> 00:04:46,959 And for our first split, we decided to split on age. 75 00:04:46,959 --> 00:04:53,185 And the two possibilities we looked at were, 76 00:04:53,185 --> 00:04:57,700 is the age smaller 77 00:04:59,000 --> 00:05:05,350 than 38, or is the age greater than or equal to 38? 78 00:05:05,350 --> 00:05:09,070 So that was our first threshold split. 79 00:05:09,070 --> 00:05:13,400 And for those with age smaller than 38, let's say that we stopped right here. 80 00:05:13,400 --> 00:05:17,170 We'd see that there are five risky and three safe, 81 00:05:17,170 --> 00:05:18,550 so we'd predict risky. 82 00:05:20,370 --> 00:05:21,690 So that might be our leaf here. 83 00:05:22,890 --> 00:05:30,030 And for age greater than or equal to 38, we took another split, which was on income. 84 00:05:30,030 --> 00:05:35,520 And we just asked ourselves, is the income 85 00:05:35,520 --> 00:05:42,430 less than $60,000, or is it greater than or equal to $60,000? 86 00:05:42,430 --> 00:05:46,170 Now for the ones that have income greater than or 87 00:05:46,170 --> 00:05:51,770 equal to $60,000 and age greater than or equal to 38, we predicted those were safe loans.
88 00:05:52,790 --> 00:05:55,890 The ones that had age greater than or equal to 38 and 89 00:05:55,890 --> 00:06:00,130 income less than $60,000, we predicted to be risky loans. 90 00:06:01,420 --> 00:06:05,429 And this is an example of a tree where we're making 91 00:06:05,429 --> 00:06:09,527 these binary splits on the data for the continuous variables. 92 00:06:09,527 --> 00:06:13,625 [MUSIC]
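The two-level tree walked through above can be written as a small function. This is a minimal sketch, assuming the thresholds from the lecture (age 38, income $60,000); the name `predict_loan` and its argument names are hypothetical, and the thresholds are hard-coded here rather than learned from data.

```python
def predict_loan(age, income):
    """Follow the lecture's example tree: split on age first, then on income."""
    if age < 38:
        # Leaf with five risky and three safe loans, so the majority is risky.
        return "risky"
    # age >= 38: second threshold split, this time on income.
    if income < 60_000:
        return "risky"
    return "safe"


print(predict_loan(age=30, income=100_000))
print(predict_loan(age=45, income=40_000))
print(predict_loan(age=45, income=80_000))
```

Every internal node asks a single yes/no question about one feature, which is what makes these binary splits on continuous variables.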