1 00:00:00,000 --> 00:00:04,570 [MUSIC] 2 00:00:04,570 --> 00:00:09,557 This next section will talk about how to pick that threshold split, 3 00:00:09,557 --> 00:00:15,343 like age less than 38 or income less than 60,000, for continuous valued features. 4 00:00:15,343 --> 00:00:17,210 We're going to make this an optional section. 5 00:00:18,310 --> 00:00:21,820 It's not super complicated, but it is a little bit laborious. 6 00:00:21,820 --> 00:00:25,140 So, if you're interested, definitely take a deep dive, but for 7 00:00:25,140 --> 00:00:27,130 those who want to skip it, it's totally okay. 8 00:00:28,450 --> 00:00:33,170 So the goal here is to ask, if I decide to split on, say, income, 9 00:00:33,170 --> 00:00:37,800 how do I choose the splitting point, t star, that we want to separate on, 10 00:00:37,800 --> 00:00:42,840 in our case 60,000, where we split the left and right side of the tree? 11 00:00:42,840 --> 00:00:47,120 Now, there are infinitely many values that t 12 00:00:47,120 --> 00:00:51,300 star could take, it could be $60,000, $50,999.99, and if this 13 00:00:52,380 --> 00:00:57,270 is truly continuous it could go to infinitely many decimal places. 14 00:00:57,270 --> 00:01:00,780 The question is, do we need to consider all of those? 15 00:01:00,780 --> 00:01:05,130 And do all those decimal places really affect the quality of our decision tree? 16 00:01:06,560 --> 00:01:08,570 Now if you think about it and 17 00:01:08,570 --> 00:01:13,060 if you look at the values that income can take on in the data, so the 18 00:01:13,060 --> 00:01:18,513 actual values of income, you'll see that if you take two points, say VA and VB, 19 00:01:18,513 --> 00:01:21,269 let's say 60,000 and 65,000, 20 00:01:21,269 --> 00:01:25,160 and there are no points in between, then whether the split is at 61,000, 62,000, 63,000, or 21 00:01:25,160 --> 00:01:30,530 64,000, you're still going to have the same classification error. 
22 00:01:30,530 --> 00:01:32,970 The points on the left of that split are always going to be the same, 23 00:01:32,970 --> 00:01:35,390 the points on the right of the split are always going to be the same. 24 00:01:35,390 --> 00:01:36,800 So all I have to do 25 00:01:36,800 --> 00:01:41,490 is consider the midpoint between any two consecutive data points that we have, and 26 00:01:41,490 --> 00:01:43,990 just consider those to be the possible splits for my data. 27 00:01:43,990 --> 00:01:46,100 And that's exactly what we're going to do. 28 00:01:47,100 --> 00:01:50,060 Let's now close the section by walking through the algorithm for 29 00:01:50,060 --> 00:01:54,200 picking the best splitting point for a particular feature. 30 00:01:54,200 --> 00:01:58,630 So let's say that I'm considering splitting on the feature here, hj, 31 00:01:58,630 --> 00:02:06,110 which might be, in our case, income. What I can do is go through all my data, 32 00:02:06,110 --> 00:02:10,870 so the column of values that income might take, and sort them, 33 00:02:10,870 --> 00:02:14,600 such that V1 is the lowest income, V2 is the next lowest, and 34 00:02:14,600 --> 00:02:16,350 VN is the highest income. 35 00:02:16,350 --> 00:02:20,780 And all I need to consider is the splitting points right in between V1 and 36 00:02:20,780 --> 00:02:22,460 V2, V2 and V3 and so on. 37 00:02:22,460 --> 00:02:28,233 So I walk from i = 1 through N-1 and then consider splitting point ti, 38 00:02:28,233 --> 00:02:34,779 which is the midpoint between Vi and Vi+1, and I ask what is the classification 39 00:02:34,779 --> 00:02:40,550 error if I were to build a decision tree, a decision stump in this case, 40 00:02:40,550 --> 00:02:45,203 that splits hj on ti, on greater than ti and less than ti? 41 00:02:45,203 --> 00:02:48,220 So greater than 60,000 and lower than 60,000. 42 00:02:48,220 --> 00:02:49,830 And then we'll pick 
43 00:02:49,830 --> 00:02:53,540 t star to be the split that leads to the decision stump 44 00:02:53,540 --> 00:02:55,480 with the lowest classification error. 45 00:02:55,480 --> 00:02:56,200 And that's it. 46 00:02:56,200 --> 00:03:00,073 Pretty simple algorithm, pretty easy to take from here. 47 00:03:00,073 --> 00:03:04,059 [MUSIC]
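
The procedure walked through above (sort the feature values, try the midpoint between each pair of neighbors, keep the one with the lowest classification error) can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the lecture: the names `best_threshold` and `majority_errors` are hypothetical, labels are assumed to be binary (+1 or -1), each side of the stump is assumed to predict its majority label, and points exactly equal to the threshold are assumed to go to the right side.

```python
from typing import List, Optional, Tuple


def majority_errors(labels: List[int]) -> int:
    """Number of mistakes made by predicting the majority label for every point."""
    positives = sum(1 for y in labels if y == 1)
    return min(positives, len(labels) - positives)


def best_threshold(values: List[float], labels: List[int]) -> Tuple[Optional[float], float]:
    """Return (t_star, classification_error) for one continuous feature.

    Hypothetical helper; assumes binary labels and majority-vote leaves.
    """
    # Sort the data by feature value: v_1 <= v_2 <= ... <= v_N.
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_error = None, float("inf")

    # Walk over the N-1 pairs of neighboring values.
    for i in range(n - 1):
        v_i, v_next = pairs[i][0], pairs[i + 1][0]
        if v_i == v_next:
            continue  # identical values cannot be separated by a threshold

        # Candidate split: the midpoint between v_i and v_{i+1}.
        t_i = (v_i + v_next) / 2.0

        # Decision stump: one leaf for values below t_i, one for the rest.
        left = [y for v, y in pairs if v < t_i]
        right = [y for v, y in pairs if v >= t_i]
        error = (majority_errors(left) + majority_errors(right)) / n

        # Keep the first candidate with the lowest classification error.
        if error < best_error:
            best_t, best_error = t_i, error

    return best_t, best_error


# Made-up example data: incomes with labels -1 (risky) and +1 (safe).
incomes = [30000, 45000, 60000, 65000, 80000, 105000]
labels = [-1, -1, -1, 1, 1, 1]
t_star, error = best_threshold(incomes, labels)
print(t_star, error)
```

On this made-up data, every threshold strictly between 60,000 and 65,000 produces the same left and right groups, which is why checking only the midpoint 62,500 is enough. Recomputing both sides for each candidate keeps the sketch close to the spoken description; a faster version could update running label counts while walking the sorted list.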