We'll start with a high-level outline of the learning algorithm for decision trees, and then we'll dig into some core details in order to make it happen.

Typically, we'll start with an empty tree, where you have some data: some loans labeled as safe, others labeled as risky. This node in the tree, which is called the root node, contains the whole data set, and what we show here is a histogram of the number of loans that are safe or risky.

Given all the data, I'm going to pick a feature to split on and split the tree. For example, here I'm splitting the tree on credit, and credit can take three possible values. So I can take the subset of data that corresponds to excellent credit and feed it down the left branch of the tree. I can take the subset of data that corresponds to fair credit, which goes down the middle branch, and then poor credit is the subset of data that goes down the right branch. And as you can see, there are more data points with, say, poor credit than there are with excellent credit.

Now that we've assigned subsets of data to each of the subsequent nodes, we can visualize that data using histograms again. The top histogram here corresponds to all the data, but the histogram on the left branch corresponds to the data points where credit was excellent. The one in the middle corresponds to data points where credit was fair, and the one on the right corresponds to data points where credit was poor. As you can see, the set where credit was excellent has a lot of safe loans, whereas the one where credit was poor has more risky loans than safe ones.

Next, we'll continue splitting each one of these branches, each one of these datasets. However, if you consider the first node, the one where credit was excellent, you see that all of the examples in this node are labeled as safe. These are all safe loans, so there's no point in continuing to split: we just predict that everything with excellent credit is a safe loan.

For the other two, we need to keep going. For example, for the loans where the credit rating is fair, we're going to build the next subtree going down from there.
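To make this splitting step concrete, here is a minimal Python sketch of partitioning loans by one feature and tallying safe/risky counts per branch, mirroring the histograms described above. The field names ("credit", "label") and the toy records are illustrative assumptions, not data from the lecture.

```python
from collections import Counter, defaultdict

def split_on_feature(data, feature):
    """Partition records by the value of one feature (e.g., 'credit')."""
    branches = defaultdict(list)
    for row in data:
        branches[row[feature]].append(row)
    return branches

# Hypothetical toy loan data; field names are illustrative only.
loans = [
    {"credit": "excellent", "label": "safe"},
    {"credit": "excellent", "label": "safe"},
    {"credit": "fair", "label": "safe"},
    {"credit": "fair", "label": "safe"},
    {"credit": "fair", "label": "risky"},
    {"credit": "poor", "label": "risky"},
    {"credit": "poor", "label": "risky"},
    {"credit": "poor", "label": "safe"},
]

for value, subset in split_on_feature(loans, "credit").items():
    counts = Counter(row["label"] for row in subset)  # per-branch histogram
    pure = len(counts) == 1  # all labels agree -> no need to split further
    print(value, dict(counts), "pure" if pure else "keep splitting")
```

Running this prints a label histogram per credit value; the "excellent" branch comes out pure, which is exactly the case where the lecture stops splitting and predicts "safe".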
And for those data points where the credit rating was poor, again, we'll keep on building subtrees from there. So that's the recursive greedy algorithm that we're going to discuss and build today.

We can now discuss, at a high level, the greedy algorithm for learning a decision tree, and it is a very simple algorithm. I start, let's say, with an empty tree, and I pick a feature to split on. In our case, we split on credit: we decided to take the data and split it into the data points that have excellent credit, the ones that have fair credit, and the ones that have poor credit. Then, for each subset of data, excellent, fair, and poor, I continue thinking about what to do next. I decided that in the case of excellent credit there was nothing else to do, so I stopped. But in the other two cases there was more to do, and what I do is what's called a recursion: I go back to step two, but only look at the subset of the data that has fair credit, and then only look at the subset of data that has poor credit.

Now, if you look at this algorithm so far, it sounds a little abstract, but there are a few points that we need to make more concrete. We have to decide how to pick the feature to split on. We split on credit in our example, but we could have split on something else, like the term of the loan or the applicant's income. And then, since we have recursion here at the end, we have to figure out when to stop recursing, when not to go and expand another node in the tree. We'll discuss these two important tasks in the rest of this module; the sketch below shows how the pieces fit together.
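As a rough sketch of that recursion, assuming the same toy record format as in the earlier snippet, the skeleton might look like the following. Note that the feature-choice rule here is just a placeholder, since choosing the best split is exactly the question the rest of the module answers.

```python
from collections import Counter

def majority_label(rows):
    """Most common label among the rows (e.g., 'safe' vs. 'risky')."""
    return Counter(r["label"] for r in rows).most_common(1)[0][0]

def build_tree(rows, features):
    labels = {r["label"] for r in rows}
    # Stopping conditions: the node is pure (all safe or all risky),
    # or there are no features left to split on.
    if len(labels) == 1 or not features:
        return {"predict": majority_label(rows)}
    # Greedy step: pick a feature to split on. How to choose well is
    # covered later in the module; as a placeholder we take the first
    # remaining feature.
    feature = features[0]
    remaining = [f for f in features if f != feature]
    node = {"split_on": feature, "children": {}}
    for value in {r[feature] for r in rows}:
        subset = [r for r in rows if r[feature] == value]
        # Recursion: go back to "step two" on each branch's subset.
        node["children"][value] = build_tree(subset, remaining)
    return node
```

For instance, `build_tree(loans, ["credit"])` on the toy loans list from the earlier sketch produces a pure "safe" leaf for excellent credit and majority-vote leaves for the fair and poor branches.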