[MUSIC] So now that we've talked about several models, let's dive into boosting more generally, and then a specific example of a boosting algorithm.

Think about a learning problem where we take some data, we learn a classifier which gives us some output f(x), and we use it to predict on some data, so we say that y hat is the sign of f(x). If we take some data and try to make a prediction, we might learn, say, a decision stump. Let's say we try to split on income: we look at folks with income greater than $100,000 and folks with income less than $100,000. And how do we learn the actual predictions that we make? Well, we look at the rows of our data where the income was greater than $100,000, and we see that for those, 3 were safe loans and 1 was a risky loan, so we're going to predict y hat = safe. Now, if we look at incomes of less than $100,000, we see that we have 4 safe and 3 risky, so again we predict safe; on both sides we're going to predict safe.

As it turns out, this first decision stump seems okay, but it doesn't seem great on the data. A single decision stump wasn't enough to capture the information in this limited data. What boosting will do is take that decision stump, evaluate it, look at how well it's doing on our data, and learn the next decision stump, the next weak classifier, and that next classifier is going to focus on the data points where the first one was bad. In other words, and this is a very important point, we're going to look at where we're making mistakes so far, and we want to increase the proportion, the impact, how much we care about the points where we've made mistakes. Then we learn another classifier that takes care of the points where we made mistakes, and then another one, and another one. And eventually, as we'll see, we'll actually converge to a great classifier.

What does it mean to focus on the data points where we made mistakes? It means that we're going to assign a weight alpha i, a positive number, to every data point in our dataset. When that weight is higher, it means that data point is more important. So you can imagine a learning problem where we have data, not just like we've done so far in this course, but data with weights associated with it.
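To make this concrete, here is a minimal sketch (not the course's own code) of how a stump's leaf prediction comes from a majority vote over the rows that fall in the leaf, written with an optional per-row weight so that ordinary counting is just the special case where every alpha i is 1; the function name is made up for illustration, and labels use +1 for safe and -1 for risky, with the counts from the example above.

```python
def stump_leaf_prediction(labels, weights=None):
    """Pick a leaf's prediction by (weighted) majority vote over its rows.
    labels: +1 (safe) / -1 (risky) per row; weights: alpha_i per row, 1.0 if omitted."""
    if weights is None:
        weights = [1.0] * len(labels)          # plain counting
    safe_total = sum(a for y, a in zip(labels, weights) if y == +1)
    risky_total = sum(a for y, a in zip(labels, weights) if y == -1)
    return +1 if safe_total >= risky_total else -1

# Unweighted split on income > $100,000, as in the example:
print(stump_leaf_prediction([+1, +1, +1, -1]))              # 3 safe vs 1 risky -> +1 (safe)
print(stump_leaf_prediction([+1, +1, +1, +1, -1, -1, -1]))  # 4 safe vs 3 risky -> +1 (safe)
```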
And now we're going to learn from weighted data. What does it mean to learn from weighted data? The way to think about it is that those alpha i's correspond to data points counting more than one time when they are greater than one. So, for example, if alpha i is 2, you can think of that data point as counting twice. If the weight were a half, you could think of that data point as counting as half a data point. But everything in the learning algorithm stays exactly the same; it's just that instead of counting data points, you count the weights of the data points.

So what happens in our decision stump approach? We had that first decision stump, and it was not great, especially for folks with lower income. So we learn weights that are higher for the places where we made mistakes, and now we learn a new decision stump. Let's say again we split on income, greater than $100,000 versus less than $100,000. When we look at the classification decisions that we make: for incomes greater than $100,000, we sum the weights of the data points that were safe. In this case we're summing 0.5, 0.8, and 0.7, which adds up to 2, while for the risky ones the total is 1.2, so we're going to predict y hat = safe. So it's the weighted sum of the data points. For income less than $100,000, same kind of idea: we go through the data points, look at the ones that were risky and the ones that were safe, and sum up the weights of each. We see that the total weight of the risky loans is 6.5 and the total weight of the safe loans is 3, so we're going to predict risky. This decision stump is now going to be a bit better, so we're going to combine it with the previous one, and others and others, until we create this ensemble classifier.

Now, this idea of learning from weighted data is not just about decision stumps. It turns out that most machine learning algorithms accept weighted data. So I'm going to show you very briefly what happens if you are doing logistic regression and you have weighted data. If you look at the equation on the bottom here, that's exactly the derivative of logistic regression, the update that we compute; this is the thing that you would implement if you run a logistic regression model.
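The slide with that equation isn't reproduced in the transcript, but as a rough sketch, the derivative being referred to is the standard logistic regression log-likelihood derivative with respect to each coefficient: a sum over data points of h_j(x_i) times (1[y_i = +1] minus P(y = +1 | x_i, w)). The array and function names below are made up for illustration.

```python
import numpy as np

def logistic_log_likelihood_derivative(H, y, w):
    """Derivative of the log-likelihood with respect to each coefficient w_j.
    H: (N, D) matrix of features h_j(x_i); y: (N,) labels in {+1, -1}; w: (D,) coefficients."""
    scores = H @ w
    prob_positive = 1.0 / (1.0 + np.exp(-scores))   # P(y = +1 | x_i, w)
    indicator = (y == +1).astype(float)             # 1[y_i = +1]
    # The sum over data points lives in this matrix product, one term per row of H.
    return H.T @ (indicator - prob_positive)
```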
Now you might say, I have this weighted data, do I have to reimplement everything from scratch? My god, my god, my god. It turns out that it's very simple. Look at the middle of the equation: we have a sum over data points. Instead of just summing over data points, we're going to weigh the contribution of each data point. So we just multiply each term in the sum by its weight alpha i, and we're done; we now have logistic regression for weighted data.

So we've shown you two examples, decision stumps and logistic regression, but in general it's quite easy to learn from weighted data. Boosting can be viewed as a greedy algorithm for learning an ensemble from data. We train the first classifier, let's say f1(x); if you just have f1(x), you just predict the sign of f1 to be your output y hat. Then you re-weight your data, putting more weight on the data points where f1 makes mistakes. Now we train another classifier, f2, and we learn the coefficients for these classifiers, so our prediction, if we just do two steps of this, is w hat 1 times f1 plus w hat 2 times f2, and the sign of that is y hat. So that is the idea: keep adding new classifiers, re-weighting the data to focus on the more difficult data points, and learning the coefficients that combine the different classifiers. [MUSIC]
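To tie the pieces together, here is a minimal sketch of that greedy loop. The weak learner is a depth-1 decision tree (a stump) from scikit-learn, which accepts per-data-point weights through its sample_weight argument, and the coefficient and re-weighting formulas used here are the standard AdaBoost choices (the specific boosting algorithm covered next), not necessarily the only option; labels are assumed to be in {+1, -1}.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, num_rounds=10):
    """Greedy ensemble learning: repeatedly fit a weak classifier on weighted data,
    score it, and re-weight the data toward its mistakes. y must be in {+1, -1}."""
    N = X.shape[0]
    alpha = np.ones(N) / N                      # data-point weights, start uniform
    classifiers, coefficients = [], []
    for t in range(num_rounds):
        # Train weak classifier f_t on the current weighted data.
        f_t = DecisionTreeClassifier(max_depth=1)
        f_t.fit(X, y, sample_weight=alpha)
        pred = f_t.predict(X)
        # Weighted classification error and coefficient w_hat_t (AdaBoost formula).
        weighted_error = np.sum(alpha * (pred != y)) / np.sum(alpha)
        w_hat_t = 0.5 * np.log((1 - weighted_error) / weighted_error)
        classifiers.append(f_t)
        coefficients.append(w_hat_t)
        # Increase alpha_i where f_t was wrong, decrease it where f_t was right.
        alpha = alpha * np.exp(-w_hat_t * y * pred)
        alpha = alpha / np.sum(alpha)
    return classifiers, coefficients

def predict(classifiers, coefficients, X):
    """y_hat = sign(w_hat_1 * f_1(x) + w_hat_2 * f_2(x) + ...)."""
    scores = sum(w * f.predict(X) for w, f in zip(coefficients, classifiers))
    return np.sign(scores)
```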