We've now seen classifiers like logistic regression, and we've also seen decision trees, which are great ways to build classifiers that fit data. In this module we're going to talk about something called boosting, which is an amazing kind of machine learning trick that applies to any classifier. It takes a classifier's quality, how well it fits the data, its error, and boosts it, making it better and better by combining multiple classifiers. The approach we're going to discuss today has had an amazing impact in the machine learning world. It's pretty simple, there's a little bit of math but not too bad, it's easy to implement, and it's going to make a huge difference when you go out into real-world data.

The way to think intuitively about boosting is that we start out with weak classifiers, things like a simple logistic regression model, a shallow decision tree, or maybe even a decision stump. What can we say about these simple, weak classifiers? In general, they have some good properties. They have low variance, so they tend not to overfit, which is awesome. However, they tend to have high bias, so they don't fit our data really well. A decision stump, for example, is not likely to give you really great accuracy.

If you look at the learning curves associated with such models, let's say a logistic regression model, you'll see that if you start from a very simple weak classifier, you don't have a very good fit to the data, so you have high training error. The training error decreases as you add more and more features. However, as we know, the true error decreases and then increases again as you start to overfit. Our goal here is to find the optimal tradeoff between bias and variance.

Now, we know that weak classifiers are great because they have low variance, but we need something a little stronger in order to get good quality, that is, low test error. And the fundamental question is, how do we do that? How do we go from a weak classifier to something that has lower error?
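Before we get to the answer, here is a minimal sketch of the tradeoff just described, not taken from the lecture and using an illustrative synthetic dataset: a decision stump underfits (high bias, high training and test error), while a fully grown tree drives training error down but overfits, so its test error stops improving.

```python
# Minimal sketch (illustrative assumptions: synthetic data, arbitrary depths).
# A depth-1 stump underfits; an unlimited-depth tree overfits.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [1, 3, None]:  # stump, shallow tree, fully grown tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train error={1 - tree.score(X_train, y_train):.3f}, "
          f"test error={1 - tree.score(X_test, y_test):.3f}")
```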
One approach is to add more features. For example, if we're using polynomial features in logistic regression, we can add second-order polynomials, third-order polynomials, fourth-order polynomials, and so on, though we have to be careful to avoid overfitting. In decision trees, we can go deeper and deeper. But the fundamental question in boosting is, is there something else we can do that helps us improve our classifiers, that takes a weak classifier and makes it a stronger classifier?

The idea of boosting starts from a beautiful mathematical conjecture, or question, that Kearns and Valiant posed in 1988. The question was, can we take weak classifiers, let's say decision stumps, and boost them, combine them together in some kind of voting scheme, in order to get a stronger classifier out of that? And amazingly, Rob Schapire and others, just a year or two later, came up with an algorithm called boosting that showed this was indeed possible. This algorithm really changed all of machine learning. In fact, it's had an amazing impact on the whole field. It's a default approach for many computer vision tasks and for many systems deployed in industry today, for example, systems that figure out what search results to show you when you search for something on a search engine, or what ads to show you when you visit a page or search for something. So boosting has had tremendous impact. It's the technique that wins a lot of those machine learning competitions you might see out there. There's a company called Kaggle that runs a bunch of those competitions, and boosting wins more than half of them. So it's a really exciting, really useful approach for doing machine learning.
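To make the voting idea concrete, here is a hedged sketch, not from the lecture, of combining many decision stumps by weighted voting. It uses scikit-learn's AdaBoostClassifier, whose default base learner is a depth-1 decision tree, i.e. a stump. The synthetic dataset and parameter choices are illustrative assumptions; the boosting algorithm itself is developed later in the module.

```python
# Sketch of the boosting idea: many weak stumps combined by weighted voting.
# Dataset and n_estimators are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One weak classifier on its own: a single decision stump.
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)

# Many stumps combined by boosting (AdaBoost's default base learner is a stump).
boosted = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("single stump test error:", round(1 - stump.score(X_test, y_test), 3))
print("boosted stumps test error:", round(1 - boosted.score(X_test, y_test), 3))
```

On data like this, the boosted ensemble typically reaches a noticeably lower test error than any single stump, which is exactly the kind of "boost" the name refers to.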