Next, we're going to take a few minutes to talk about the famous AdaBoost theorem and what its practical implications are, which are extremely interesting.

So, if you remember, the way that boosting came about is that Kearns and Valiant had this famous question: can you combine weak classifiers to make a stronger one? Valiant, actually, is a Turing Award winner, a really famous professor from Harvard. And Schapire, a couple of years later, came up with the idea of boosting, which really changed the machine learning field.

So if you look at the iterations of boosting on the x axis here, and the training error for our dataset on the y axis, we observe a very practical effect that we see in a lot of boosting runs. We start with a really bad training error: the first decision stump has a training error of 22.5%. So, not good at all. After thirty iterations, as we saw a little earlier, we get all the data points right in our training data, and I showed you that crazy decision boundary it produces. So we see a smooth transition where the classification error tends to go down, eventually reaches zero, and actually stays at zero. And that's a key insight of the boosting theorem.

So here is the famous AdaBoost theorem, which underlies all the choices we made in the algorithm and has really had a lot of impact on machine learning. Basically, the theorem says that under some technical conditions, and all theorems are like this, it's the "some restrictions apply, see store for details" part, the training error of the classifier goes to zero as the number of iterations, that is, the number of models considered, capital T, goes to infinity. So in other words, pictorially, we expect the training error, which is this thing here on the y axis, to eventually go to zero.

Now, that's eventually. It might oscillate a little bit in the middle; however, what the theorem says is that it tends to generally decrease, eventually become zero, usually, and then stick at zero. So we're going to see it high in the beginning, wiggle, wiggle, wiggle, but it tends to go down, then hit a certain value, usually zero, and it sticks at zero.
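For reference, the bound behind this theorem is usually written as follows; this is the standard Freund and Schapire analysis, which the lecture states only informally. Writing epsilon_t for the weighted error of the weak learner picked at round t (the condition discussed next) and gamma_t = 1/2 - epsilon_t for its edge over random guessing, the training error of the boosted classifier after T rounds satisfies

\[
\frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big[\hat{y}_i \ne y_i\big]
\;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t\,(1-\epsilon_t)}
\;=\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^{2}}
\;\le\; \exp\!\Big(-2 \sum_{t=1}^{T} \gamma_t^{2}\Big).
\]

So if every weak learner beats random guessing by at least some fixed gamma > 0, the bound is at most exp(-2 gamma^2 T), which goes to zero as T grows; that is exactly the "eventually hits zero and sticks there" behavior in the plot.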
Now, let me just take a minute to talk about those technical conditions, the "see store for details" part. It turns out the technical condition is something quite important. It says that at every iteration t, we can find a weak learner, so a decision stump, that has weighted error at least a little bit lower than 0.5. So it is at least better than random. That's the condition the theorem needs.

It seems like, intuitively, we should always be able to find a classifier that's better than random, even on the weighted data. But it turns out that this is not always possible. So here is a counterexample, which is a really extreme counterexample, but there are other cases where it's not possible. For example, if you have a negative point here, and you have a positive point right on top of it, there's never going to be a decision stump that can separate the positive point from the negative point underneath it. So the conditions of the boosting algorithm, that is, of the AdaBoost theorem, might not be satisfied.

However, nevertheless, boosting usually takes your training error to zero, or somewhere quite low, as the number of iterations goes to infinity. So we do observe that decrease in practice, although because of the technical conditions in the theorem it might not go exactly to zero, but it can get really, really low.
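To see how this condition shows up inside the algorithm, here is a minimal from-scratch AdaBoost sketch in Python. This is an illustration only, not the course's own code, and the toy dataset from make_classification is an assumption. The stump coefficient is only positive when the weighted error is below 0.5, and the loop prints the training error that the theorem says should head toward zero.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy 2-feature dataset with labels converted to {-1, +1}
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
y = 2 * y - 1

alpha = np.ones(len(y)) / len(y)   # data point weights, start uniform
score = np.zeros(len(y))           # running weighted vote of the stumps

for t in range(30):
    stump = DecisionTreeClassifier(max_depth=1)       # the weak learner
    stump.fit(X, y, sample_weight=alpha)
    pred = stump.predict(X)

    # Weighted error of this stump; the theorem's condition is err < 0.5
    err = np.sum(alpha[pred != y]) / np.sum(alpha)
    if err == 0.0 or err >= 0.5:
        break                      # perfect stump, or condition violated

    w = 0.5 * np.log((1 - err) / err)   # stump coefficient, positive iff err < 0.5

    # Reweight: mistakes get heavier, correct points lighter, then normalize
    alpha *= np.exp(-w * y * pred)
    alpha /= alpha.sum()

    score += w * pred
    train_error = np.mean(np.sign(score) != y)
    print(f"iteration {t + 1:2d}: weighted error {err:.3f}, "
          f"training error {train_error:.3f}")

On a typical run the printed training error wiggles a bit but drops toward zero and then stays there, which is the behavior described above. If at some round no stump beats random guessing on the weighted data, as in the stacked-points counterexample, the loop simply stops and the guarantee no longer applies, even though in practice the error often still ends up quite low.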