[MUSIC] Next, we're going to take a few minutes to talk about the famous AdaBoost theorem and what its practical implications are, which are extremely interesting. So, if you remember, the way that boosting came about is that Kearns and Valiant had this famous question: can you combine weak classifiers to make a stronger one? Actually, Valiant is a Turing Award winner, so a really famous professor from Harvard. And Schapire, a couple of years later, came up with the idea of boosting, which really changed the machine learning field.

And so if you look at the iterations of boosting on the x axis here, and the training error for our dataset on the y axis, we observe a very practical effect that we see with a lot of boosted models. We're going to start with a really bad training error. The first decision stump has a training error of 22.5%, so not good at all. After thirty iterations, as we saw a little earlier, we're going to get all the data points right in our training data, and I showed you that kind of crazy decision boundary for it. So we see a smooth transition where the classification error tends to go down, the value reaches zero, and it actually stays at zero. And that's a key insight of the boosting theorem.

So here is the famous AdaBoost theorem, which underlies all the choices we made in the algorithm and really has had a lot of impact on machine learning. Basically, the theorem says that under some technical conditions, and all theorems are like this, kind of like the fine print that says "some restrictions apply, see store for details," the training error of the classifier goes to zero as the number of iterations, the number of models considered, capital T, goes to infinity. So in other words, pictorially, we expect the training error, which is this thing here on the y axis, to eventually go to zero. Now, that's eventually. It might oscillate a little bit in the middle; however, what the theorem says is that it tends to generally decrease, eventually become zero, usually, and then stick at zero. So we're going to see it start high, wiggle, wiggle, wiggle, but tend to go down, then hit a certain value, usually zero, and stick at zero.

Now, let me just take a minute to talk about those technical conditions, the "see store for details" part. It turns out the technical condition is something quite important. It says that at every iteration t, we can find a weak learner, so a decision stump, that has weighted error at least a little bit lower than 0.5, so it is at least better than random. That's what the theorem requires. It seems like, intuitively, we should always be able to find a classifier that does better than random, even on weighted data. But it turns out that this is not always possible. So, here is a counterexample, which is a really extreme one, but there are other examples where it's not possible. For example, if you have a negative point here, and you have a positive point right on top of it, there's never going to be a decision stump that can separate a positive point sitting on top of a negative point. So the conditions of the AdaBoost theorem might not be satisfied. Nevertheless, boosting usually takes your training error to zero, or somewhere quite low, as the number of iterations goes to infinity. So, we do observe that decrease in practice, although because of those technical conditions it might not go exactly to zero, but it can get really, really low. [MUSIC]
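
For reference, here is the usual formal version of the bound behind this theorem (the standard Freund and Schapire result; the lecture only states it informally, and the notation below, such as the boosted classifier F_T and the edge gamma_t, is my own labeling). If at every iteration t the weak learner's weighted error is epsilon_t = 1/2 - gamma_t with gamma_t > 0, then:

```latex
% Standard AdaBoost training-error bound (Freund & Schapire).
% \hat{F}_T is the boosted classifier after T iterations;
% \epsilon_t = \tfrac{1}{2} - \gamma_t is the weighted error of the t-th weak learner.
\[
\mathrm{trainerr}\big(\hat{F}_T\big)
\;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t\,(1-\epsilon_t)}
\;=\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^{2}}
\;\le\; \exp\!\Big(-2\sum_{t=1}^{T} \gamma_t^{2}\Big)
\]
```

So as long as each weak learner stays at least a little better than random (gamma_t bounded away from zero), the right-hand side shrinks exponentially, which is why the training error eventually hits zero and sticks there.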
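
If you want to see this curve yourself, here is a minimal sketch of the experiment using scikit-learn rather than the course's own tools; the synthetic dataset and all parameter values below are illustrative assumptions, not the lecture's data.

```python
# Minimal sketch: watch training error vs. number of boosting iterations,
# as described in the lecture. Dataset and settings are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic 2-class data standing in for the lecture's dataset (an assumption).
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# By default AdaBoostClassifier boosts decision stumps (depth-1 trees).
model = AdaBoostClassifier(n_estimators=30, random_state=0)
model.fit(X, y)

# staged_predict yields predictions after 1, 2, ..., T weak learners,
# so we can watch the training error trend downward (it may wiggle).
for t, y_pred in enumerate(model.staged_predict(X), start=1):
    train_error = np.mean(y_pred != y)
    print(f"iteration {t:2d}: training error = {train_error:.3f}")
```

On data like the overlapping-points counterexample above, the printed error will level off somewhere above zero instead of reaching it, which is exactly the "technical conditions" caveat in action.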