So, let's compare boosted decision stumps to a decision tree. This is the decision tree plot that we saw in the decision tree module. It's on a real dataset based on loan applications, and we see that the training error tends to go down, down, and down as we make the decision tree deeper. But the test error, which is kind of related to the true error, goes down for a while, and you have here the best depth, which is maybe seven, but eventually it goes back up. So let's say at depth 18 over here, the training error is down to 8%. That's a really low training error, but the test error has gone up to 39%, and we observe overfitting. In fact, we observe a huge gap here. One thing to note, by the way, is that the best decision tree has a classification error on the test set of about 34% or so.

Now, let's see what happens with decision stumps that we boost on the same dataset. You get a plot that looks kind of like this: you see that the training error keeps decreasing per iteration. So this is the training error, and as the theorem predicts, it decreases. The test error in this case is actually going down with iterations as well, so we're not observing overfitting, at least not yet. After 18 rounds of boosting, so 18 decision stumps, we have 32% test error, which is better than a decision tree. In fact, it's better than the best decision tree, and the overfitting is much better behaved. This gap here is much smaller, and the gap between your training error and the test error is kind of related to that overfitting quantity that we have.

Now, we said we're not observing overfitting yet, so let's run boosted stumps for more iterations and see what happens. Let's see what happens when we keep on boosting and adding more decision stumps. On the x-axis, this is adding more and more decision stumps, more trees instead of only one tree, and we see what happens to the classification error. We see here that the training error keeps decreasing, just like we expected from the theorem, but the test error has stabilized. The test performance tends to stay constant, and in fact it stays constant for many iterations. So if I pick T anywhere in this range, we will do about the same; any of these values for T would be fine.

So now we've seen how boosting seems to be stabilizing, but the question is, do we observe overfitting in boosting? In fact, we do observe overfitting with boosting, but boosting tends to be quite robust to overfitting. So if you use somewhat too many trees or too few trees, there's a huge range where that's really okay. Here, I'm going up to 5,000 decision stumps with the type of boosting we're doing here. The best test error was about 31%. But if I keep going too far, out to 5,000, the test error is going up, but it doesn't go up that much. It goes up to 33%, so there is some overfitting here.

Very importantly, in these examples I've been showing you the test error and talking mostly about what happens with overfitting. But as we know, we need to be very careful about how we pick the parameters of our algorithm. So how do we pick capital T, which is when we stop boosting? Do we use five decision stumps, or 5,000 decision stumps? This is just like selecting the magic parameters of all sorts of algorithms: almost every model out there has a parameter that trades off complexity against the quality of the fit, such as the depth for a decision tree, the number of features or the magnitude of the weights in logistic regression, and here the number of rounds of boosting.
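To make the comparison concrete, here is a minimal sketch using scikit-learn, which is not the exact setup from the lecture: a synthetic dataset stands in for the loan applications data, AdaBoostClassifier (whose default base learner is a depth-1 tree, i.e. a stump) stands in for the boosting we've been discussing, and staged_predict lets us watch the training and test error after each round of boosting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for the loan applications dataset (assumption: illustration only).
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A single deep decision tree: very low training error, but a large train/test gap.
deep_tree = DecisionTreeClassifier(max_depth=18).fit(X_train, y_train)
print("deep tree  train error: %.3f" % (1 - deep_tree.score(X_train, y_train)))
print("deep tree  test error:  %.3f" % (1 - deep_tree.score(X_test, y_test)))

# Boosted decision stumps: AdaBoostClassifier's default base learner is a depth-1 tree.
T = 100  # capital T, the number of rounds of boosting
boosted_stumps = AdaBoostClassifier(n_estimators=T).fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., T rounds of boosting,
# so we can watch the training error fall while the test error stabilizes.
for t, (yhat_tr, yhat_te) in enumerate(
        zip(boosted_stumps.staged_predict(X_train),
            boosted_stumps.staged_predict(X_test)), start=1):
    if t % 20 == 0:
        print("T=%3d  train error: %.3f  test error: %.3f"
              % (t, np.mean(yhat_tr != y_train), np.mean(yhat_te != y_test)))
```

On data like this you would typically see the deep tree's train/test gap come out much wider than the boosted stumps' gap, mirroring the plots above, though the exact numbers depend on the dataset.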
So, just like lambda in regularization, we can't use the training data, because the training error tends to go down with iterations of boosting, so you would say that T should be infinite, that it should be out here, really big. And you should never, never, never, ever, ever, ever, ever use the test data, so that's bad. I was just showing you illustrative examples, but you should never do that. So, what should we do? Well, you should use a validation set if you have a lot of data: if you have a big dataset, you select a subpart of it just to pick the magic parameters. And if your dataset is not that large, you should use cross-validation. In the regression course, we talked about how to use validation sets and how to use cross-validation to pick magic parameters like lambda, the depth of a decision tree, or the number of rounds of boosting, capital T.
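As a sketch of that validation-set approach, the snippet below again uses scikit-learn with a synthetic stand-in for the loan data (an assumption, not the lecture's setup): it carves out a validation set, computes the validation error after each round of boosting with staged_predict, and picks the capital T that minimizes it, touching the test set only once at the end. With a smaller dataset you would swap the validation split for cross-validation, for example GridSearchCV over n_estimators.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for the loan data (assumption: illustration only).
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Train / validation / test split: the test set is never used to choose T.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

T_max = 200
model = AdaBoostClassifier(n_estimators=T_max).fit(X_train, y_train)

# Validation error after each round of boosting; pick the T that minimizes it.
val_errors = [np.mean(yhat != y_val) for yhat in model.staged_predict(X_val)]
best_T = int(np.argmin(val_errors)) + 1
print("chosen T = %d, validation error = %.3f" % (best_T, val_errors[best_T - 1]))

# Only now look at the test set, once, for the chosen T.
test_preds = list(model.staged_predict(X_test))[best_T - 1]
print("test error at chosen T: %.3f" % np.mean(test_preds != y_test))
```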