[MUSIC] At the core of boosting is the idea of an ensemble classifier. So let's revisit the idea of, say, a decision stump. A decision stump takes our data as input and might look at some particular feature, let's say, income, and ask, is income greater than 100K? If yes, we say the loan is safe, and if not, we might say the loan is risky. And so the output here, y hat, is +1 or -1, depending on whether we think the loan is safe or risky. We're going to use the following notation just to be clear in today's module: we're going to use f(x) to denote one of these weak, simple classifiers, or a classifier in general.

What boosting does is start not from just a single decision stump, but combine many of them together in order to improve the prediction. So, for example, you may have one decision stump that splits on income, another one that splits on credit history, another on how much savings you have, another on market conditions, and so on. So given a particular input, for example, your income is 120K, the first decision stump, f1, is going to vote that the loan is safe, so it's going to vote +1. But your credit history is bad, so the second decision stump, f2, is going to vote -1. Say the third one, because your savings are low, is going to vote risky, -1, and the fourth one, market conditions, turned out to be good, so it's going to vote +1. And the question boosting asks is, how do you combine, in an ensemble way, these -1, +1 votes from the different classifiers, f1 through f4 in this case?

The way we combine them is with a set of weights, which is what defines the ensemble model. The weights here are w1, w2, w3, w4. We multiply the output of each classifier by its weight, add them all up, and then take the sign of that sum. If the sign is positive, we output +1; if the sign is negative, we output -1. So it's kind of a weighted voting schema. This weighted voting schema is really quite powerful, so let's see a little bit more of what it looks like.

Let's suppose that we have a particular set of weights and multiple decision stumps, so classifiers that have each provided their vote, and let's see more explicitly what the output value would be. For example, if the weight of each classifier is as shown on the right here, then the output we predict is the sign of w1, which is 2, multiplied by the vote of f1, which is +1, plus w2, which is 1.5, so we believe the second decision stump is not as important as the first one, and the vote there was -1, so that multiplies -1. For the third one, again we say the weight is 1.5, so it's not as important as the first one but as important as the second one, and the prediction there was -1. And finally, for the last one, we might say the weight is 0.5, so this is the least important of all of these classifiers, and the output of f4 here was +1, a safe loan. So we have these votes and these weights, we multiply them together, and we see that the output of this F classifier for the input xi is the sign of 2 - 1.5 - 1.5 + 0.5 = -0.5, which implies that y hat i is -1. So even though there were two positive votes and two negative votes, and the most important classifier, the first one, voted positive, when you add up the weighted votes, you get a risky loan as the output.
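As a quick sketch of that arithmetic, here is how the weighted vote from the example could be computed in code. The weights and votes are the ones from the walkthrough above; the variable names are just illustrative, not code from the course.

```python
import numpy as np

# Weights and votes from the worked example above (illustrative only).
weights = np.array([2.0, 1.5, 1.5, 0.5])   # w1..w4: coefficient of each stump
votes   = np.array([+1, -1, -1, +1])       # f1(x)..f4(x): each stump's vote

score = np.dot(weights, votes)             # 2 - 1.5 - 1.5 + 0.5 = -0.5
y_hat = np.sign(score)                     # sign(-0.5) = -1 -> risky loan

print(score, y_hat)                        # -0.5 -1.0
```

The negative sign of the -0.5 total is what makes the ensemble call the loan risky, even though the single highest-weight classifier voted safe.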
Now, this is a simple example of what's called an ensemble classifier, or the combination of multiple classifiers. As we'll see, this kind of ensemble model, this kind of combination, is what everybody uses in industry to solve complex classification problems. Just to make sure we're all on the same page, I'm going to formally define the ensemble learning problem. We're given some data, where y is either +1 or -1 (there's also a multiclass version of this, but we're just going to talk about +1 or -1 in today's module), and we have some input, x. We have some data that allows us to learn f1, f2, all the way to fT, which are the T weak classifiers, or just classifiers, that we learn from data, and then some coefficients that we also learn from data, w-hat1, w-hat2, all the way to w-hatT. Once we've learned them, making a prediction is very similar to what you do for logistic regression or a linear classifier: it's just the sign of the weighted sum of the votes from each classifier. So if you look at this, it looks a lot like a linear classification model, logistic regression and all of those, exactly the same. However, not only are we learning the ws from data, we're actually learning the features. In those models, we had hs to represent our features. Here, the features are these fs, the weak classifiers that we're going to learn from data. So we can think about boosting as an approach to learn features from data. And it's really super exciting. [MUSIC]
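To make the general prediction rule concrete, here is a minimal sketch of an ensemble predictor under the definition above. The ensemble_predict function, the example stumps, and the feature names are all hypothetical illustrations, not course code.

```python
def ensemble_predict(x, classifiers, coefficients):
    """Predict sign of the weighted sum of the T classifier votes.

    classifiers:  list of functions f_1..f_T, each mapping x to +1 or -1
    coefficients: list of learned weights w_hat_1..w_hat_T
    """
    score = sum(w * f(x) for w, f in zip(coefficients, classifiers))
    # Breaking a tie at exactly zero in favor of +1 is an arbitrary convention here.
    return +1 if score >= 0 else -1

# Tiny usage example with two hypothetical decision stumps on a feature dict:
stumps = [
    lambda x: +1 if x["income"] > 100_000 else -1,   # income stump
    lambda x: +1 if x["credit"] == "good" else -1,   # credit-history stump
]
coeffs = [2.0, 1.5]
print(ensemble_predict({"income": 120_000, "credit": "bad"}, stumps, coeffs))  # +1
```

The prediction step is just this weighted vote; what boosting adds, as the module goes on, is a way to learn both the fs and the w-hats from data.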