[MUSIC] So now that we've talked about several models, let's dive into boosting more generally, and then into a specific example of a boosting algorithm. Think about a learning problem where we take some data, learn a classifier that gives us some output f(x), and use it to predict: y hat = sign(f(x)). So if we take some data and try to make a prediction, we might learn a decision stump. Let's say we try to split on income: we look at folks with income greater than $100,000 and folks with income less than $100,000. And how do we learn the actual predictions that we make? Well, we look at the rows of our data where income was greater than $100,000, and we see that 3 were safe loans and 1 was a risky loan, so we're going to predict y hat = safe. Now, if we look at incomes of less than $100,000, we see that we have 4 safe and 3 risky, so again we predict safe; so on both sides we're going to predict safe. As it turns out, this first decision stump seems okay, but it doesn't seem great on the data. The decision stump wasn't enough to capture the information in this limited data. What boosting will do is take that decision stump, evaluate it, look at how well it's doing on our data, and learn a next decision stump, a next weak classifier, and that next classifier is going to focus on the data points where the first one was bad. In other words, and this is a very important point, we're going to look at where we're making mistakes so far, and we want to increase the proportion, the impact, how much we care about the points where we've made mistakes. Then we learn another classifier that takes care of the points where we made mistakes, and then another one, and another one. And eventually, as we'll see, we'll actually converge to a great classifier. Now, what does it mean to learn from the data points where we made mistakes?
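The stump described above can be sketched in a few lines of code. The data here is hypothetical, chosen only to reproduce the counts in the example (3 safe and 1 risky above $100,000; 4 safe and 3 risky below), so both sides end up predicting safe by majority vote:

```python
# A minimal sketch of the decision stump from the lecture example.
# The income values and labels are made up to match the stated counts.
from collections import Counter

def learn_stump_predictions(incomes, labels, threshold=100_000):
    """Majority-vote prediction on each side of the income split."""
    above = [y for x, y in zip(incomes, labels) if x > threshold]
    below = [y for x, y in zip(incomes, labels) if x <= threshold]
    predict_above = Counter(above).most_common(1)[0][0]
    predict_below = Counter(below).most_common(1)[0][0]
    return predict_above, predict_below

incomes = [120_000, 150_000, 110_000, 130_000,              # income > $100K
           60_000, 80_000, 90_000, 50_000, 70_000, 40_000, 95_000]
labels  = ["safe", "safe", "safe", "risky",                 # 3 safe, 1 risky
           "safe", "safe", "safe", "safe", "risky", "risky", "risky"]
print(learn_stump_predictions(incomes, labels))             # ('safe', 'safe')
```

Because 4 of 7 low-income loans are safe, the stump predicts safe on both sides, which is exactly why it captures so little of the data's structure.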
What it means is that we're going to assign a weight alpha i, a positive number, to every data point in our dataset. When that weight is higher, it means that data point is more important. So you can imagine a learning problem where we have data, not just like we've seen so far in this course, but data with weights associated with it, and now we're going to learn from weighted data. The way to think about learning from weighted data is that those alpha i's correspond to data points counting more than one time when the weight is greater than one. So for example, if alpha i is 2, you can think of that data point as counting twice. If the weight were one half, you could think of it as counting as half a data point. But everything in the learning algorithm stays exactly the same; it's just that instead of counting data points, you count the weights of the data points. So what happens in our decision stump approach? We had that first decision stump, and it was not great, especially for folks with lower income, so we learn weights, which are higher for the places where we've made mistakes, and now we learn a new decision stump. Let's say again we split on income: greater than $100,000 versus less than $100,000. Now look at the classification decisions we make. For income greater than $100,000, we sum the weights of the safe data points with income greater than $100,000. In this case we're summing 0.5, 0.8, and 0.7, which adds up to 2, while for the risky ones the total is 1.2, and so we're going to predict y hat = safe. So it's the weighted sum of the data points. For income less than $100,000, same kind of idea. We go through the data points, look at the ones that were risky and the ones that were safe, and sum up the weights of each. And we see that the total weight of the risky loans is larger than the total weight of the safe loans, which is 3, so here we're going to predict y hat = risky.
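This weighted stump can be sketched the same way: instead of counting labels, we sum weights. The weights below are hypothetical, chosen so that above the threshold the safe weights sum to 0.5 + 0.8 + 0.7 = 2.0 against 1.2 of risky weight (predict safe), while below the threshold the risky weight exceeds the safe total of 3 (predict risky), matching the example:

```python
# A sketch of learning a decision stump from weighted data.
# Incomes, labels, and weights are illustrative, not the course's dataset.
def weighted_stump_prediction(incomes, labels, weights, threshold=100_000):
    """On each side of the split, predict the class with the largest total weight."""
    def side_prediction(on_side):
        totals = {}
        for x, y, a in zip(incomes, labels, weights):
            if on_side(x):
                totals[y] = totals.get(y, 0.0) + a
        return max(totals, key=totals.get)
    return (side_prediction(lambda x: x > threshold),
            side_prediction(lambda x: x <= threshold))

incomes = [120_000, 150_000, 110_000, 130_000,          # income > $100K
           60_000, 80_000, 90_000, 50_000, 70_000, 40_000, 95_000]
labels  = ["safe", "safe", "safe", "risky",
           "safe", "safe", "safe", "safe", "risky", "risky", "risky"]
weights = [0.5, 0.8, 0.7, 1.2,                          # safe total 2.0 vs risky 1.2
           1.0, 0.5, 0.5, 1.0, 1.5, 1.2, 1.0]          # safe total 3.0 vs risky 3.7
print(weighted_stump_prediction(incomes, labels, weights))  # ('safe', 'risky')
```

Notice the only change from the unweighted stump is that each data point contributes its weight alpha i rather than a count of 1, which is exactly the point made above.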
So this decision stump is now going to be a bit better, and we're going to combine it with the previous one, and others and others, until we create this ensemble classifier. Now, this idea of learning from weighted data is not just about decision stumps. It turns out that most machine learning algorithms can accept weighted data. So I'm going to show you very briefly what happens if you're doing logistic regression and you have weighted data. If you look at the equation on the bottom here, that's exactly the derivative of logistic regression, the update that we do. This is the thing you would implement if you ran a logistic regression model. Now you say, I have this weighted data, I have to reimplement everything from scratch. My god, my god, my god. And it turns out that it's very simple. If you look at the middle of the equation, we have a sum over data points. To handle weights, we're just going to weigh the contribution of each data point: we add that weight alpha i to each term in the sum, and we're done. We now have logistic regression for weighted data. So we've shown you two examples, decision stumps and logistic regression, but in general it's quite easy to learn from weighted data. So boosting can be viewed as a greedy algorithm for learning an ensemble from data. We train the first classifier, say f1(x); if you just have f1(x), you predict the sign of f1 as your output y hat. Then you re-weight your data by weighting more the data points where f1 makes mistakes. And now we train another classifier, f2, and we learn the coefficients w hat for these classifiers, and our prediction, if we just do 2 steps of this, is w hat 1 times f1 plus w hat 2 times f2, and the sign of that is y hat.
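Both pieces of this can be sketched in code. The gradient below is for the standard logistic regression log-likelihood with labels in {+1, -1}; as described above, the only change for weighted data is multiplying each data point's term in the sum by its weight alpha i. All the names here are illustrative, not the course's own code:

```python
import numpy as np

def weighted_logistic_gradient(X, y, w, alpha):
    """Gradient of the log-likelihood for weighted data.
    X: (n, d) feature matrix; y: (n,) labels in {+1, -1};
    w: (d,) coefficients; alpha: (n,) per-point weights.
    The alpha factor on each term is the only change from the
    unweighted gradient."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))     # P(y = +1 | x, w)
    indicator = (y == 1).astype(float)     # 1[y_i = +1]
    return X.T @ (alpha * (indicator - p))

def ensemble_predict(x, classifiers, coefficients):
    """y_hat = sign(w_hat_1 * f1(x) + w_hat_2 * f2(x) + ...)."""
    score = sum(c * f(x) for f, c in zip(classifiers, coefficients))
    return 1 if score > 0 else -1
```

With all weights alpha i equal to 1, the gradient reduces to the ordinary logistic regression derivative, and ensemble_predict with two classifiers is exactly the two-step prediction sign(w hat 1 f1(x) + w hat 2 f2(x)) described above.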
So that's the idea: keep adding new classifiers, re-weighting the data to focus on the more difficult data points, and learning the coefficients between the different classifiers. [MUSIC]