[MUSIC] From here, we're going to explore one of the most fundamental issues that comes up in practice with data: missing data. So here I have a dataset, but unlike what we've done so far in this course and specialization, where we assumed that all data is observed, if you look at the second row, you see a question mark for that person. You don't know whether the loan they took was a three-year loan or a five-year loan. So what do we do when we have missing data? We're going to talk about several extremely practical ways to address it, including a modification of decision trees where the tree is learned to take missing data into account. It makes decisions that depend not only on observed values, like whether your credit was excellent, fair, or poor, but also on what to do when your credit is unobserved. Those techniques are going to be extremely useful in practice and widely applicable.

The seventh module is going to be amazing. We're going to look at a question that was asked by Kearns and Valiant in 1988. In fact, Valiant is a Turing Award winner, so this is a fundamental question. The question was: can you combine simple classifiers in a way that gives you the performance of a really complex classifier? That question, which was purely theoretical, was answered in the positive a couple of years later by Schapire, using something called boosting, an amazing algorithm that has had an incredible impact in practice. In fact, if you know what a Kaggle competition is, one of those online machine learning competitions, more than half of the winners use boosting in their solutions. Boosting is a simple technique that has really changed the world, and we're going to learn the fundamentals of the technique, and you're going to be able to implement it yourself.

We're going to talk about one kind of boosting algorithm called AdaBoost, where you take the outputs of many classifiers. For example, this decision tree here might say that a loan is likely to be okay: safe, +1. But you might have another one that says no, it's risky, -1. And you have others that might say +1 or -1, and you take the vote of many classifiers in what's called an ensemble. So boosting is about building these ensembles where many classifiers vote, and we want to learn how to weight the votes to get the best possible prediction. By learning those weights with a boosting algorithm, we're going to be able to start from very simple classifiers but come up with very complex decision boundaries. And this is exactly the technique that wins most of those Kaggle competitions.
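To make the ensemble-voting idea concrete, here is a minimal sketch in Python. The votes and coefficients below are hypothetical, and this is only the weighted-vote step, not the full AdaBoost learning procedure covered in the module; the point is just how weighted votes from simple classifiers combine into one prediction.

```python
import numpy as np

def ensemble_predict(weak_predictions, weights):
    """Combine +1/-1 votes from simple classifiers using coefficient weights.

    weak_predictions: array of shape (n_classifiers, n_examples) with +/-1 votes
    weights:          array of shape (n_classifiers,), e.g. learned by boosting
    """
    weighted_sum = weights @ weak_predictions   # weighted vote for each example
    return np.sign(weighted_sum)                # final prediction: +1 (safe) / -1 (risky)

# Hypothetical votes from three simple classifiers on four loan applications.
votes = np.array([
    [+1, +1, -1, -1],   # classifier 1 (e.g., a one-split decision tree)
    [+1, -1, -1, +1],   # classifier 2
    [-1, +1, -1, -1],   # classifier 3
])
coefficients = np.array([0.8, 0.5, 0.3])  # hypothetical weights a boosting algorithm might learn

print(ensemble_predict(votes, coefficients))  # [ 1.  1. -1. -1.]
```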
In the eighth module, we're going to step back and look at fundamental concepts in machine learning. One is called precision and recall. Let me give you an example. Say I own a restaurant and I want to increase the number of guests I have, the number of customers, by 30%. How do I do that? Well, I'm going to start a marketing campaign, but I don't want to be just like everybody else; I want it to be an authentic, nice marketing campaign. So I want to use the reviews of my restaurant to find great things to say, and every time somebody enters a review, a sentence gets shown on my website saying how great my restaurant is. So given the reviews, I want to predict which sentences are the most positive, like "easily the best sushi in Seattle," and also which people say such great things about the restaurant that we should showcase them.

When you have a setting like this, accuracy is not a good metric. What you really care about is what's called precision and recall. Precision asks: if I pick out a few sentences from the reviews and show them on my website, how likely is it that I'm going to show a really negative sentence? Because if I show a bad sentence, like "the sushi was terrible," that's really bad for my website. So precision makes sure that I show only positive sentences. Recall, on the other hand, is about finding all the great positive things that people are saying. So if a classifier has good precision and good recall, it means that I find all the great sentences, and I only show great sentences about my restaurant. We're going to talk about that in quite a lot of detail, because precision and recall are what you will most likely use if you build a classifier in practice. They're what basically every company that builds classifiers uses as its core metrics.

And in the final module, we're going to address the issue of scalability. How do we scale to really massive datasets? As you can see, the number of web pages on the web is growing tremendously; there are about 4.8 billion pages today. There are about 500 million tweets per day. Hey, follow me on Twitter, by the way. I send about one a day, maybe less. And if you think about YouTube, there are about 5 billion video views every day. So there's tons of data out there, and gradient-based methods don't tend to scale very well when you have massive amounts of data. So what we're going to show is a technique called stochastic gradient, which converges to the solution much faster than standard gradient methods. It's just a very small modification to the gradient algorithm, and it gives you amazing performance. In this simple example from sentiment analysis, we see over 100 times faster performance on the same dataset. However, stochastic gradient is an extremely finicky technique to get to work right. There are many practical issues that you need to address to make it work. So we're going to talk about the technique and explain why it works, but also explain the practical issues that you must address in order to get it to work well. So as you can see, it's going to be an action-packed course that covers a wide range of topics in machine learning. [MUSIC]
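To make the precision and recall definitions above concrete, here is a minimal sketch in Python with hypothetical labels (+1 for a positive sentence, -1 for a negative one):

```python
def precision_recall(true_labels, predicted_labels):
    """Precision and recall for the positive (+1) class.

    Assumes at least one predicted positive and one actual positive.
    """
    true_positives = sum(1 for t, p in zip(true_labels, predicted_labels)
                         if t == +1 and p == +1)
    predicted_positives = sum(1 for p in predicted_labels if p == +1)
    actual_positives = sum(1 for t in true_labels if t == +1)

    precision = true_positives / predicted_positives  # of the sentences I show, how many are truly positive?
    recall = true_positives / actual_positives        # of all positive sentences, how many did I find?
    return precision, recall

# Hypothetical predictions on six review sentences.
truth      = [+1, +1, +1, -1, -1, +1]
prediction = [+1, +1, -1, -1, +1, +1]
print(precision_recall(truth, prediction))  # (0.75, 0.75)
```

And here is a rough sketch, assuming a logistic regression model with hypothetical data, of the difference between a full-gradient update and a stochastic-gradient update. It is not the exact algorithm from the module, but it illustrates why a stochastic step is so much cheaper than a full pass over a massive dataset.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def full_gradient_step(w, X, y, step_size):
    # One step using the gradient of the log-likelihood over the entire
    # dataset (y in {0, 1}); expensive when X has millions of rows.
    grad = X.T @ (y - sigmoid(X @ w))
    return w + step_size * grad

def stochastic_gradient_step(w, X, y, step_size, rng):
    # One update using a single randomly chosen example: cheap and noisy,
    # but it makes progress long before a full pass over the data finishes.
    i = rng.integers(len(y))
    grad_i = X[i] * (y[i] - sigmoid(X[i] @ w))
    return w + step_size * grad_i

# Hypothetical usage on made-up data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))              # hypothetical feature matrix
y = (rng.random(1000) < 0.5).astype(float)  # hypothetical 0/1 labels
w = np.zeros(3)
w = stochastic_gradient_step(w, X, y, step_size=0.1, rng=rng)
```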