[MUSIC] From here, we're going to explore one of the most fundamental issues that comes up in practice with data: missing data. So here I have a dataset, but unlike what we've done so far in this course and specialization, where we assumed that all data is observed, if you look at the second row, you see a question mark for that person. You don't know whether the loan they took was a three-year loan or a five-year loan. So what do we do when we have missing data? We're going to talk about several extremely practical ways to address it, including a modification of decision trees where the tree is learned to take missing data into account. It makes decisions that depend not only on observed values, like whether your credit was excellent, fair, or poor, but also on what to do when your credit is unobserved. Those techniques are going to be extremely useful in practice and widely applicable.

The seventh module is going to be amazing. We're going to look at a question that was asked by Kearns and Valiant in 1988. In fact, Valiant is a Turing Award winner, so this is a fundamental question. The question was: can you combine simple classifiers in a way that gives you the performance of a really complex classifier? That question, which was purely theoretical, was answered in the positive a couple of years later by Schapire, using something called boosting, an amazing algorithm that has had an incredible impact in practice. In fact, if you know what a Kaggle competition is, one of those online machine learning competitions, more than half of the winners use boosting in their solutions. Boosting is a simple technique that has really changed the world, and we're going to learn the fundamentals of the technique, and you're going to be able to implement it yourself.

We're going to talk about one kind of boosting algorithm called AdaBoost, where you take the outputs of many classifiers. For example, this decision tree here might say that a loan is likely to be okay: safe, +1. But you might have another one that says no, it's risky, -1. And you have others that might say +1 or -1, and you take the vote of many classifiers in what's called an ensemble. So boosting is about building these ensembles where many classifiers vote, and we want to learn how to weight the votes to get the best possible prediction. By learning those weights with a boosting algorithm, we're going to be able to start from very simple classifiers but come up with very complex decision boundaries. And this is exactly the technique that wins most of those Kaggle competitions.
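To make the ensemble-voting idea concrete, here is a minimal sketch in Python. The votes and coefficients below are hypothetical, and this is only the weighted-vote step, not the full AdaBoost learning procedure covered in the module; the point is just how weighted votes from simple classifiers combine into one prediction.

```python
import numpy as np

def ensemble_predict(weak_predictions, weights):
    """Combine +1/-1 votes from simple classifiers using coefficient weights.

    weak_predictions: array of shape (n_classifiers, n_examples) with +/-1 votes
    weights:          array of shape (n_classifiers,), e.g. learned by boosting
    """
    weighted_sum = weights @ weak_predictions   # weighted vote for each example
    return np.sign(weighted_sum)                # final prediction: +1 (safe) / -1 (risky)

# Hypothetical votes from three simple classifiers on four loan applications.
votes = np.array([
    [+1, +1, -1, -1],   # classifier 1 (e.g., a one-split decision tree)
    [+1, -1, -1, +1],   # classifier 2
    [-1, +1, -1, -1],   # classifier 3
])
coefficients = np.array([0.8, 0.5, 0.3])  # hypothetical weights a boosting algorithm might learn

print(ensemble_predict(votes, coefficients))  # [ 1.  1. -1. -1.]
```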
In the eighth module, we're going to step back and look at fundamental concepts in machine learning. One is called precision and recall. Let me give you an example. Say I own a restaurant and I want to increase the number of guests I have, the number of customers, by 30%. How do I do that? Well, I'm going to start a marketing campaign, but I don't want to be just like everybody else; I want it to be an authentic, nice marketing campaign. So I want to use the reviews of my restaurant to find great things to say, and every time somebody enters a review, a sentence gets shown on my website saying how great my restaurant is. So given the reviews, I want to predict which sentences are the most positive, like "easily the best sushi in Seattle," and also which people say such great things about the restaurant that we should showcase them.

When you have a setting like this, accuracy is not a good metric. What you really care about is what's called precision and recall. Precision asks: if I pick out a few sentences from the reviews and show them on my website, how likely is it that I'm going to show a really negative sentence? Because if I show a bad sentence, like "the sushi was terrible," that's really bad for my website. So precision makes sure that I show only positive sentences. Recall, on the other hand, is about finding all the great positive things that people are saying. So if a classifier has good precision and good recall, it means that I find all the great sentences, and I only show great sentences about my restaurant. We're going to talk about that in quite a lot of detail, because precision and recall are what you will most likely use if you build a classifier in practice. They're what basically every company that builds classifiers uses as its core metrics.

And in the final module, we're going to address the issue of scalability. How do we scale to really massive datasets? As you can see, the number of web pages on the web is growing tremendously; there are about 4.8 billion pages today. There are about 500 million tweets per day. Hey, follow me on Twitter, by the way. I send about one a day, maybe less. And if you think about YouTube, there are about 5 billion video views every day. So there's tons of data out there, and gradient-based methods don't tend to scale very well when you have massive amounts of data. So what we're going to show is a technique called stochastic gradient, which converges to the solution much faster than standard gradient methods. It's just a very small modification to the gradient algorithm, and it gives you amazing performance. In this simple example from sentiment analysis, we see over 100 times faster performance on the same dataset. However, stochastic gradient is an extremely finicky technique to get to work right. There are many practical issues that you need to address to make it work. So we're going to talk about the technique and explain why it works, but also explain the practical issues that you must address in order to get it to work well. So as you can see, it's going to be an action-packed course that covers a wide range of topics in machine learning. [MUSIC]
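To make the precision and recall definitions above concrete, here is a minimal sketch in Python with hypothetical labels (+1 for a positive sentence, -1 for a negative one):

```python
def precision_recall(true_labels, predicted_labels):
    """Precision and recall for the positive (+1) class.

    Assumes at least one predicted positive and one actual positive.
    """
    true_positives = sum(1 for t, p in zip(true_labels, predicted_labels)
                         if t == +1 and p == +1)
    predicted_positives = sum(1 for p in predicted_labels if p == +1)
    actual_positives = sum(1 for t in true_labels if t == +1)

    precision = true_positives / predicted_positives  # of the sentences I show, how many are truly positive?
    recall = true_positives / actual_positives        # of all positive sentences, how many did I find?
    return precision, recall

# Hypothetical predictions on six review sentences.
truth      = [+1, +1, +1, -1, -1, +1]
prediction = [+1, +1, -1, -1, +1, +1]
print(precision_recall(truth, prediction))  # (0.75, 0.75)
```

And here is a rough sketch, assuming a logistic regression model with hypothetical data, of the difference between a full-gradient update and a stochastic-gradient update. It is not the exact algorithm from the module, but it illustrates why a stochastic step is so much cheaper than a full pass over a massive dataset.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def full_gradient_step(w, X, y, step_size):
    # One step using the gradient of the log-likelihood over the entire
    # dataset (y in {0, 1}); expensive when X has millions of rows.
    grad = X.T @ (y - sigmoid(X @ w))
    return w + step_size * grad

def stochastic_gradient_step(w, X, y, step_size, rng):
    # One update using a single randomly chosen example: cheap and noisy,
    # but it makes progress long before a full pass over the data finishes.
    i = rng.integers(len(y))
    grad_i = X[i] * (y[i] - sigmoid(X[i] @ w))
    return w + step_size * grad_i

# Hypothetical usage on made-up data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))              # hypothetical feature matrix
y = (rng.random(1000) < 0.5).astype(float)  # hypothetical 0/1 labels
w = np.zeros(3)
w = stochastic_gradient_step(w, X, y, step_size=0.1, rng=rng)
```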