[MUSIC] The second approach we see when dealing
with missing data is purification by what's called imputation: filling in the missing
values, the question marks, with some best guesses of
what might have happened. I'm personally not a hoarder. I don't collect a lot of things. I don't really care about stuff that much. But when it comes to data,
I really feel bad about throwing it away. I don't want to throw away data. And so going down from nine data points
to six data points, this pains me. I don't think it's a really good idea. Imputation is an alternative to this. It's to say, okay, instead of
throwing away those question marks, let's try to get a best guess of where
that question mark value might be and just fill in those values. And this is the kind of
approach you might take. You take your input data which might
have some missing values in it. You take your best guess at filling
in those missing values and now you have a data set where
everything has been filled in. For example, in the nine-row data set
here we have three question marks: the term is unknown for
these three rows. I fill in those values with my best
guess, which might be a three-year loan for all three. Why did I choose a three-year
loan to fill them in? Well, if you look at the original data set, there were four three-year loans
versus two five-year loans, so three-year was my best guess. You can think of it
as my best simple guess: I took a simple approach and filled in
whatever was most popular. So the way that imputation might
work in this simple approach: the rule might say that for
categorical data, like excellent, fair, poor, or three-year, five-year, you just put in the most popular value,
which is called the mode of the distribution. For numerical data, I would suggest you put in either
the average or the median value. Now, these are just simple heuristics. There are many more advanced and
interesting ways to impute missing values. There's something called
expectation-maximization, or the EM algorithm, which
does this in a very interesting way. But we've just described a very
simple approach in this course. Addressing missing data by imputation
has advantages and disadvantages. It's easy to understand and implement. It can be applied to any model because
after you fill in your data you just feed it into any algorithm you have; you don't
have to modify anything, so that's great. And it can be used at prediction time,
because whenever you hit a question mark you fill it in the same way
you did with the training data. So if you have a question mark for
term you just fill in three years, a three-year loan, just like you
did in the training data. So that's great. However, imputation like this, especially
the simple imputation that I described, can be extremely problematic,
because it introduces a bias. For every question mark in term
we put in three years. We don't use any other information;
we just plug in the same value. And this could result in
really bad systematic errors. So let's step back and take an example. I live in the state of Washington in
the US, and let's say that in the state of Washington it's illegal to put
the age into the loan application. That means that if the loan applications
come from the state of Washington, age is always a question mark. In other states maybe not a question mark,
but age is always a question mark here. If you train a model
across the United States, everybody from the state of Washington
will have a question mark for age. You're going to fill in the average
age on the application, let's say 40, and then you're going to
believe that everybody in the state of Washington who is applying for
a loan is age 40. And that's going to
introduce a systematic bias into the loan applications
in the state of Washington. And that's going to lead to
all sorts of weird behavior, unhappy people, bad predictions, bad idea. So imputation like this has its pluses but it's also a complicated idea because
it can introduce terrible biases. So in the third part of this module, we're
going to talk about an alternative that can address some of the challenges
of the first two methods. [MUSIC]
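
The simple rule described in this section (mode for categorical features, median or average for numerical ones, with the same fill-in values reused at prediction time) can be sketched roughly as follows. This is a minimal illustration, not the course's implementation: the helper names `fit_imputer` and `impute`, the use of `None` for a question mark, and all of the age values are assumptions made up for the example. Only the term counts (four three-year loans, two five-year loans, three unknown) come from the lecture.

```python
from statistics import median, mode

MISSING = None  # stands in for a "?" entry (an assumption of this sketch)


def fit_imputer(rows, categorical, numerical):
    """Learn one fill-in value per column from the training rows."""
    fill = {}
    for col in categorical:
        observed = [r[col] for r in rows if r[col] is not MISSING]
        fill[col] = mode(observed)    # most popular observed value
    for col in numerical:
        observed = [r[col] for r in rows if r[col] is not MISSING]
        fill[col] = median(observed)  # the average would also be an option
    return fill


def impute(row, fill):
    """Return a copy of `row` with every missing entry replaced."""
    return {col: (fill[col] if val is MISSING and col in fill else val)
            for col, val in row.items()}


# Nine training rows: four 3-year loans, two 5-year loans, three unknown terms.
terms = ["3 yrs", "5 yrs", "3 yrs", None, "3 yrs", None, "5 yrs", None, "3 yrs"]
ages = [25, 40, None, 52, 40, 33, 61, 40, 28]  # hypothetical ages
train = [{"term": t, "age": a} for t, a in zip(terms, ages)]

fill = fit_imputer(train, categorical=["term"], numerical=["age"])
train_filled = [impute(r, fill) for r in train]

# At prediction time, reuse the values learned from the training data.
new_applicant = impute({"term": None, "age": 37}, fill)

# Hypothetical applicants from a state where age is never recorded:
# every one of them ends up with exactly the same imputed age.
washington = [{"term": "5 yrs", "age": None}, {"term": "3 yrs", "age": None}]
washington_ages = {impute(r, fill)["age"] for r in washington}

print(fill)
print(new_applicant)
print(washington_ages)
```

The last part of the sketch shows the systematic bias from the Washington example: because the fill-in value ignores everything else about the applicant, every row with a missing age receives the same number.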