1
00:00:00,000 --> 00:00:04,917
[MUSIC]

2
00:00:04,917 --> 00:00:08,226
The second approach we see when dealing
with missing data is purification by

3
00:00:08,226 --> 00:00:09,900
what's called imputation.

4
00:00:09,900 --> 00:00:12,410
It's filling in the missing
values, the question marks,

5
00:00:12,410 --> 00:00:15,030
with some best guesses of
what might have happened.

6
00:00:16,940 --> 00:00:18,180
I'm personally not a hoarder.

7
00:00:18,180 --> 00:00:19,740
I don't collect a lot of things.

8
00:00:19,740 --> 00:00:22,400
I don't really care about stuff that much.

9
00:00:22,400 --> 00:00:27,750
But when it comes to data,
I really feel bad about throwing it away.

10
00:00:27,750 --> 00:00:29,960
I don't want to throw away data.

11
00:00:29,960 --> 00:00:34,330
And so going down from nine data points
to six data points, this pains me.

12
00:00:35,630 --> 00:00:36,930
I don't think it's a really good idea.

13
00:00:38,650 --> 00:00:41,380
Imputation is an alternative to this.

14
00:00:41,380 --> 00:00:45,040
It's to say, okay, instead of
throwing away those question marks,

15
00:00:45,040 --> 00:00:48,855
let's try to get a best guess of where
that question mark value might be and

16
00:00:48,855 --> 00:00:50,890
just fill in those values.

17
00:00:50,890 --> 00:00:53,260
And this is the kind of
approach you might take.

18
00:00:53,260 --> 00:00:56,300
You take your input data which might
have some missing values in it.

19
00:00:56,300 --> 00:00:59,820
You take your best guess at filling
in those missing values and

20
00:00:59,820 --> 00:01:02,770
now you have a data set where
everything has been filled in.

21
00:01:04,020 --> 00:01:09,130
For example, in a nine row data set
here we have these three question marks.

22
00:01:10,200 --> 00:01:14,730
And we could say the term is unknown for
these three question marks.

23
00:01:14,730 --> 00:01:20,420
Then I fill in those values with my best
guess, which might be a three-year loan.

24
00:01:22,170 --> 00:01:24,558
For the three values over here.

25
00:01:24,558 --> 00:01:26,555
Why did I choose three
year loan to fill it in?

26
00:01:26,555 --> 00:01:30,693
Well if you look at the original data set,

27
00:01:30,693 --> 00:01:36,723
there were four three year loans
versus two five year loans,

28
00:01:36,723 --> 00:01:39,810
so three year was my best guess.

29
00:01:41,560 --> 00:01:44,310
And you can think about it
as my best simple guess.

30
00:01:45,590 --> 00:01:47,370
So I just took a simple approach,

31
00:01:47,370 --> 00:01:49,420
just whatever was most
popular I filled those in.

32
00:01:50,920 --> 00:01:55,404
So the way that imputation might
work in this simple approach,

33
00:01:55,404 --> 00:02:00,141
the rule might say, for
categorical data like excellent, fair,

34
00:02:00,141 --> 00:02:02,860
poor, three year, five year.

35
00:02:02,860 --> 00:02:07,380
You just put in the most popular value, and
it's called the mode of the distribution.

36
00:02:07,380 --> 00:02:08,380
For numerical data,

37
00:02:10,200 --> 00:02:14,330
I would suggest you put in either
the average or the median value.
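
A minimal Python sketch of this simple rule, using only the standard library. The "?" marker, the function names, the row ordering of the term column, and the age column are all assumptions made for illustration; only the counts (four three-year loans, two five-year loans, three unknowns) come from the example.

```python
from statistics import mean, median, mode

MISSING = "?"  # assumption: missing entries are marked with a question mark


def fill_value(column, numeric_strategy=median):
    """Pick the value used to fill in the missing entries of one column.

    Categorical columns get the mode (most popular value).
    Numerical columns get the median by default; pass
    numeric_strategy=mean to use the average instead.
    """
    observed = [v for v in column if v != MISSING]
    if all(isinstance(v, (int, float)) for v in observed):
        return numeric_strategy(observed)
    return mode(observed)


def impute(column, value):
    """Return a copy of the column with every missing entry replaced."""
    return [value if v == MISSING else v for v in column]


# Nine-row term column: four 3-year loans, two 5-year loans, three unknown.
term = ["3 yrs", "5 yrs", "?", "3 yrs", "?", "3 yrs", "5 yrs", "?", "3 yrs"]
term_fill = fill_value(term)
filled_term = impute(term, term_fill)

# Hypothetical numerical column with one missing age.
age = [25, "?", 40, 55, 30]
age_fill = fill_value(age)
filled_age = impute(age, age_fill)
```

Here the term column is filled with the most popular value and the age column with the median of the observed ages.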

38
00:02:15,710 --> 00:02:17,805
Now, these are just simple heuristics.

39
00:02:17,805 --> 00:02:22,820
There're many more advanced and
interesting ways to impute missing values.

40
00:02:22,820 --> 00:02:25,560
There's something called
expectation-maximization,

41
00:02:25,560 --> 00:02:31,240
or the EM algorithm, which is an algorithm that
does this in a very interesting way.

42
00:02:31,240 --> 00:02:33,950
Now we just described a very
simple thing in this course.

43
00:02:35,860 --> 00:02:40,400
Addressing missing data by imputation
has advantages and disadvantages.

44
00:02:40,400 --> 00:02:43,189
It's easy to understand and implement.

45
00:02:43,189 --> 00:02:47,347
It can be applied to any model because
after you fill in your data you just feed

46
00:02:47,347 --> 00:02:51,980
it into any algorithm you have; you don't
have to modify anything, so that's great.

47
00:02:51,980 --> 00:02:52,929
And it can be used at prediction time
because whenever you hit a question mark

48
00:02:52,929 --> 00:02:53,732
you fill it in in the same way that
you did with the training data.

49
00:02:53,732 --> 00:03:00,243
So if you have a question mark for
term you just fill in three years,

50
00:03:00,243 --> 00:03:05,840
a three-year loan, just like you
did in the training data.

51
00:03:05,840 --> 00:03:07,510
So that's great.
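
One way to picture this in Python: learn the fill values once from the training data, then reuse exactly those values at prediction time. This is only a sketch; the function names, the "?" marker, and the tiny training set are hypothetical.

```python
from statistics import mode

MISSING = "?"  # assumption: missing entries are marked with a question mark


def learn_fill_values(rows, columns):
    """Compute one fill value per column from the training rows only."""
    fills = {}
    for col in columns:
        observed = [row[col] for row in rows if row[col] != MISSING]
        fills[col] = mode(observed)
    return fills


def apply_fill_values(row, fills):
    """Replace missing entries in one row with the stored training fills."""
    return {
        col: (fills.get(col, val) if val == MISSING else val)
        for col, val in row.items()
    }


# Hypothetical training data with categorical features.
training_rows = [
    {"credit": "excellent", "term": "3 yrs"},
    {"credit": "fair", "term": "?"},
    {"credit": "fair", "term": "3 yrs"},
    {"credit": "poor", "term": "5 yrs"},
    {"credit": "?", "term": "3 yrs"},
]
fills = learn_fill_values(training_rows, ["credit", "term"])

# At prediction time, a new application with an unknown term
# gets the same value that was used on the training data.
new_application = {"credit": "excellent", "term": "?"}
completed = apply_fill_values(new_application, fills)
```

The fill values come from the training rows alone, so a question mark seen at prediction time is treated the same way as one seen during training.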

52
00:03:07,510 --> 00:03:11,870
However, imputation like this, especially
the simple imputation that I described,

53
00:03:11,870 --> 00:03:15,840
can be extremely problematic,
because it introduces a bias.

54
00:03:15,840 --> 00:03:18,875
For every question mark in term,
we'll put in three years.

55
00:03:18,875 --> 00:03:23,370
We don't use any other information;
we plug in the same value.

56
00:03:23,370 --> 00:03:28,010
And this could result in
really bad systematic errors.

57
00:03:28,010 --> 00:03:30,400
So let's step back and take an example.

58
00:03:30,400 --> 00:03:35,890
I live in the state of Washington in
the US, and let's say that in the state of

59
00:03:35,890 --> 00:03:42,180
Washington it's illegal to put
the age into the loan application.

60
00:03:42,180 --> 00:03:45,440
That means that if the loan applications
come from the state of Washington,

61
00:03:45,440 --> 00:03:47,500
age is always a question mark.

62
00:03:47,500 --> 00:03:51,670
In other states maybe not a question mark,
but age is always a question mark here.

63
00:03:51,670 --> 00:03:55,980
If you train a model
across the United States,

64
00:03:55,980 --> 00:03:58,610
everybody from the state of Washington
will have age question mark.

65
00:03:58,610 --> 00:04:02,250
You're going to fill in the average
age into the application.

66
00:04:02,250 --> 00:04:05,490
Let's say, 40, then you're going to
believe that everybody in the state of

67
00:04:05,490 --> 00:04:07,744
Washington who is applying for
a loan is age 40.

68
00:04:09,220 --> 00:04:11,880
And that's going to
introduce a systematic bias

69
00:04:11,880 --> 00:04:14,290
into the loan applications
in the state of Washington.

70
00:04:14,290 --> 00:04:18,570
And that's going to lead to
all sorts of weird behavior,

71
00:04:18,570 --> 00:04:21,540
unhappy people, bad predictions, bad idea.
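
A small Python sketch of this scenario. The applications and ages below are made up for illustration, chosen so the observed average comes out to 40 as in the example; None stands in for the question mark.

```python
from statistics import mean

# Hypothetical applications: age is recorded everywhere except Washington.
applications = [
    {"state": "CA", "age": 28},
    {"state": "CA", "age": 52},
    {"state": "OR", "age": 35},
    {"state": "OR", "age": 45},
    {"state": "WA", "age": None},
    {"state": "WA", "age": None},
    {"state": "WA", "age": None},
]

# Average over the observed ages only (all of them from other states).
observed_ages = [a["age"] for a in applications if a["age"] is not None]
average_age = mean(observed_ages)

# Simple imputation: every missing age gets the same average.
imputed = [
    dict(a, age=average_age if a["age"] is None else a["age"])
    for a in applications
]

# Every Washington applicant now appears to have the same age.
wa_ages = {a["age"] for a in imputed if a["state"] == "WA"}
```

After imputation, the set of Washington ages contains a single value, so the model sees every Washington applicant as having the same age no matter what their real age is.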

72
00:04:23,070 --> 00:04:26,850
So imputation like this has its pluses but

73
00:04:26,850 --> 00:04:33,548
it's also a complicated idea because
it can introduce terrible biases.

74
00:04:33,548 --> 00:04:38,140
So in the third part of this module, we're
going to talk about an alternative that

75
00:04:38,140 --> 00:04:41,650
can address some of the challenges
of the first two methods.

76
00:04:41,650 --> 00:04:45,089
[MUSIC]