1 00:00:00,000 --> 00:00:04,917 [MUSIC] 2 00:00:04,917 --> 00:00:08,226 The second approach we see when dealing with missing data is purification by 3 00:00:08,226 --> 00:00:09,900 what's called imputation. 4 00:00:09,900 --> 00:00:12,410 It's filling in the missing values, the question marks, 5 00:00:12,410 --> 00:00:15,030 with some best guesses of what might have happened. 6 00:00:16,940 --> 00:00:18,180 I'm personally not a hoarder. 7 00:00:18,180 --> 00:00:19,740 I don't collect a lot of things. 8 00:00:19,740 --> 00:00:22,400 I don't really care about stuff that much. 9 00:00:22,400 --> 00:00:27,750 But when it comes to data, I really feel bad about throwing it away. 10 00:00:27,750 --> 00:00:29,960 I don't want to throw away data. 11 00:00:29,960 --> 00:00:34,330 And so going down from nine data points to six data points, this pains me. 12 00:00:35,630 --> 00:00:36,930 I don't think it's a really good idea. 13 00:00:38,650 --> 00:00:41,380 Imputation is an alternative to this. 14 00:00:41,380 --> 00:00:45,040 It's to say, okay, instead of throwing away those question marks, 15 00:00:45,040 --> 00:00:48,855 let's try to get a best guess of what that question mark value might be and 16 00:00:48,855 --> 00:00:50,890 just fill in those values. 17 00:00:50,890 --> 00:00:53,260 And this is the kind of approach you might take. 18 00:00:53,260 --> 00:00:56,300 You take your input data, which might have some missing values in it. 19 00:00:56,300 --> 00:00:59,820 You take your best guess at filling in those missing values, and 20 00:00:59,820 --> 00:01:02,770 now you have a data set where everything has been filled in. 21 00:01:04,020 --> 00:01:09,130 For example, in the nine-row data set here we have these three question marks. 22 00:01:10,200 --> 00:01:14,730 And we could say the term is unknown for these three question marks. 23 00:01:14,730 --> 00:01:20,420 Then I fill in those values with my best guess, which might be a three-year loan.
24 00:01:22,170 --> 00:01:24,558 For the three values over here. 25 00:01:24,558 --> 00:01:26,555 Why did I choose a three-year loan to fill it in? 26 00:01:26,555 --> 00:01:30,693 Well, if you look at the original data set, 27 00:01:30,693 --> 00:01:36,723 there were four three-year loans versus two five-year loans, 28 00:01:36,723 --> 00:01:39,810 so three-year here was my best guess. 29 00:01:41,560 --> 00:01:44,310 And you can think about it as my best simple guess. 30 00:01:45,590 --> 00:01:47,370 So I just took a simple approach: 31 00:01:47,370 --> 00:01:49,420 whatever was most popular, I filled those in. 32 00:01:50,920 --> 00:01:55,404 So the way that imputation might work in this simple approach, 33 00:01:55,404 --> 00:02:00,141 the rule might say: for categorical data like excellent, fair, 34 00:02:00,141 --> 00:02:02,860 poor, three-year, five-year, 35 00:02:02,860 --> 00:02:07,380 you just put in the most popular value, which is called the mode of the distribution. 36 00:02:07,380 --> 00:02:08,380 For numerical data, 37 00:02:10,200 --> 00:02:14,330 I would suggest you put in either the average or the median value. 38 00:02:15,710 --> 00:02:17,805 Now, these are just simple heuristics. 39 00:02:17,805 --> 00:02:22,820 There are many more advanced and interesting ways to impute missing values. 40 00:02:22,820 --> 00:02:25,560 There's something called expectation-maximization, 41 00:02:25,560 --> 00:02:31,240 or the EM algorithm, which is an algorithm that does this in a very interesting way. 42 00:02:31,240 --> 00:02:33,950 But we just describe a very simple approach in this course. 43 00:02:35,860 --> 00:02:40,400 Addressing missing data by imputation has advantages and disadvantages. 44 00:02:40,400 --> 00:02:43,189 It's easy to understand and implement.
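The simple rule described in this part of the lecture (mode for categorical columns, average or median for numerical ones) could be sketched roughly as follows. The nine-row term column mirrors the lecture's example of four three-year loans, two five-year loans, and three unknowns. The income numbers, the use of None to stand for a question mark, and the helper names impute_mode and impute_median are assumptions made purely for illustration.

```python
from collections import Counter
from statistics import median


def impute_mode(values):
    """Replace each None with the most common observed value (the mode)."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def impute_median(values):
    """Replace each None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Nine rows as in the lecture: four 3-year, two 5-year, three unknown.
term = ["3 yrs", "5 yrs", None, "3 yrs", None, "3 yrs", "5 yrs", None, "3 yrs"]

# Hypothetical numerical column (income in thousands), not from the lecture.
income = [40, None, 60, 80, None]

filled_term = impute_mode(term)        # every None becomes "3 yrs"
filled_income = impute_median(income)  # every None becomes the median, 60
```

Swapping statistics.median for statistics.mean would give the "average" variant the lecture mentions; which of the two is preferable depends on how skewed the column is.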
45 00:02:43,189 --> 00:02:47,347 It can be applied to any model, because after you fill in your data you just feed 46 00:02:47,347 --> 00:02:51,980 it into any algorithm you have. You don't have to modify anything, so that's great. 47 00:02:51,980 --> 00:02:52,929 And it can be used at prediction time, because whenever you hit a question mark 48 00:02:52,929 --> 00:02:53,732 you fill it in the same way that you did with the training data. 49 00:02:53,732 --> 00:03:00,243 So if you have a question mark for term, you just fill it in with three years, 50 00:03:00,243 --> 00:03:05,840 a three-year loan, just like you did in the training data. 51 00:03:05,840 --> 00:03:07,510 So that's great. 52 00:03:07,510 --> 00:03:11,870 However, imputation like this, especially the simple imputation that I described, 53 00:03:11,870 --> 00:03:15,840 can be extremely problematic, because it introduces a bias. 54 00:03:15,840 --> 00:03:18,875 For every question mark in term, we put in three years. 55 00:03:18,875 --> 00:03:23,370 We don't use any other information; we plug in the same value. 56 00:03:23,370 --> 00:03:28,010 And this could result in really bad systematic errors. 57 00:03:28,010 --> 00:03:30,400 So let's step back and take an example. 58 00:03:30,400 --> 00:03:35,890 I live in the state of Washington in the US, and let's say that in the state of 59 00:03:35,890 --> 00:03:42,180 Washington it's illegal to put the age into the loan application. 60 00:03:42,180 --> 00:03:45,440 That means that if the loan applications come from the state of Washington, 61 00:03:45,440 --> 00:03:47,500 age is always a question mark. 62 00:03:47,500 --> 00:03:51,670 In other states maybe it's not a question mark, but age is always a question mark here. 63 00:03:51,670 --> 00:03:55,980 If you train a model across the United States, 64 00:03:55,980 --> 00:03:58,610 everybody from the state of Washington will have a question mark for age.
65 00:03:58,610 --> 00:04:02,250 You're going to fill in the average age in the application. 66 00:04:02,250 --> 00:04:05,490 Let's say 40. Then you're going to believe that everybody in the state of 67 00:04:05,490 --> 00:04:07,744 Washington who is applying for a loan is age 40. 68 00:04:09,220 --> 00:04:11,880 And that's going to introduce a systematic bias 69 00:04:11,880 --> 00:04:14,290 into the loan applications in the state of Washington. 70 00:04:14,290 --> 00:04:18,570 And that's going to lead to all sorts of weird behavior, 71 00:04:18,570 --> 00:04:21,540 unhappy people, bad predictions, bad idea. 72 00:04:23,070 --> 00:04:26,850 So imputation like this has its pluses, but 73 00:04:26,850 --> 00:04:33,548 it's also a complicated idea, because it can introduce terrible biases. 74 00:04:33,548 --> 00:04:38,140 So in the third part of this module, we're going to talk about an alternative that 75 00:04:38,140 --> 00:04:41,650 can address some of the challenges of the first two methods. 76 00:04:41,650 --> 00:04:45,089 [MUSIC]
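To make the Washington example concrete, here is a rough sketch of how a fill value learned once from training data gets reused at prediction time, and how that pins every Washington applicant to the same age. The state codes, the ages, and the helper names fit_mean_age and impute_age are all made up for illustration; only the idea that Washington ages are always missing and the average comes out to 40 is taken from the lecture.

```python
from statistics import mean

# Hypothetical training rows: (state, age). Age is never recorded for "WA".
train = [
    ("CA", 30), ("CA", 50),
    ("NY", 35), ("NY", 45),
    ("WA", None), ("WA", None),
]


def fit_mean_age(rows):
    """Compute the fill value from the observed (non-missing) ages only."""
    observed = [age for _, age in rows if age is not None]
    return mean(observed)


def impute_age(rows, fill):
    """Replace every missing age with the same precomputed fill value."""
    return [(state, fill if age is None else age) for state, age in rows]


# The fill value is learned once, from the training data.
fill_age = fit_mean_age(train)
train_filled = impute_age(train, fill_age)

# At prediction time, new applicants are filled in with that same value.
new_applicants = [("WA", None), ("CA", 62)]
predict_filled = impute_age(new_applicants, fill_age)

# Every Washington applicant, in training and at prediction, ends up
# with one identical age: the systematic bias the lecture warns about.
wa_ages = {age for state, age in train_filled + predict_filled if state == "WA"}
```

In this toy data the set wa_ages contains a single value, the overall average, no matter how old the Washington applicants really are.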