1 00:00:00,000 --> 00:00:04,738 [MUSIC] 2 00:00:04,738 --> 00:00:09,312 In this module we'll cover a couple of basic strategies for dealing with missing data, 3 00:00:09,312 --> 00:00:13,320 and then we'll cover a modification to the decision tree algorithm. 4 00:00:13,320 --> 00:00:17,060 We're actually going to be able to deal with missing data in a much smarter way. 5 00:00:18,120 --> 00:00:22,340 Now the most basic, most common way of dealing with missing data 6 00:00:22,340 --> 00:00:24,170 is what's called purification by skipping. 7 00:00:24,170 --> 00:00:28,390 So I'm just going to throw out the missing data. 8 00:00:28,390 --> 00:00:31,450 So I start with a data set where, for some data points, 9 00:00:31,450 --> 00:00:34,450 some of the feature values are missing. 10 00:00:34,450 --> 00:00:38,300 And somehow, by skipping some of these features or some of these data points, 11 00:00:38,300 --> 00:00:46,240 I end up with a data set and an output h(x) where nothing is missing. 12 00:00:46,240 --> 00:00:47,120 Everything's observed. 13 00:00:48,420 --> 00:00:51,540 So skipping data, purification by skipping data, 14 00:00:51,540 --> 00:00:53,990 is the most obvious thing that you might want to do. 15 00:00:53,990 --> 00:00:57,730 So if I have nine data points over here, and 16 00:00:57,730 --> 00:01:02,930 three of them have missing data, these are the three rows here. 17 00:01:02,930 --> 00:01:08,070 Then I could just say, okay, there are only three missing, not too bad. 18 00:01:08,070 --> 00:01:09,850 I'm just going to skip them. 19 00:01:09,850 --> 00:01:13,600 And so I'm going to take my 9 and 20 00:01:13,600 --> 00:01:19,650 decrease it to just 6 and call that my data set. 21 00:01:19,650 --> 00:01:23,550 And if you only have a few missing values, maybe this is an okay thing to do. 22 00:01:24,657 --> 00:01:31,830 Skipping data points with missing values, however, can be a problematic idea. 23 00:01:31,830 --> 00:01:32,820 So, for example, 24 00:01:32,820 --> 00:01:37,880 in this case, you have the term feature missing in a lot of different data points. 25 00:01:37,880 --> 00:01:42,920 In fact, in six out of nine data points the term feature is missing. 26 00:01:42,920 --> 00:01:47,845 So if I were just to skip those, I'd go from a data set with 27 00:01:47,845 --> 00:01:52,590 nine data points to a data set with only three data points. 28 00:01:52,590 --> 00:01:55,340 So it'll become much, much smaller. 29 00:01:55,340 --> 00:02:00,480 And that's really bad because term here is missing in more than 50% of the data. 30 00:02:00,480 --> 00:02:07,000 So if you look, we go down from 9 to a much, much smaller value. 31 00:02:07,000 --> 00:02:10,860 And when that happens, it 32 00:02:10,860 --> 00:02:14,101 makes your training much worse because you have much less data. 33 00:02:15,230 --> 00:02:17,520 And so, in these cases, 34 00:02:17,520 --> 00:02:21,610 if you just have one feature which has lots of missing values, another simple 35 00:02:21,610 --> 00:02:25,230 approach is to skip features instead of skipping data points, and 36 00:02:25,230 --> 00:02:29,062 now instead of having fewer data points you just have fewer features. 37 00:02:29,062 --> 00:02:33,900 [BLANK AUDIO] So that's a reasonable alternative in this case. 38 00:02:34,910 --> 00:02:38,630 So there are two basic kinds of skipping that you might want to do 39 00:02:38,630 --> 00:02:40,210 when you have missing data. 
40 00:02:40,210 --> 00:02:46,010 You can either skip data points that have missing data or 41 00:02:46,010 --> 00:02:49,200 skip features that have missing data. 42 00:02:49,200 --> 00:02:54,470 And somehow you have to make a decision of whether to skip data points, 43 00:02:54,470 --> 00:02:56,600 skip features, or skip some data points and 44 00:02:56,600 --> 00:03:00,070 some features, and that's a kind of complicated decision to make. 45 00:03:00,070 --> 00:03:03,330 In general, this idea of skipping is good because it's easy, 46 00:03:03,330 --> 00:03:06,030 it just takes your data set and simplifies it a bunch. 47 00:03:06,030 --> 00:03:09,230 It can be applied to any algorithm because you just simplify the data and 48 00:03:09,230 --> 00:03:13,820 feed it to any algorithm, but it has some challenges. 49 00:03:13,820 --> 00:03:15,270 Now, removing data or 50 00:03:15,270 --> 00:03:20,610 removing features is always a kind of painful thing; data is important. 51 00:03:20,610 --> 00:03:25,440 You don't want to do that, and it's often unclear whether you should remove features 52 00:03:25,440 --> 00:03:28,330 or remove data points, and what impact it will have on your answer if you do. 53 00:03:30,880 --> 00:03:35,760 Most fundamentally, even if you do all of this skipping at training time, 54 00:03:35,760 --> 00:03:40,590 at prediction time, if you see a question mark, what do you do? 55 00:03:40,590 --> 00:03:45,100 This approach does not address missing data at prediction time. 56 00:03:45,100 --> 00:03:48,860 And so, people use this approach all the time. 57 00:03:48,860 --> 00:03:53,840 And I'm okay with it if you just have one case here or there. 58 00:03:53,840 --> 00:03:56,530 But it's a pretty dangerous approach to take. 59 00:03:56,530 --> 00:04:00,699 I don't fully recommend skipping as an approach to dealing with missing data. 60 00:04:00,699 --> 00:04:05,059 [MUSIC]
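A minimal sketch of the two kinds of skipping described in this lecture, assuming the data lives in a pandas DataFrame. The column names (credit, term, income, safe) mimic the loan example used in the lecture, but the exact values below are made up for illustration; the lecture itself does not show code.

import pandas as pd

# Toy loan data set in the spirit of the lecture's example: 9 data points,
# with the "term" feature missing (None) in 6 of them.
data = pd.DataFrame({
    "credit": ["excellent", "fair", "fair", "poor", "excellent", "fair", "poor", "poor", "fair"],
    "term":   [3, None, None, 5, None, None, 3, None, None],
    "income": ["high", "low", "high", "high", "low", "low", "high", "low", "high"],
    "safe":   [+1, -1, +1, -1, +1, -1, -1, +1, -1],
})

# Kind 1: skip data points (rows) that have any missing value.
# Here this shrinks the data set from 9 data points to 3.
skip_rows = data.dropna(axis=0)

# Kind 2: skip features (columns) that have any missing value.
# Here this keeps all 9 data points but drops the "term" column.
skip_features = data.dropna(axis=1)

print(len(data), "->", len(skip_rows), "data points after skipping rows")
print(list(data.columns), "->", list(skip_features.columns), "after skipping features")

Note that neither call does anything about a question mark that shows up at prediction time, which is the limitation the lecture ends on.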