1 00:00:03,813 --> 00:00:08,290 Often we have to deal with missing values in our data. 2 00:00:08,290 --> 00:00:14,982 They can look like NaNs (not-a-number values), empty strings, or outliers like -999. 3 00:00:14,982 --> 00:00:18,583 Sometimes they can contain useful information by themselves, 4 00:00:18,583 --> 00:00:22,840 like the reason a missing value occurred there in the first place. 5 00:00:22,840 --> 00:00:24,320 How to use them effectively and 6 00:00:24,320 --> 00:00:26,640 how to engineer new features from them 7 00:00:26,640 --> 00:00:28,430 will be the topic of this video. 8 00:00:29,520 --> 00:00:33,080 So what kind of information might missing values contain? 9 00:00:33,080 --> 00:00:34,840 What can they look like? 10 00:00:34,840 --> 00:00:38,050 Let's take a look at the missing values in the Springleaf competition. 11 00:00:39,870 --> 00:0042,382 This is a matrix of samples and features. 12 00:00:42,382 --> 00:00:47,858 Each feature was manually reviewed to find the missing values in each column. 13 00:00:47,858 --> 00:00:55,290 These values could be NaN, an empty string, -1, 99, and so on. 14 00:00:55,290 --> 00:01:01,345 For example, how can we find out that -1 is in fact a missing value? 15 00:01:01,345 --> 00:01:03,849 We could draw a histogram and 16 00:01:03,849 --> 00:01:09,980 see that this variable has a uniform distribution between 0 and 1, 17 00:01:09,980 --> 00:01:13,870 and that it has a small peak of -1 values. 18 00:01:13,870 --> 00:01:20,433 So if there are no NaNs there, we can assume the missing values were replaced by -1. 19 00:01:20,433 --> 00:01:25,020 Or the feature distribution plot can look like the second figure. 20 00:01:26,220 --> 00:01:29,050 Note that the x-axis has a log scale. 21 00:01:29,050 --> 00:01:33,641 In this case, the NaNs were probably filled with the feature's mean value. 22 00:01:33,641 --> 00:01:37,829 You can easily generalize this logic to apply to other cases. 23 00:01:39,190 --> 00:01:44,950 Okay, from this example we just learned that missing values can be hidden from us. 24 00:01:44,950 --> 00:01:49,290 And by hidden I mean replaced by some other value besides NaN. 25 00:01:50,520 --> 00:01:53,940 Great, let's talk about missing value imputation. 26 00:01:53,940 --> 00:01:56,370 The most frequent approaches are, first, 27 00:01:56,370 --> 00:02:00,912 replacing NaN with some value outside the feature's value range. 28 00:02:00,912 --> 00:02:04,740 Second, replacing NaN with the mean or median. 29 00:02:04,740 --> 00:02:07,720 And third, trying to reconstruct the value somehow. 30 00:02:08,980 --> 00:02:11,680 The first method is useful in that it gives 31 00:02:11,680 --> 00:02:15,910 trees the possibility to put missing values into a separate category. 32 00:02:15,910 --> 00:02:21,540 The downside is that the performance of linear models and neural networks can suffer. 33 00:02:22,632 --> 00:02:26,740 The second method is usually beneficial for simple linear models and neural networks. 34 00:02:26,740 --> 00:02:31,520 But again, for trees it can be harder to identify the objects which had missing values 35 00:02:31,520 --> 00:02:32,480 in the first place. 36 00:02:33,740 --> 00:02:37,070 Let's set feature value reconstruction aside for now, and 37 00:02:37,070 --> 00:02:39,120 turn to feature generation for a moment. 38 00:02:40,410 --> 00:02:44,660 The concern we have just discussed can be addressed by adding a new feature, 39 00:02:44,660 --> 00:02:49,350 isnull, indicating which rows have missing values for this feature.
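Here is a minimal sketch of these imputation options in pandas; the DataFrame df and the column name 'f' are hypothetical stand-ins, not names from the lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a numeric feature with missing values.
df = pd.DataFrame({'f': [0.2, np.nan, 0.7, np.nan, 0.5]})

# Indicator feature first, so the information isn't lost after filling.
df['f_isnull'] = df['f'].isnull().astype(int)

# Option 1: a value outside the feature range (tree-friendly).
df['f_fixed'] = df['f'].fillna(-999)

# Option 2: the median (friendlier for linear models and neural networks).
df['f_median'] = df['f'].fillna(df['f'].median())
```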
40 00:02:50,490 --> 00:02:52,840 This can solve the problems with trees and 41 00:02:52,840 --> 00:02:56,300 neural networks while still imputing with the mean or median. 42 00:02:56,300 --> 00:03:00,840 But the downside is that we will double the number of columns in the data set. 43 00:03:02,140 --> 00:03:05,160 Now back to missing value imputation methods. 44 00:03:05,160 --> 00:03:06,510 The third one, and 45 00:03:06,510 --> 00:03:11,680 the last one we will discuss here, is to reconstruct each value if possible. 46 00:03:11,680 --> 00:03:16,900 One example of such a possibility is having missing values in a time series. 47 00:03:16,900 --> 00:03:19,970 For example, we could have the daily temperature for 48 00:03:19,970 --> 00:03:24,950 a month, but several values in the middle of the month are missing. 49 00:03:24,950 --> 00:03:29,270 Well, of course, we can approximate them using nearby observations. 50 00:03:29,270 --> 00:03:33,880 But obviously, this kind of opportunity is rarely the case. 51 00:03:33,880 --> 00:03:38,450 In the most typical scenario, the rows of our data set are independent, 52 00:03:38,450 --> 00:03:42,430 and we usually will not find any proper logic to reconstruct them. 53 00:03:44,030 --> 00:03:49,075 Great, so far we have already learned that we can construct a new feature, 54 00:03:49,075 --> 00:03:53,600 isnull, indicating which rows contain NaNs. 55 00:03:55,110 --> 00:03:58,529 What other important points about feature generation should we know? 56 00:03:59,550 --> 00:04:02,800 Well, there's one general concern about 57 00:04:02,800 --> 00:04:06,840 generating new features from a feature with missing values. 58 00:04:06,840 --> 00:04:10,780 That is, if we do this, we should be very careful about 59 00:04:10,780 --> 00:04:13,480 replacing missing values before our feature generation. 60 00:04:13,480 --> 00:04:19,900 To illustrate this, let's imagine we have a year-long data set with two features: 61 00:04:19,900 --> 00:04:24,090 a datetime feature and temperature, which has missing values. 62 00:04:24,090 --> 00:04:26,320 We can see all of this in the figure. 63 00:04:28,040 --> 00:04:32,960 Now we fill the missing values with some value, for example with the median. 64 00:04:32,960 --> 00:04:39,030 If we have data over the whole year, the median will probably be near zero, so 65 00:04:39,030 --> 00:04:40,680 it should look like that. 66 00:04:40,680 --> 00:04:44,980 Now we want to add a feature like the difference between the temperature today and 67 00:04:44,980 --> 00:04:46,770 yesterday, so let's do this. 68 00:04:48,210 --> 00:04:49,290 As we can see, 69 00:04:49,290 --> 00:04:53,980 near the missing values this difference will usually be abnormally huge. 70 00:04:53,980 --> 00:04:56,890 And this can mislead our model. 71 00:04:56,890 --> 00:05:01,816 But hey, we already know that we can sometimes approximate missing values here 72 00:05:01,816 --> 00:05:04,780 by interpolating nearby points, great. 73 00:05:04,780 --> 00:05:09,250 But unfortunately, we usually don't have enough time to be so careful here. 74 00:05:09,250 --> 00:05:10,726 And more importantly, 75 00:05:10,726 --> 00:05:16,050 these problems can occur in cases where we can't come up with such a specific solution.
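A small sketch of this pitfall, assuming a hypothetical winter week of daily temperatures: filling with a year-long median (near 0, as in the lecture's figure) before taking day-to-day differences creates huge jumps, while interpolating nearby points does not.

```python
import numpy as np
import pandas as pd

# Hypothetical winter week with a gap of two missing days.
temp = pd.Series([-20.0, -19.5, np.nan, np.nan, -18.0, -17.5])

# Median-style fill (0 here, as for a year-long series): the day-to-day
# difference spikes around the gap and can mislead the model.
diff_median = temp.fillna(0.0).diff()
# -> [nan, 0.5, 19.5, 0.0, -18.0, 0.5]

# Time-series reconstruction: linear interpolation between neighbors
# keeps the differences plausible.
diff_interp = temp.interpolate().diff()
# -> [nan, 0.5, 0.5, 0.5, 0.5, 0.5]
```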
76 00:05:17,210 --> 00:05:20,270 Let's review another example of missing value imputation, 77 00:05:20,270 --> 00:05:22,830 which will be discussed in detail later 78 00:05:22,830 --> 00:05:25,050 in the advanced features topic. 79 00:05:26,080 --> 00:05:29,650 Here we have a data set with independent rows. 80 00:05:29,650 --> 00:05:33,978 And we want to encode the categorical feature using the numeric feature. 81 00:05:33,978 --> 00:05:39,020 To achieve that, we calculate the mean value of the numeric feature for 82 00:05:39,020 --> 00:05:42,930 every category, and replace the categories with these mean values. 83 00:05:44,000 --> 00:05:48,200 What happens if we fill the NaNs in the numeric feature 84 00:05:48,200 --> 00:05:52,790 with some value outside the feature range, like -999? 85 00:05:54,200 --> 00:05:58,726 As we can see, all the encoded values are pulled closer to -999. 86 00:05:58,726 --> 00:06:03,531 And the more rows corresponding to a particular category have missing 87 00:06:03,531 --> 00:06:04,310 values, 88 00:06:04,310 --> 00:06:08,925 the closer its mean value will be to -999. 89 00:06:08,925 --> 00:06:14,860 The same is true if we fill the missing values with the mean or median of the feature. 90 00:06:14,860 --> 00:06:18,617 This kind of missing value imputation can definitely screw up the feature we 91 00:06:18,617 --> 00:06:20,420 are constructing. 92 00:06:20,420 --> 00:06:23,750 The way to handle this particular case is to simply 93 00:06:23,750 --> 00:06:27,510 ignore the missing values while calculating the means for each category. 94 00:06:28,630 --> 00:06:32,480 Again, let me repeat the idea of these two examples. 95 00:06:33,560 --> 00:06:38,190 You should be very careful with early NaN imputation if you want to generate new 96 00:06:38,190 --> 00:06:39,790 features.
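A minimal sketch of this fix, with hypothetical columns 'cat' and 'num': pandas skips NaNs when computing group means, so the trick is simply to compute the encoding before any NaN filling.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cat': ['A', 'A', 'B', 'B'],
    'num': [1.0, np.nan, 2.0, 4.0],
})

# Wrong order: filling first drags the category means towards -999.
filled = df['num'].fillna(-999)
bad_means = filled.groupby(df['cat']).mean()   # A -> -499.0, B -> 3.0

# Right order: groupby().mean() ignores NaNs by default,
# so the encoding uses only the observed values.
good_means = df.groupby('cat')['num'].mean()   # A -> 1.0, B -> 3.0

df['cat_mean_encoded'] = df['cat'].map(good_means)
```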
97 00:06:39,790 --> 00:06:42,570 There's one more interesting thing about missing values. 98 00:06:43,823 --> 00:06:46,476 XGBoost can handle NaNs directly, and 99 00:06:46,476 --> 00:06:50,090 sometimes using this approach can change the score drastically. 100 00:06:51,220 --> 00:06:53,320 Besides the common approaches we have discussed, 101 00:06:53,320 --> 00:06:56,400 sometimes we can treat outliers as missing values. 102 00:06:56,400 --> 00:07:01,293 For example, if we have some easy classification task with songs which 103 00:07:01,293 --> 00:07:06,932 are thought to be composed even before ancient Rome, or maybe in the year 2025, 104 00:07:06,932 --> 00:07:10,249 we can try to treat these outliers as missing values. 105 00:07:11,280 --> 00:07:12,980 If you have categorical features, 106 00:07:12,980 --> 00:07:16,640 sometimes it can be beneficial to handle the missing values, or the 107 00:07:16,640 --> 00:07:21,970 categories which are present in the test data but not in the train data. 108 00:07:21,970 --> 00:07:26,010 The intention for doing so appeals to the fact that a model which didn't see 109 00:07:26,010 --> 00:07:30,148 that category in the train data will essentially treat it randomly. 110 00:07:30,148 --> 00:07:35,180 Here, frequency encoding of categorical features can be of help. 111 00:07:35,180 --> 00:07:40,460 As we already discussed in our course, we can change categories to their frequencies, 112 00:07:40,460 --> 00:07:44,560 and thus handle categories unseen in the train data based on their frequency. 113 00:07:45,740 --> 00:07:49,010 Let's walk through the example on the slide. 114 00:07:49,010 --> 00:07:53,316 There you can see values of the categorical feature that do not appear in the train data. 115 00:07:53,316 --> 00:07:56,310 Let's generate a new feature indicating the 116 00:07:56,310 --> 00:07:58,640 number of occurrences of each value in the data. 117 00:07:59,940 --> 00:08:03,040 We will name this feature categorical_encoded. 118 00:08:03,040 --> 00:08:07,608 Value A has six occurrences in train and test combined, and 119 00:08:07,608 --> 00:08:12,684 thus the value of the new feature for A will be equal to 6. 120 00:08:12,684 --> 00:08:16,480 The same works for values B, C, and D. 121 00:08:16,480 --> 00:08:22,000 Note that now the new feature's values for D and C are equal to each other. 122 00:08:22,000 --> 00:08:25,960 And if there is some dependence between the target and the number of occurrences for 123 00:08:25,960 --> 00:08:30,090 each category, our model will be able to successfully utilize that. 124 00:08:31,260 --> 00:08:35,060 To conclude this video, let's review the main points we have discussed. 125 00:08:36,510 --> 00:08:41,260 The choice of method to fill NaNs depends on the situation. 126 00:08:41,260 --> 00:08:43,760 Sometimes you can reconstruct missing values. 127 00:08:43,760 --> 00:08:49,020 But usually it is easier to replace them with a value outside the 128 00:08:49,020 --> 00:08:53,520 feature range, like -999, or to replace them with the mean or median. 129 00:08:54,890 --> 00:08:59,930 Also, missing values may already have been replaced with something by the competition organizers. 130 00:09:01,520 --> 00:09:05,940 In this case, if you want to know the exact rows which have missing values, 131 00:09:05,940 --> 00:09:09,890 you can investigate this by browsing histograms. 132 00:09:09,890 --> 00:09:14,497 Moreover, the model can improve its results using the binary feature isnull, which 133 00:09:14,497 --> 00:09:17,210 indicates which rows have missing values. 134 00:09:18,350 --> 00:09:22,680 In general, avoid replacing missing values before feature generation, 135 00:09:22,680 --> 00:09:26,530 because it can decrease the usefulness of the features. 136 00:09:26,530 --> 00:09:30,400 And in the end, XGBoost can handle NaNs directly, 137 00:09:30,400 --> 00:09:33,330 which sometimes can change the score for the better. 138 00:09:34,620 --> 00:09:37,569 Using the knowledge you have gained from our discussion, 139 00:09:37,569 --> 00:09:40,275 you should now be able to identify missing values, 140 00:09:40,275 --> 00:09:42,910 describe the main methods to handle them, and 141 00:09:42,910 --> 00:09:46,798 apply this knowledge to gain an edge in your next competition. 142 00:09:46,798 --> 00:09:50,864 Try these methods in 143 00:09:50,864 --> 00:09:55,887 different scenarios and 144 00:09:55,887 --> 00:10:01,874 for sure, you will succeed.
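As a starting point for those experiments, here is a minimal sketch of the frequency encoding walked through on the slide; the train and test DataFrames and the 'cat' column are hypothetical stand-ins for the slide's data.

```python
import pandas as pd

train = pd.DataFrame({'cat': ['A'] * 4 + ['B'] * 3 + ['C'] * 2})
test  = pd.DataFrame({'cat': ['A'] * 2 + ['B'] * 2 + ['D'] * 2})

# Count occurrences over train and test combined, so a category that
# appears only in the test data (like D) still gets a meaningful value.
freq = pd.concat([train['cat'], test['cat']]).value_counts()

train['categorical_encoded'] = train['cat'].map(freq)
test['categorical_encoded']  = test['cat'].map(freq)
# A -> 6, B -> 5, C -> 2, D -> 2: C and D share the same encoded value,
# so the model can exploit a target/frequency dependence even for D.
```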