Often we have to deal with missing values in our data. They can look like NaNs, empty strings, or outliers like -999. Sometimes they carry useful information by themselves: what was the reason this value went missing? How do we use them effectively? How do we engineer new features from them? That is the topic of this video.

So what kind of information can missing values contain, and what can they look like? Let's take a look at the missing values in the Springleaf competition. Here we have a matrix of samples and features. If we manually review each feature, we can find the missing values in each column. They can be NaNs, empty strings, -1, 99, and so on. For example, how can we find out that -1 is the missing value? We can draw a histogram and see that the variable has a uniform distribution between 0 and 1, with a small additional peak at -1. If there are no NaNs in the column, we can assume they were replaced with -1; a short code sketch of this check follows below. Or the feature distribution can look like the second figure (note that the x axis has a log scale). In this case, the NaNs were probably filled with the feature's mean value. You can easily generalize this logic to other cases. So in this example we have just learned that missing values can be hidden from us, and by hidden I mean replaced by some value other than NaN.

Great, let's talk about missing value imputation. The most common approaches are: first, replacing NaN with some value outside the feature's value range; second, replacing NaN with the mean or median; and third, trying to reconstruct the value somehow. The first method is useful because it gives trees the possibility to put missing values into a separate category; the downside is that the performance of linear models and neural networks can suffer. The second method is usually beneficial for simple linear models and neural networks, but for trees it becomes harder to select the objects which had missing values in the first place.

Let's leave feature value reconstruction for now and turn to feature generation for a moment. The concern we have just discussed can be addressed by adding a new binary feature, isnull, indicating which rows have a missing value for this feature. This can solve the problem for trees and neural networks when we impute with the mean or median; the downside is that we double the number of columns in the dataset. A small sketch of these imputation options and the isnull feature also follows below.

Now back to missing value imputation methods. The third one, and the last one we will discuss here, is to reconstruct each value if possible. One example of such a possibility is missing values in a time series. For instance, we could have the temperature for every day of a month, but several values in the middle of the month are missing. Of course, we can approximate them using nearby observations. But obviously this kind of opportunity is rarely available: in the most typical scenario the rows of our dataset are independent, and we usually will not find any proper logic to reconstruct them.

Great, so far we have learned that we can construct a new feature, isnull, indicating which rows contain NaNs. What else should we know about feature generation? There is one general concern about generating new features from a feature with missing values: if we do this, we should be very careful about replacing the missing values before feature generation.
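As a small aside, here is a minimal sketch of the histogram check mentioned above. The DataFrame and column name are made up for illustration; the point is that an artificial spike at a single value (here -1) outside the bulk of the distribution is the tell-tale sign of hidden missing values.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy data: a feature that is uniform on [0, 1], but whose missing
# entries were silently replaced with -1 (made-up example).
rng = np.random.default_rng(0)
values = rng.uniform(0, 1, size=1000)
values[rng.random(1000) < 0.05] = -1  # the "hidden" missing values
df = pd.DataFrame({"feature": values})

# A histogram makes the artificial peak at -1 easy to spot.
df["feature"].hist(bins=50)
plt.xlabel("feature value")
plt.ylabel("count")
plt.show()

# Count how many rows sit exactly at the suspicious value.
print((df["feature"] == -1).sum())
```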
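And here is a minimal sketch of the two imputation options together with the isnull indicator, again with made-up data and column names; it only shows the pandas calls, not a definitive recipe.

```python
import numpy as np
import pandas as pd

# Made-up column with a few NaNs.
df = pd.DataFrame({"temperature": [21.0, np.nan, 19.5, np.nan, 23.0]})

# A binary isnull feature, created before any imputation, keeps the
# information about which rows were missing in the first place.
df["temperature_isnull"] = df["temperature"].isnull().astype(int)

# Option 1: a value far outside the feature range (convenient for trees).
df["temperature_fixed"] = df["temperature"].fillna(-999)

# Option 2: the median (usually friendlier for linear models and NNs).
df["temperature_median"] = df["temperature"].fillna(df["temperature"].median())

print(df)
```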
To illustrate this, let's imagine we have a year-long dataset with two features: a datetime feature and a temperature feature that has missing values. We can see all of this in the figure. Now we fill the missing values with some value, for example the median. If we have data over the whole year, the median will probably be near zero, so it will look like in the figure. Now we want to add a feature like the difference between the temperature today and yesterday. Let's do this. As we can see, near the missing values this difference is usually abnormally huge, and this can mislead our model. But hey, we already know that we can sometimes approximate missing values, here by interpolating between nearby points, great. Unfortunately, we usually don't have enough time to be so careful, and more importantly, these problems can occur in cases where we can't come up with such a specific solution.

Let's review another example of missing value imputation, which will be discussed in detail later in the advanced feature engineering topic. Here we have a dataset with independent rows, and we want to encode a categorical feature with a numeric feature. To achieve that, we calculate the mean value of the numeric feature for every category and replace the categories with these mean values. What happens if we fill the NaNs in the numeric feature with some value outside the feature range, like -999? As we can see, all the category means will be drawn closer to -999, and the more rows of a particular category have missing values, the closer its mean value will be to -999. The same is true if we fill the missing values with the mean or median of the feature. This kind of missing value imputation can definitely screw up the feature we are constructing. The way to handle this particular case is to simply ignore the missing values while calculating the means for each category; a small sketch follows below. Again, let me repeat the idea of these two examples: you should be very careful with early NaN imputation if you want to generate new features.

There is one more interesting thing about missing values: XGBoost can handle NaNs, and sometimes using this approach can change the score drastically. Besides the common approaches we have discussed, sometimes we can treat outliers as missing values. For example, if we have some simple classification task with songs which are supposedly composed before ancient Rome, or maybe in the year 2025, we can try to treat these outliers as missing values.

If you have categorical features, sometimes it can be beneficial to treat as missing values those categories which are present in the test data but not in the train data. The intuition is that a model which didn't see a category in the train data will essentially treat it randomly. Here, frequency encoding of categorical features can help: we can change categories to their frequencies, so that categories unseen during training are still encoded in a meaningful way, based on their frequency. Let's walk through the example on the slide; a code sketch follows below as well. There you can see that some values of the categorical feature do not appear in the train data. Let's generate a new feature indicating the number of occurrences in the data (train and test combined); we will name it categorical_encoded. Value A has six occurrences in train and test combined, so the value of the new feature for A will be equal to 6. The same works for values B, D, and C. Note that the new feature values for D and C are now equal to each other, and if there is some dependence between the target and the number of occurrences of each category, our model will be able to successfully utilize it.
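To make the mean-encoding pitfall above concrete, here is a minimal sketch with made-up data. Filling the NaNs with -999 before computing per-category means drags every mean toward -999, while leaving the NaNs in place lets pandas ignore them when averaging.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "numeric":  [1.0, 2.0, np.nan, 10.0, np.nan, np.nan],
})

# Filling NaNs with -999 first: the means are pulled toward -999, and the
# more missing values a category has, the stronger the pull.
filled = df.assign(numeric=df["numeric"].fillna(-999))
print(filled.groupby("category")["numeric"].mean())
# A   -332.000000
# B   -662.666667

# Ignoring NaNs while computing the means (pandas skips them by default).
print(df.groupby("category")["numeric"].mean())
# A     1.5
# B    10.0
```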
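The occurrence-count encoding from the slide can be sketched like this. The train and test frames are made up, chosen only so that A occurs six times overall and D and C end up with equal counts, as described above; categories that appear only in the test set still get a meaningful value.

```python
import pandas as pd

train = pd.DataFrame({"cat": ["A", "A", "A", "A", "B", "B", "B"]})
test  = pd.DataFrame({"cat": ["A", "A", "B", "D", "D", "C", "C"]})

# Count occurrences of every category over train and test together.
counts = pd.concat([train["cat"], test["cat"]]).value_counts()

# Categories unseen in train (here D and C) still receive a count, so the
# model can exploit a dependence between target and category frequency.
train["cat_encoded"] = train["cat"].map(counts)
test["cat_encoded"] = test["cat"].map(counts)
print(counts)
print(test)
```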
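Finally, a minimal sketch of simply relying on XGBoost's built-in handling of NaNs instead of imputing them, assuming the xgboost Python package and a toy dataset invented for the example.

```python
import numpy as np
import xgboost as xgb

# Toy data with NaNs left in place; XGBoost learns a default direction
# for missing values at each split, so no imputation is needed here.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.2] = np.nan
y = (np.nan_to_num(X[:, 0]) > 0).astype(int)

model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)
print(model.predict(X[:5]))
```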
To conclude this video, let's overview the main points we have discussed.

The choice of method to fill NaNs depends on the situation. Sometimes you can reconstruct the missing values, but usually it is easier to replace them with a value outside the feature range, like -999, or with the mean or median. Also, missing values may already have been replaced with something by the organizers; in this case, if you want to know exactly which rows have missing values, you can investigate by browsing histograms. Moreover, the model can improve its results using the binary feature isnull, which indicates which rows have missing values. In general, avoid replacing missing values before feature generation, because it can decrease the usefulness of the features. And in the end, XGBoost can handle NaNs directly, which sometimes changes the score for the better.

Using the knowledge you have gained from this discussion, you should now be able to identify missing values, describe the main methods to handle them, and apply this knowledge to gain an edge in your next competition. Try these methods in different scenarios and, for sure, you will succeed.