Hi. In every competition we need to preprocess the given dataset and generate new features from existing ones. This is often required just to stay on the same track as other competitors, and sometimes careful feature preprocessing and efficient feature engineering can give you the edge you strive to achieve. Thus, in the next videos we will cover a very useful topic: basic feature preprocessing and basic feature generation for different types of features. Namely, we will go through numeric features, categorical features, datetime features and coordinate features, and in the last video we will discuss missing values. Besides that, we will also discuss how preprocessing and generation depend on the model we're going to use. So the broad goal of the next videos is to help you acquire these highly required skills.

To get an idea of the following topics, let's start with an example of data similar to what we may encounter in a competition and take a look at the well-known Titanic dataset. It stores data about the people who were on the Titanic liner during its last trip. Here we have a typical dataframe to work with in competitions: each row represents a person and each column is a feature. We have different kinds of features here. For example, the values in the Survived column are either 0 or 1, so the feature is binary, and by the way, it is what we need to predict in this task; it is our target. Age and Fare are numeric features. SibSp and Parch are counts, and Sex and Embarked are categorical features. Ticket is just an ID, and Name is text.

So indeed, we have different feature types here, but do we understand why we should care about different features having different types? There are two main reasons: the strong connection between preprocessing and our model, and the common feature generation methods for each feature type.

First, let's discuss feature preprocessing. Most of the time we cannot just take our features, fit our favorite model and expect it to get great results. Each type of feature has its own ways of being preprocessed in order to improve the quality of the model; in other words, the choice of preprocessing method depends on the model we're going to use. For example, let's suppose that the target has a non-linear dependency on the Pclass feature: a Pclass value of 1 usually leads to a target of 1, 2 leads to 0, and 3 leads to 1 again. Clearly, because this is not a linear dependency, a linear model won't get a good result here. So, in order to improve the linear model's quality, we would want to preprocess the Pclass feature in some way, for example with so-called one-hot encoding, which will replace our feature with three new binary features, one for each of the Pclass values. The linear model will fit much better now than in the previous case. However, a random forest does not require this feature to be transformed at all: a random forest can easily put each Pclass value in a separate branch and predict fine probabilities. So, that was an example of preprocessing.

The second reason why we should be aware of different feature types is to ease the generation of new features. Different feature types have their own common feature generation methods, and knowing them gives you the ability to improve your model. Also, understanding the basics of feature generation will aid you greatly in the upcoming advanced feature topics of our course. As in the first point, understanding the model can help us create useful features.
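As a brief aside, here is a minimal sketch of the one-hot encoding just mentioned, assuming a pandas dataframe with a Pclass column like the Titanic one; the toy values below are made up purely for illustration.

```python
import pandas as pd

# Toy stand-in for the Titanic data; the values are illustrative only.
df = pd.DataFrame({"Pclass":   [1, 2, 3, 1, 3],
                   "Survived": [1, 0, 1, 1, 1]})

# One-hot encode Pclass: one new binary column per class value, so a linear
# model can give each class its own weight and capture the 1 -> 1, 2 -> 0,
# 3 -> 1 pattern that a single numeric Pclass column cannot express linearly.
dummies = pd.get_dummies(df["Pclass"], prefix="Pclass")
df = pd.concat([df.drop(columns="Pclass"), dummies], axis=1)
print(df)
```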
Let me show you an example. Say we have to predict the number of apples a shop will sell each day next week, and we already have a couple of months of sales history as training data. Let's say there is an obvious linear trend throughout the data and we want to inform the model about it. To give you a visual example, we prepared a second table with the last days from train and the first days from test. One way to help the model utilize the linear trend is to add a feature indicating the number of weeks passed. With this feature, a linear model can successfully learn the existing linear dependency. On the other hand, a gradient boosted decision tree will use this feature to calculate something like the mean target value for each week. Here I calculated the mean values manually and printed them in the dataframe. We're going to predict the number of apples for the sixth week; note that this week is not present in the training data.

So let's plot how a gradient boosted decision tree will treat the week feature. As we do not train the gradient boosted decision tree on the sixth week, it will not put splits between the fifth and the sixth weeks. Then, when we ask it to predict the numbers for the sixth week, the model will end up using the value from the fifth week. As we can see, unfortunately, it makes no use of the linear trend here. And vice versa, we can come up with an example of a generated feature that is beneficial for a decision tree but useless for a linear model. So this example shows us that our approach to feature generation should rely on an understanding of the employed model.

To summarize: first, feature preprocessing is a necessary instrument you have to use to adapt data to your model. Second, feature generation is a very powerful technique which can aid you significantly in competitions and sometimes provide you the required edge. And at last, both feature preprocessing and feature generation depend on the model you are going to use. So these three topics, in connection to feature types, will be the general theme of the next videos. We will thoroughly examine the most frequent methods, which you will be able to incorporate in your solutions. Good luck.
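Finally, here is a minimal sketch of the apple-sales example above. The week numbers and sales figures are invented, and a single DecisionTreeRegressor stands in for the gradient boosted trees discussed in the video, but the effect is the same: the linear model extrapolates the trend to the unseen sixth week, while the tree reuses the last value it saw.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Five weeks of toy training history with a clean linear trend.
week  = np.arange(1, 6).reshape(-1, 1)   # the "week number" feature
sales = np.array([10, 20, 30, 40, 50])   # apples sold per week (made up)

linear = LinearRegression().fit(week, sales)
tree   = DecisionTreeRegressor().fit(week, sales)

# Predict the unseen sixth week.
print(linear.predict([[6]]))  # extrapolates the trend, roughly 60
print(tree.predict([[6]]))    # falls back to the fifth week's value, 50
```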