Hi. In every competition, we need to preprocess the given dataset and generate new features from the existing ones. This is often required to stay on the same track as other competitors, and sometimes careful feature preprocessing and efficient feature engineering can give you the edge you strive to achieve. Thus, in the next videos, we will cover the very useful topic of basic feature preprocessing and basic feature generation for different types of features. Namely, we will go through numeric features, categorical features, datetime features, and coordinate features. And in the last video, we will discuss missing values. Besides that, we will also discuss how preprocessing and generation depend on the model we're going to use. So the broad goal of the next videos is to help you acquire these highly demanded skills.

To get an idea of the following topics, let's start with an example of data similar to what we may encounter in a competition, and take a look at the well-known Titanic dataset. It stores data about the people who were on the Titanic liner during its last voyage. Here we have a typical dataframe to work with in competitions: each row represents a person, and each column is a feature.

We have different kinds of features here. For example, the values in the Survived column are either 0 or 1, so the feature is binary. And by the way, it is what we need to predict in this task; it is our target. Age and Fare are numeric features. SibSp and Parch are counts, and Sex and Embarked are categorical features. Ticket is just an ID, and Name is text.

So indeed, we have different feature types here, but do we understand why we should care about features having different types? Well, there are two main reasons for it: the strong connection between preprocessing and our model, and the common feature generation methods for each feature type.

First, let's discuss feature preprocessing. Most of the time, we cannot just take our features as they are, fit our favorite model, and expect it to get great results. Each type of feature has its own ways to be preprocessed in order to improve the quality of the model. In other words, the choice of preprocessing method depends on the model we're going to use.
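To make these feature types concrete, here is a minimal sketch of inspecting the Titanic dataframe, assuming the standard train.csv from the Kaggle Titanic competition is available locally:

```python
import pandas as pd

# Load the Titanic training data (assumes Kaggle's train.csv is in the
# working directory).
df = pd.read_csv("train.csv")

# Column types at a glance: Survived is the binary target, Age and Fare
# are numeric, SibSp and Parch are counts, Sex and Embarked are
# categorical, Ticket is an ID, and Name is free text.
print(df.dtypes)
print(df[["Survived", "Pclass", "Age", "Fare", "Sex", "Embarked"]].head())
```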
For example, let's suppose that the target has a nonlinear dependency on the Pclass feature: a Pclass value of 1 usually leads to a target of 1, 2 leads to 0, and 3 leads to 1 again. Clearly, because this is not a linear dependency, a linear model won't get a good result here. So in order to improve the linear model's quality, we would want to preprocess the Pclass feature in some way, for example with so-called one-hot encoding, which will replace our feature with three binary ones, one for each of the Pclass values. The linear model will fit much better now than in the previous case.

However, a random forest does not require this feature to be transformed at all: a random forest can easily put each Pclass value in a separate leaf and predict fine probabilities. So, that was an example of preprocessing.

The second reason why we should be aware of different feature types is to ease the generation of new features. Different feature types have their own common feature generation methods, and knowing these methods gives you the ability to improve your model. Also, understanding the basics of feature generation will aid you greatly in the upcoming advanced feature topics of our course.

As in the first point, understanding the model here can help us create useful features. Let me show you an example. Say we have to predict the number of apples a shop will sell each day next week, and we already have a couple of months of sales history as training data. Let's consider that we have an obvious linear trend throughout the data, and we want to inform the model about it. To provide you with a visual example, we prepared a second table with the last days from train and the first days from test.

One way to help the model utilize the linear trend is to add a feature indicating the number of weeks passed. With this feature, a linear model can successfully find the existing linear dependency. On the other hand, a gradient boosted decision tree will use this feature to calculate something like the mean target value for each week. Here, I calculated the mean values manually and printed them in the dataframe; we're going to predict the number of apples for the sixth week. Note the linear trend we indeed have here.
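Here is a minimal sketch of this week-number feature on synthetic daily sales data; the column names `date` and `sales` and all the numbers are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic daily sales with a clear linear trend: five weeks of history.
rng = np.random.RandomState(0)
df = pd.DataFrame({"date": pd.date_range("2023-01-02", periods=35, freq="D")})
df["sales"] = 100 + 2.0 * np.arange(len(df)) + rng.normal(0, 3, len(df))

# Generated feature: number of weeks passed since the start of the history.
df["week"] = (df["date"] - df["date"].min()).dt.days // 7

# Mean target value for each week, as in the table described above.
print(df.groupby("week")["sales"].mean())

# A linear model picks up the trend from the week feature and can
# extrapolate to the unseen sixth week (week index 5).
model = LinearRegression().fit(df[["week"]], df["sales"])
print(model.predict(pd.DataFrame({"week": [5]})))
```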
So let's plot how a gradient boosted decision tree will treat the week feature. As we do not train the gradient boosted decision tree on the sixth week, it will not put splits between the fifth and the sixth weeks. Then, when we predict the numbers for the sixth week, the model will end up using the value from the fifth week. As we can see, unfortunately, it makes no use of the linear trend here (see the code sketch below). And vice versa, we can come up with an example of a generated feature that will be beneficial for a decision tree but useless for a linear model. So this example shows us that our approach to feature generation should rely on an understanding of the employed model.

To summarize this video: first, feature preprocessing is a necessary instrument you have to use to adapt data to your model. Second, feature generation is a very powerful technique which can aid you significantly in competitions and sometimes provide you with the required edge. And at last, both feature preprocessing and feature generation depend on the model you are going to use. So these three topics, in connection with feature types, will be the general theme of the next videos. We will thoroughly examine the most frequent methods, which you will be able to incorporate into your solutions. Good luck.
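To make the extrapolation point above concrete, here is a minimal sketch on the same kind of synthetic data, with all numbers made up for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Five weeks of training history: weeks 0..4, seven days each,
# with a linear upward trend of 14 sales per week.
week = np.repeat(np.arange(5), 7)
sales = 100 + 14.0 * week + np.random.RandomState(0).normal(0, 3, 35)

gbdt = GradientBoostingRegressor(random_state=0).fit(week.reshape(-1, 1), sales)

# The trees never saw week 5, so no split separates it from week 4:
# the prediction for the sixth week simply repeats the fifth week's level
# instead of continuing the trend.
print(gbdt.predict([[4]]), gbdt.predict([[5]]))  # nearly identical values
```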