Hi. In this video, we will discuss basic feature generation approaches for datetime and coordinate features. They both differ significantly from numeric and categorical features: because we can interpret the meaning of datetimes and coordinates, we can come up with specific feature generation ideas, which we'll discuss here. Now, let's start with datetime.

Datetime is quite a distinct kind of feature, because it doesn't behave like an ordinary number, and it has several different tiers, like year, day, or week. Most new features generated from datetime can be divided into two categories: first, time moments in a period, and second, time passed since a particular event.

The first one is very simple. We can add features like second, minute, hour, day of the week, day of the month, day of the year, and so on and so forth. This is useful for capturing repetitive patterns in the data. If we know about some uncommon periods which influence the data, we can add them as well. For example, if we are to predict the efficiency of a medication, but patients receive pills once every three days, we can consider this three-day cycle as a special time period.
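A minimal sketch of these period features, assuming a pandas DataFrame `df` with a parsed datetime column named `date` (the data and column names below are invented for illustration):

```python
import pandas as pd

# Toy data: one row per day (invented for illustration).
df = pd.DataFrame({"date": pd.date_range("2014-01-01", periods=10, freq="D")})

# Time moments within a period, extracted via the .dt accessor.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["week_day"] = df["date"].dt.dayofweek      # 0 = Monday ... 6 = Sunday
df["day_of_month"] = df["date"].dt.day
df["day_of_year"] = df["date"].dt.dayofyear

# A custom period, like the pills-every-three-days example:
# position of each date inside a repeating three-day cycle.
df["three_day_cycle"] = (df["date"] - df["date"].min()).dt.days % 3
```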
Okay, now: time passed since a particular event. This event can be either row-independent or row-dependent. In the first case, we just calculate the time passed from one common moment for all the data, for example, from the year 2000. Here, all samples become comparable with each other on one time scale.

In the second variant, the date we count from depends on the sample we are calculating the feature for. For example, if we are to predict sales in a shop, like in the Rossmann Store Sales competition, we can add the number of days passed since the last holiday, weekend, or sales campaign, or maybe the number of days left until these events.

After adding these features, our dataframe can look like this. date is obviously a date, and sales is the target of this task, while the other columns are generated features. The week_day feature indicates which day of the week it is; day_number_since_2014 indicates how many days have passed since January 1st, 2014; is_holiday is a binary feature indicating whether this day is a holiday; and days_till_holidays indicates how many days are left before the closest holiday.

Sometimes we have several datetime columns in our data. The most straightforward idea here is to subtract one feature from another, or perhaps to subtract features generated the way we have just discussed: a time moment inside a period, or time passed since a row-independent event.

One simple example of such generation can be found in the churn prediction task. Basically, churn prediction is about estimating the likelihood that customers will churn. We may get a valuable feature here by subtracting the user's registration date from the date of some action of theirs, like purchasing a product or calling customer service. We can see how this works in this dataframe: for every user, we know last_purchase_date and last_call_date, and we add the difference between them as a new feature named date_diff. For clarity, let's take a look at the figure. For every user, we have their last_purchase_date and their last_call_date, so we can add a date_diff feature which indicates the number of days between these events.

Note that after generating features from datetime, you usually will get either numeric features, like time passed since the year 2000, or categorical features, like day of the week. These new features need to be treated accordingly, with the necessary preprocessing we have discussed earlier.
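A short sketch of both kinds of "time since an event" features plus the date_diff idea, with invented dates and column names (the holiday lookup assumes a later holiday always exists in the list; real code would need bounds checks):

```python
import pandas as pd

# Invented example data for two users.
df = pd.DataFrame({
    "last_purchase_date": pd.to_datetime(["2014-03-01", "2014-03-10"]),
    "last_call_date":     pd.to_datetime(["2014-02-20", "2014-03-15"]),
})

# Row-independent event: days passed since one global moment for all rows.
origin = pd.Timestamp("2014-01-01")
df["day_number_since_2014"] = (df["last_purchase_date"] - origin).dt.days

# Row-dependent event: days until the closest holiday from a given list
# (no bounds handling here, for brevity).
holidays = pd.DatetimeIndex(["2014-03-08", "2014-05-01"]).sort_values()
idx = holidays.searchsorted(df["last_purchase_date"].values)
next_holiday = pd.Series(holidays[idx], index=df.index)
df["days_till_holiday"] = (next_holiday - df["last_purchase_date"]).dt.days

# Difference between two datetime columns, as in the churn example.
df["date_diff"] = (df["last_purchase_date"] - df["last_call_date"]).dt.days
```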
Now, having discussed feature generation for datetime, let's move on to feature generation for coordinates. Let's imagine that we're trying to estimate real estate prices, like in the Deloitte competition named Western Australia Rental Prices, or in the Sberbank Russian Housing Market competition.

Generally, you can calculate distances to important points on the map. Take a look at this map. If you have additional data with infrastructural buildings, you can add as features the distance to the nearest shop, to the second-closest hospital, to the best school in the neighborhood, and so on.

If you do not have such data, you can extract interesting points on the map from your train and test data. For example, you can divide the map into squares with a grid and, within each square, find the most expensive flat; then, for every other object in that square, add the distance to that flat. Or you can organize your data points into clusters and then use the centers of the clusters as such important points, as sketched in the code below. Or, yet another way: you can find some special areas, like an area with very old buildings, and add the distance to it.

Another major approach to using coordinates is to calculate aggregated statistics for the area surrounding an object. This can include the number of flats around a particular point, which can then be interpreted as the popularity of the area, or the mean realty price, which will indicate how expensive the area around the selected point is. Both distances and aggregated statistics are often useful in tasks with coordinates.

One more trick you need to know about coordinates: if you train decision trees on them, you can add slightly rotated coordinates as new features, and this will help the model make more precise splits on the map. It can be hard to know which exact rotation to make, so we may want to add all rotations by 45 or 22.5 degrees.

Let's look at the next example, of realty price prediction. Here, a street divides the area into two parts: the high-priced district above the street, and the low-priced district below it. If the street is slightly rotated relative to the coordinate axes, trees will need a lot of splits to separate the two districts. But if we add new coordinates in which these two districts can be divided by a single split, this will hugely facilitate the tree-building process.
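Going back to extracted points and area statistics for a moment: here is a minimal sketch of both, assuming numeric x and y coordinate columns and a price column (all data, names, and parameters below are invented; KMeans centers and a fixed 1.0 radius are just one reasonable choice):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

# Invented data: 100 objects with coordinates and a price.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.uniform(0, 10, size=(100, 2)), columns=["x", "y"])
df["price"] = rng.uniform(50, 500, size=100)
coords = df[["x", "y"]].to_numpy()

# Important points extracted from the data itself: KMeans cluster centers.
centers = KMeans(n_clusters=5, n_init=10, random_state=0).fit(coords).cluster_centers_

# Distance from every object to its nearest cluster center.
dists = np.linalg.norm(coords[:, None, :] - centers[None, :, :], axis=2)
df["dist_to_center"] = dists.min(axis=1)

# Aggregated statistics of the surrounding area within a fixed radius:
# how many objects are nearby (a proxy for popularity) and their mean price.
nn = NearestNeighbors(radius=1.0).fit(coords)
neighbors = nn.radius_neighbors(coords, return_distance=False)
df["flats_around"] = [len(ix) - 1 for ix in neighbors]  # exclude the object itself
df["mean_price_around"] = [df["price"].iloc[ix].mean() for ix in neighbors]
```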
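And a small sketch of the rotation trick itself, adding coordinates rotated by 22.5 and 45 degrees as extra columns (again with invented data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [0.0, 1.0, 2.0], "y": [1.0, 0.5, 2.0]})  # toy coordinates

# Rotate the coordinate system by a few fixed angles and keep the results
# as extra features, so axis-parallel tree splits can match rotated borders.
for angle in (22.5, 45.0):
    theta = np.radians(angle)
    df[f"x_rot{angle}"] = df["x"] * np.cos(theta) + df["y"] * np.sin(theta)
    df[f"y_rot{angle}"] = df["y"] * np.cos(theta) - df["x"] * np.sin(theta)
```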
Great! We have just summarized the most frequent methods used for feature generation from datetime and coordinates. For datetime, these are: applying periodicity, calculating the time passed since a particular event, and adding differences between two datetime features. For coordinates, we should recall extracting interesting points from the train and test data, using places from additional data, calculating distances to the centers of clusters, and adding aggregated statistics for the surrounding area.

Knowing how to effectively handle datetime and coordinates, as well as numeric and categorical features, will provide you with a reliable way to improve your score, and will help you devise that specific part of the solution which is often required to beat the very top scores.