Hi. In this video, we will discuss basic feature generation approaches for datetime and coordinate features. They both differ significantly from numeric and categorical features: because we can interpret the meaning of datetimes and coordinates, we can come up with specific feature generation ideas, which we'll discuss here. Now, let's start with datetime.

Datetime is quite a distinct kind of feature, because it doesn't behave like an ordinary number, and it has several different tiers, like year, day, or week. Most new features generated from datetime can be divided into two categories: first, time moments in a period, and second, time passed since a particular event.

The first one is very simple. We can add features like second, minute, hour, day of the week, day of the month, day of the year, and so on and so forth. This is useful for capturing repetitive patterns in the data. If we know about some uncommon periods which influence the data, we can add them as well. For example, if we are to predict the efficiency of a medication, but patients receive pills once every three days, we can consider this three-day cycle as a special time period.
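A minimal sketch of these period features, assuming a pandas DataFrame `df` with a parsed datetime column named `date` (the data and column names below are invented for illustration):

```python
import pandas as pd

# Toy data: one row per day (invented for illustration).
df = pd.DataFrame({"date": pd.date_range("2014-01-01", periods=10, freq="D")})

# Time moments within a period, extracted via the .dt accessor.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["week_day"] = df["date"].dt.dayofweek      # 0 = Monday ... 6 = Sunday
df["day_of_month"] = df["date"].dt.day
df["day_of_year"] = df["date"].dt.dayofyear

# A custom period, like the pills-every-three-days example:
# position of each date inside a repeating three-day cycle.
df["three_day_cycle"] = (df["date"] - df["date"].min()).dt.days % 3
```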
Okay, now: time passed since a particular event. This event can be either row-independent or row-dependent. In the first case, we just calculate the time passed from one common moment for all the data, for example, from the year 2000. Here, all samples become comparable with each other on one time scale.

In the second variant, the date we count from depends on the sample we are calculating the feature for. For example, if we are to predict sales in a shop, like in the Rossmann Store Sales competition, we can add the number of days passed since the last holiday, weekend, or sales campaign, or maybe the number of days left until these events.

After adding these features, our dataframe can look like this. date is obviously a date, and sales is the target of this task, while the other columns are generated features. The week_day feature indicates which day of the week it is; day_number_since_2014 indicates how many days have passed since January 1st, 2014; is_holiday is a binary feature indicating whether this day is a holiday; and days_till_holidays indicates how many days are left before the closest holiday.

Sometimes we have several datetime columns in our data. The most straightforward idea here is to subtract one feature from another, or perhaps to subtract features generated the way we have just discussed: a time moment inside a period, or time passed since a row-independent event.

One simple example of such generation can be found in the churn prediction task. Basically, churn prediction is about estimating the likelihood that customers will churn. We may get a valuable feature here by subtracting the user's registration date from the date of some action of theirs, like purchasing a product or calling customer service. We can see how this works in this dataframe: for every user, we know last_purchase_date and last_call_date, and we add the difference between them as a new feature named date_diff. For clarity, let's take a look at the figure. For every user, we have their last_purchase_date and their last_call_date, so we can add a date_diff feature which indicates the number of days between these events.

Note that after generating features from datetime, you usually will get either numeric features, like time passed since the year 2000, or categorical features, like day of the week. These new features need to be treated accordingly, with the necessary preprocessing we have discussed earlier.
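A short sketch of both kinds of "time since an event" features plus the date_diff idea, with invented dates and column names (the holiday lookup assumes a later holiday always exists in the list; real code would need bounds checks):

```python
import pandas as pd

# Invented example data for two users.
df = pd.DataFrame({
    "last_purchase_date": pd.to_datetime(["2014-03-01", "2014-03-10"]),
    "last_call_date":     pd.to_datetime(["2014-02-20", "2014-03-15"]),
})

# Row-independent event: days passed since one global moment for all rows.
origin = pd.Timestamp("2014-01-01")
df["day_number_since_2014"] = (df["last_purchase_date"] - origin).dt.days

# Row-dependent event: days until the closest holiday from a given list
# (no bounds handling here, for brevity).
holidays = pd.DatetimeIndex(["2014-03-08", "2014-05-01"]).sort_values()
idx = holidays.searchsorted(df["last_purchase_date"].values)
next_holiday = pd.Series(holidays[idx], index=df.index)
df["days_till_holiday"] = (next_holiday - df["last_purchase_date"]).dt.days

# Difference between two datetime columns, as in the churn example.
df["date_diff"] = (df["last_purchase_date"] - df["last_call_date"]).dt.days
```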
Now, having discussed feature generation for datetime, let's move on to feature generation for coordinates. Let's imagine that we're trying to estimate real estate prices, like in the Deloitte competition named Western Australia Rental Prices, or in the Sberbank Russian Housing Market competition.

Generally, you can calculate distances to important points on the map. Take a look at this map. If you have additional data with infrastructural buildings, you can add as features the distance to the nearest shop, to the second-closest hospital, to the best school in the neighborhood, and so on.

If you do not have such data, you can extract interesting points on the map from your train and test data. For example, you can divide the map into squares with a grid and, within each square, find the most expensive flat; then, for every other object in that square, add the distance to that flat. Or you can organize your data points into clusters and then use the centers of the clusters as such important points, as sketched in the code below. Or, yet another way: you can find some special areas, like an area with very old buildings, and add the distance to it.

Another major approach to using coordinates is to calculate aggregated statistics for the area surrounding an object. This can include the number of flats around a particular point, which can then be interpreted as the popularity of the area, or the mean realty price, which will indicate how expensive the area around the selected point is. Both distances and aggregated statistics are often useful in tasks with coordinates.

One more trick you need to know about coordinates: if you train decision trees on them, you can add slightly rotated coordinates as new features, and this will help the model make more precise splits on the map. It can be hard to know which exact rotation to make, so we may want to add all rotations by 45 or 22.5 degrees.

Let's look at the next example, of realty price prediction. Here, a street divides the area into two parts: the high-priced district above the street, and the low-priced district below it. If the street is slightly rotated relative to the coordinate axes, trees will need a lot of splits to separate the two districts. But if we add new coordinates in which these two districts can be divided by a single split, this will hugely facilitate the tree-building process.
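Going back to extracted points and area statistics for a moment: here is a minimal sketch of both, assuming numeric x and y coordinate columns and a price column (all data, names, and parameters below are invented; KMeans centers and a fixed 1.0 radius are just one reasonable choice):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

# Invented data: 100 objects with coordinates and a price.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.uniform(0, 10, size=(100, 2)), columns=["x", "y"])
df["price"] = rng.uniform(50, 500, size=100)
coords = df[["x", "y"]].to_numpy()

# Important points extracted from the data itself: KMeans cluster centers.
centers = KMeans(n_clusters=5, n_init=10, random_state=0).fit(coords).cluster_centers_

# Distance from every object to its nearest cluster center.
dists = np.linalg.norm(coords[:, None, :] - centers[None, :, :], axis=2)
df["dist_to_center"] = dists.min(axis=1)

# Aggregated statistics of the surrounding area within a fixed radius:
# how many objects are nearby (a proxy for popularity) and their mean price.
nn = NearestNeighbors(radius=1.0).fit(coords)
neighbors = nn.radius_neighbors(coords, return_distance=False)
df["flats_around"] = [len(ix) - 1 for ix in neighbors]  # exclude the object itself
df["mean_price_around"] = [df["price"].iloc[ix].mean() for ix in neighbors]
```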
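And a small sketch of the rotation trick itself, adding coordinates rotated by 22.5 and 45 degrees as extra columns (again with invented data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [0.0, 1.0, 2.0], "y": [1.0, 0.5, 2.0]})  # toy coordinates

# Rotate the coordinate system by a few fixed angles and keep the results
# as extra features, so axis-parallel tree splits can match rotated borders.
for angle in (22.5, 45.0):
    theta = np.radians(angle)
    df[f"x_rot{angle}"] = df["x"] * np.cos(theta) + df["y"] * np.sin(theta)
    df[f"y_rot{angle}"] = df["y"] * np.cos(theta) - df["x"] * np.sin(theta)
```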
Great! We have just summarized the most frequent methods used for feature generation from datetime and coordinates. For datetime, these are: applying periodicity, calculating the time passed since a particular event, and adding differences between two datetime features. For coordinates, we should recall extracting interesting points from the train and test data, using places from additional data, calculating distances to the centers of clusters, and adding aggregated statistics for the surrounding area.

Knowing how to effectively handle datetime and coordinates, as well as numeric and categorical features, will provide you with a reliable way to improve your score, and will help you devise that specific part of the solution which is often required to beat the very top scores.