1 00:00:00,031 --> 00:00:05,036 [SOUND] Hi. By this moment, we have already discussed all 2 00:00:05,036 --> 00:00:11,817 the basic things which build up to a big solution, like feature generation, 3 00:00:11,817 --> 00:00:15,800 validation, mean encodings, and so on. 4 00:00:15,800 --> 00:00:19,374 We went through several competitions together and 5 00:00:19,374 --> 00:00:24,204 tried our best to unite everything we learned into one huge framework. 6 00:00:24,204 --> 00:00:28,769 But as with any other set of tools, there are a lot of heuristics which 7 00:00:28,769 --> 00:00:32,471 people often find only with a trial-and-error approach, 8 00:00:32,471 --> 00:00:37,534 spending significant time on learning how to use these tools efficiently. 9 00:00:37,534 --> 00:00:39,216 So to help you out here, 10 00:00:39,216 --> 00:00:45,100 in this video we'll share things we learned the hard way, by experience. 11 00:00:45,100 --> 00:00:48,560 These things may vary from one person to another. 12 00:00:48,560 --> 00:00:53,140 So we decided that everyone in this course will present their own guidelines personally, 13 00:00:53,140 --> 00:00:57,770 to stress the possible diversity of approaches and 14 00:00:57,770 --> 00:01:00,650 to emphasize different points. 15 00:01:00,650 --> 00:01:04,150 Some notes might seem obvious to you, some may not. 16 00:01:04,150 --> 00:01:08,940 But be sure, following even some of them, or at least keeping them in mind, 17 00:01:08,940 --> 00:01:10,940 can save you a lot of time. 18 00:01:10,940 --> 00:01:13,242 So, let's start. 19 00:01:13,242 --> 00:01:16,992 When you want to enter a competition, define your goals and 20 00:01:16,992 --> 00:01:21,470 try to estimate what you can get out of your participation. 21 00:01:21,470 --> 00:01:25,000 You may want to learn more about an interesting problem. 22 00:01:25,000 --> 00:01:27,790 You may want to get acquainted with new software tools and 23 00:01:27,790 --> 00:01:32,320 packages, or you may want to try to hunt for a medal. 24 00:01:32,320 --> 00:01:37,450 Each of these goals will influence what competition you choose to participate in. 25 00:01:37,450 --> 00:01:40,380 If you want to learn more about an interesting problem, 26 00:01:40,380 --> 00:01:44,350 you may want the competition to have a wide discussion on the forums. 27 00:01:44,350 --> 00:01:49,950 For example, if you are interested in data science applications to medicine, 28 00:01:49,950 --> 00:01:55,210 you can try to predict lung cancer in the Data Science Bowl 2017, 29 00:01:55,210 --> 00:01:59,900 or to predict seizures in long-term human EEG recordings 30 00:01:59,900 --> 00:02:03,070 in the Melbourne University Seizure Prediction competition. 31 00:02:04,240 --> 00:02:07,200 If you want to get acquainted with new software tools, 32 00:02:07,200 --> 00:02:10,000 you may want the competition to have the required tutorials. 33 00:02:10,000 --> 00:02:13,530 For example, if you want to learn a neural network library, 34 00:02:13,530 --> 00:02:17,818 you may choose any of the competitions with images, like The Nature Conservancy 35 00:02:17,818 --> 00:02:19,880 Fisheries Monitoring competition 36 00:02:19,880 --> 00:02:24,560 or the Planet: Understanding the Amazon from Space competition. 37 00:02:24,560 --> 00:02:27,100 And if you want to try to hunt for 38 00:02:27,100 --> 00:02:32,260 a medal, you may want to check how many submissions participants have.
39 00:02:32,260 --> 00:02:36,260 And if the people at the top have over one hundred submissions, 40 00:02:36,260 --> 00:02:41,030 it can be a clear sign of a leaky problem or of difficulties with validation, 41 00:02:41,030 --> 00:02:45,380 including an inconsistency between validation and leaderboard scores. 42 00:02:45,380 --> 00:02:49,840 On the other hand, if there are people with few submissions at the top, 43 00:02:49,840 --> 00:02:54,810 that usually means there should be a non-trivial approach to this competition, 44 00:02:54,810 --> 00:02:58,100 one discovered by only a few people. 45 00:02:58,100 --> 00:03:03,030 Besides that, you may want to pay attention to the size of the top teams. 46 00:03:03,030 --> 00:03:07,210 If the leaderboard mostly consists of teams with only one participant, 47 00:03:07,210 --> 00:03:10,660 you'll probably have enough chances if you gather a good team. 48 00:03:11,750 --> 00:03:16,300 Now, let's move to the next step, after you have chosen a competition. 49 00:03:16,300 --> 00:03:19,060 As soon as you get familiar with the data, 50 00:03:19,060 --> 00:03:23,560 start to write down your ideas about what you may want to try later. 51 00:03:23,560 --> 00:03:25,320 What things could work here? 52 00:03:25,320 --> 00:03:27,840 What approaches may you want to take? 53 00:03:27,840 --> 00:03:32,860 After you're done, read the forums and highlight interesting posts and topics. 54 00:03:32,860 --> 00:03:37,130 Remember, you can get a lot of information and meet new people on the forums. 55 00:03:37,130 --> 00:03:42,640 So I strongly encourage you to participate in these discussions. 56 00:03:42,640 --> 00:03:45,000 After the initial pipeline is ready and 57 00:03:45,000 --> 00:03:49,710 you have written down a few ideas, you may want to start improving your solution. 58 00:03:49,710 --> 00:03:53,910 Personally, I like to organize these ideas into some structure. 59 00:03:53,910 --> 00:03:58,330 So you may want to sort the ideas in priority order. 60 00:03:58,330 --> 00:04:01,470 The most important and promising ones need to be implemented first. 61 00:04:02,520 --> 00:04:05,870 Or you may want to organize these ideas into topics: 62 00:04:05,870 --> 00:04:10,540 ideas about feature generation, validation, metric optimization, 63 00:04:10,540 --> 00:04:11,660 and so on. 64 00:04:11,660 --> 00:04:15,330 Now pick up an idea and implement it. 65 00:04:15,330 --> 00:04:17,571 Try to derive some insights on the way. 66 00:04:17,571 --> 00:04:22,790 Especially, try to understand why something does or doesn't work. 67 00:04:22,790 --> 00:04:23,583 For example, 68 00:04:23,583 --> 00:04:27,826 you have an idea about trying a deep gradient boosting decision tree model. 69 00:04:27,826 --> 00:04:30,420 To your joy, it works. 70 00:04:30,420 --> 00:04:32,500 Now, ask yourself why. 71 00:04:32,500 --> 00:04:36,060 Is there some hidden data structure we didn't notice before? 72 00:04:36,060 --> 00:04:41,260 Maybe you have categorical features with a lot of unique values. 73 00:04:41,260 --> 00:04:42,570 If this is the case, 74 00:04:42,570 --> 00:04:47,560 you may as well conclude that mean encodings could work great here. 75 00:04:47,560 --> 00:04:51,390 So in some sense, the ability to analyze your work and 76 00:04:51,390 --> 00:04:54,810 derive conclusions while you're trying out your ideas 77 00:04:54,810 --> 00:04:58,720 will get you on the right track to reveal hidden data patterns and leaks.
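As an illustration of that last conclusion, here is a minimal sketch of mean encoding for a high-cardinality categorical feature. The frame and column names are hypothetical, and in a real competition you would compute the training-set encoding out of fold to avoid target leakage:

    import pandas as pd

    def mean_encode(train, test, col, target, alpha=10.0):
        # Smoothed target mean per category: rare categories are
        # shrunk towards the global mean to reduce overfitting.
        global_mean = train[target].mean()
        stats = train.groupby(col)[target].agg(['mean', 'count'])
        smoothed = ((stats['count'] * stats['mean'] + alpha * global_mean)
                    / (stats['count'] + alpha))
        train[col + '_mean_enc'] = train[col].map(smoothed)
        # Categories unseen in train fall back to the global mean.
        test[col + '_mean_enc'] = test[col].map(smoothed).fillna(global_mean)
        return train, test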
78 00:04:59,830 --> 00:05:02,670 After you have checked the most important ideas, 79 00:05:02,670 --> 00:05:05,510 you may want to switch to parameter tuning. 80 00:05:05,510 --> 00:05:08,820 I personally like the view that everything is a parameter: 81 00:05:08,820 --> 00:05:13,280 from the number of features, to gradient boosting decision tree depth, 82 00:05:13,280 --> 00:05:16,620 from the number of layers in a convolutional neural network 83 00:05:16,620 --> 00:05:20,130 to the coefficient your final submission is multiplied by. 84 00:05:20,130 --> 00:05:22,240 To understand what I should tune and 85 00:05:22,240 --> 00:05:27,020 change first, I like to sort all parameters by the following principles. 86 00:05:27,020 --> 00:05:28,475 First, importance. 87 00:05:28,475 --> 00:05:32,850 Arrange parameters from important to not useful at all. 88 00:05:32,850 --> 00:05:34,560 Tune in this order. 89 00:05:34,560 --> 00:05:39,238 This may depend on the data structure, on the target, on the metric, and so on. 90 00:05:39,238 --> 00:05:41,842 Second, feasibility. 91 00:05:41,842 --> 00:05:47,758 Rate parameters from "easy to tune" to "tuning this can take forever". 92 00:05:47,758 --> 00:05:49,916 Third, understanding. 93 00:05:49,916 --> 00:05:55,370 Rate parameters from "I know what it is doing" to "I have no idea". 94 00:05:55,370 --> 00:05:59,420 Here it is important to understand what each parameter will change 95 00:05:59,420 --> 00:06:00,510 in the whole pipeline. 96 00:06:00,510 --> 00:06:04,430 For example, if you increase the number of features significantly, 97 00:06:04,430 --> 00:06:08,430 you may want to change the ratio of columns used to find the best split 98 00:06:08,430 --> 00:06:10,860 in a gradient boosting decision tree. 99 00:06:10,860 --> 00:06:14,551 Or, if you change the number of layers in a convolutional neural network, 100 00:06:14,551 --> 00:06:17,284 you will need more epochs to train it, and so on. 101 00:06:17,284 --> 00:06:22,945 So, these were some of my practical guidelines; 102 00:06:22,945 --> 00:06:27,169 I hope they will prove useful for you as well. 103 00:06:27,169 --> 00:06:30,890 >> Every problem starts with data loading and preprocessing. 104 00:06:30,890 --> 00:06:35,540 I usually don't pay much attention to suboptimal usage of computational 105 00:06:35,540 --> 00:06:40,740 resources, but this particular case is of crucial importance. 106 00:06:40,740 --> 00:06:45,030 Doing things right at the very beginning will make your life much simpler and 107 00:06:45,030 --> 00:06:49,670 will allow you to save a lot of time and computational resources. 108 00:06:49,670 --> 00:06:53,530 I usually start with basic data preprocessing, like label encoding of 109 00:06:53,530 --> 00:06:57,780 categories and joining additional data. 110 00:06:57,780 --> 00:07:05,380 Then I dump the resulting data into HDF5 or npy format: 111 00:07:05,380 --> 00:07:10,820 HDF5 for pandas DataFrames, and npy for NumPy arrays. 112 00:07:10,820 --> 00:07:15,321 Running experiments often requires a lot of kernel restarts, 113 00:07:15,321 --> 00:07:18,250 which leads to reloading all the data. 114 00:07:18,250 --> 00:07:23,146 And loading large CSV files may take minutes, while 115 00:07:23,146 --> 00:07:28,610 loading data from HDF5 or npy formats is performed in a matter of seconds. 116 00:07:29,870 --> 00:07:35,275 Another important matter is that, by default, pandas is known to store 117 00:07:35,275 --> 00:07:41,251 data in 64-bit arrays, which is unnecessary in most situations. 118 00:07:41,251 --> 00:07:48,330 Downcasting everything to 32 bits will result in a two-fold memory saving. 119 00:07:48,330 --> 00:07:54,280 Also keep in mind that pandas supports out-of-the-box data reading by chunks, 120 00:07:54,280 --> 00:07:58,530 via the chunksize parameter of the read_csv function. 121 00:07:58,530 --> 00:08:03,680 So most datasets may be processed without a lot of memory.
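A minimal sketch of this load-once-and-cache pattern (the file names are hypothetical; writing HDF5 from pandas requires the tables package):

    import numpy as np
    import pandas as pd

    # Read the raw CSV once and downcast 64-bit columns to 32 bits,
    # halving memory usage.
    train = pd.read_csv('train.csv')
    for col in train.columns:
        if train[col].dtype == np.float64:
            train[col] = train[col].astype(np.float32)
        elif train[col].dtype == np.int64:
            train[col] = train[col].astype(np.int32)

    # Cache to a fast binary format; reloading now takes seconds.
    train.to_hdf('train.h5', key='train', mode='w')
    # train = pd.read_hdf('train.h5', 'train')
    # For plain NumPy arrays, np.save('X.npy', X) / np.load('X.npy')
    # play the same role.

    # And if a dataset does not fit in memory, process it by chunks.
    for chunk in pd.read_csv('train.csv', chunksize=100_000):
        pass  # e.g., accumulate statistics chunk by chunk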
122 00:08:03,680 --> 00:08:09,313 When it comes to performance evaluation, I am not a big fan of extensive validation. 123 00:08:09,313 --> 00:08:15,665 Even for medium-sized datasets like 50,000 or 100,000 rows, 124 00:08:15,665 --> 00:08:20,235 you can validate your models with a simple train-test split 125 00:08:20,235 --> 00:08:23,412 instead of a full cross-validation loop. 126 00:08:23,412 --> 00:08:27,490 Switch to full CV only when it is really needed. 127 00:08:27,490 --> 00:08:30,940 For example, when you've already hit some limits and 128 00:08:30,940 --> 00:08:34,970 can move forward only with some marginal improvements. 129 00:08:34,970 --> 00:08:38,020 The same logic applies to the initial model choice. 130 00:08:38,020 --> 00:08:42,580 I usually start with LightGBM, find some reasonably good parameters, 131 00:08:42,580 --> 00:08:45,119 and evaluate the performance of my features. 132 00:08:46,290 --> 00:08:50,500 I want to emphasize that I use early stopping, so 133 00:08:50,500 --> 00:08:53,820 I don't need to tune the number of boosting iterations. 134 00:08:55,150 --> 00:08:59,980 And God forbid you start with SVMs, random forests, or 135 00:08:59,980 --> 00:09:05,520 neural networks: you will waste too much time just waiting for them to fit. 136 00:09:05,520 --> 00:09:08,780 I switch to tuning the models, ensembling, and 137 00:09:08,780 --> 00:09:12,700 stacking only when I am satisfied with my feature engineering. 138 00:09:13,760 --> 00:09:19,400 In some ways, I describe my approach as "fast and dirty, always better". 139 00:09:19,400 --> 00:09:23,010 Try focusing on what is really important: the data. 140 00:09:23,010 --> 00:09:25,661 Do EDA, try different features. 141 00:09:25,661 --> 00:09:28,577 Google domain-specific knowledge. 142 00:09:28,577 --> 00:09:31,020 Your code is secondary. 143 00:09:31,020 --> 00:09:33,210 Creating unnecessary classes and 144 00:09:33,210 --> 00:09:37,260 personal frameworks may only make things harder to change and 145 00:09:37,260 --> 00:09:43,170 will result in wasting your time, so keep things simple and reasonable. 146 00:09:43,170 --> 00:09:45,745 Don't track every little change. 147 00:09:45,745 --> 00:09:49,680 By the end of a competition, I usually have only a couple of notebooks for 148 00:09:49,680 --> 00:09:55,490 model training and one or two notebooks specifically for EDA purposes. 149 00:09:55,490 --> 00:10:01,660 Finally, if you feel really uncomfortable with the given computational resources, 150 00:10:01,660 --> 00:10:05,290 don't struggle for weeks, just rent a larger server. 151 00:10:06,710 --> 00:10:09,790 >> I start every competition with a very simple basic solution 152 00:10:09,790 --> 00:10:11,590 that can even be primitive. 153 00:10:11,590 --> 00:10:15,950 The main purpose of such a solution is not to build a good model, but 154 00:10:15,950 --> 00:10:18,960 to debug the full pipeline, from the very beginning 155 00:10:18,960 --> 00:10:23,550 of reading the data to the very end, when we write the submission file in the desired format. 156 00:10:23,550 --> 00:10:26,820 I advise you to start with the construction of this initial pipeline. 157 00:10:26,820 --> 00:10:30,322 Often you can find one in baseline solutions provided by organizers or 158 00:10:30,322 --> 00:10:31,270 in kernels, but 159 00:10:31,270 --> 00:10:33,630 I encourage you to read them carefully and write your own.
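To make this concrete, here is a minimal sketch of such an end-to-end baseline, using the simple train-test split and LightGBM with early stopping mentioned above. All file and column names are hypothetical, and the callback syntax assumes a recent lightgbm version:

    import pandas as pd
    import lightgbm as lgb
    from sklearn.model_selection import train_test_split

    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')
    feats = [c for c in train.columns if c not in ('id', 'target')]

    # Simple train-test split instead of a full cross-validation loop.
    X_tr, X_val, y_tr, y_val = train_test_split(
        train[feats], train['target'], test_size=0.2, random_state=42)

    # Early stopping, so the number of boosting iterations needs no tuning.
    model = lgb.LGBMRegressor(n_estimators=10000, learning_rate=0.05)
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
              callbacks=[lgb.early_stopping(100)])

    # Debug the pipeline end to end: write the submission file too.
    pd.DataFrame({'id': test['id'], 'target': model.predict(test[feats])}) \
        .to_csv('submission.csv', index=False)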
160 00:10:34,890 --> 00:10:39,270 Also, I advise you to follow a from-simple-to-complex approach in other things as well. 161 00:10:39,270 --> 00:10:41,800 For example, I prefer to start with Random Forest rather than 162 00:10:41,800 --> 00:10:43,360 Gradient Boosted Decision Trees. 163 00:10:43,360 --> 00:10:45,438 At least, Random Forest works quite fast and 164 00:10:45,438 --> 00:10:47,869 requires almost no tuning of hyperparameters. 165 00:10:49,060 --> 00:10:52,660 Participation in a data science competition implies the analysis of data, 166 00:10:52,660 --> 00:10:55,530 generation of features, and manipulation of models. 167 00:10:55,530 --> 00:10:59,570 This process is very similar in spirit to the development of software, and 168 00:10:59,570 --> 00:11:02,390 there are many good practices that I advise you to follow. 169 00:11:02,390 --> 00:11:03,750 I will name just a few of them. 170 00:11:04,910 --> 00:11:07,650 First of all, use good variable names. 171 00:11:07,650 --> 00:11:10,470 No matter how ingenious you are, if your code is written badly, 172 00:11:10,470 --> 00:11:14,210 you will surely get confused in it and will have problems sooner or later. 173 00:11:15,450 --> 00:11:17,750 Second, keep your research reproducible. 174 00:11:17,750 --> 00:11:18,973 Fix all random seeds. 175 00:11:18,973 --> 00:11:21,780 Write down exactly how a feature was generated, and 176 00:11:21,780 --> 00:11:24,800 store the code under a version control system like git. 177 00:11:24,800 --> 00:11:27,800 Very often there are situations when you need to go back to the model you 178 00:11:27,800 --> 00:11:30,962 built two weeks ago and add it to the ensemble. 179 00:11:30,962 --> 00:11:34,920 The last and probably the most important thing: reuse your code. 180 00:11:34,920 --> 00:11:38,568 It's really important to use the same code at the training and testing stages. 181 00:11:38,568 --> 00:11:42,040 For example, features should be prepared and transformed by the same code 182 00:11:42,040 --> 00:11:46,565 in order to guarantee that they're produced in a consistent manner, as in the sketch at the end of this segment. 183 00:11:46,565 --> 00:11:49,290 Errors in such places are very difficult to catch, so 184 00:11:49,290 --> 00:11:51,540 it's better to be very careful here. 185 00:11:51,540 --> 00:11:55,985 I recommend moving reusable code into separate functions, or even a separate module. 186 00:11:57,090 --> 00:11:59,610 In addition, I advise you to read scientific articles on the topic 187 00:11:59,610 --> 00:12:01,450 of the competition. 188 00:12:01,450 --> 00:12:03,880 They can provide you with information about machine learning 189 00:12:03,880 --> 00:12:08,560 related things, for example, how to better optimize a measure like AUC, 190 00:12:08,560 --> 00:12:10,300 or provide domain knowledge of the problem. 191 00:12:11,710 --> 00:12:14,460 This is often very useful for feature generation. 192 00:12:14,460 --> 00:12:18,290 For example, during the Microsoft Malware competition, I read articles about malware 193 00:12:18,290 --> 00:12:21,410 detection and used ideas from them to generate new features.
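Returning to the point about code reuse, here is a minimal sketch of keeping feature preparation in one function that is applied identically to both stages (the column names are hypothetical, and train and test are assumed to be already-loaded DataFrames):

    import numpy as np
    import pandas as pd

    def build_features(df):
        # One function for both stages guarantees that train and test
        # features are produced in a consistent manner.
        out = pd.DataFrame(index=df.index)
        out['log_price'] = np.log1p(df['price'])
        out['weekday'] = pd.to_datetime(df['date']).dt.weekday
        return out

    # The same code runs at the training and testing stages.
    X_train = build_features(train)
    X_test = build_features(test)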
194 00:12:22,710 --> 00:12:26,535 >> I usually start a competition by monitoring the forums and kernels. 195 00:12:27,560 --> 00:12:32,720 It happens that a competition starts, someone finds a bug in the data, 196 00:12:32,720 --> 00:12:37,010 and the competition data is then completely changed, so 197 00:12:37,010 --> 00:12:39,820 I never join a competition at its very beginning. 198 00:12:41,390 --> 00:12:45,690 I usually start a competition with a quick EDA and a simple baseline. 199 00:12:45,690 --> 00:12:49,080 I try to check the data for various leakages. 200 00:12:49,080 --> 00:12:54,280 For me, the leaks are one of the most interesting parts of a competition. 201 00:12:54,280 --> 00:12:57,710 I then usually do several submissions to check if the validation score 202 00:12:57,710 --> 00:12:59,710 correlates with the public leaderboard score. 203 00:13:01,480 --> 00:13:05,630 Usually, I try to come up with a list of things to try in a competition, and 204 00:13:05,630 --> 00:13:07,660 I more or less try to follow it. 205 00:13:08,880 --> 00:13:13,270 But sometimes I just try to generate as many features as possible, 206 00:13:13,270 --> 00:13:17,364 put them into XGBoost, and study what helps and what does not. 207 00:13:17,364 --> 00:13:22,440 When tuning parameters, I first try to make the model overfit 208 00:13:22,440 --> 00:13:26,880 to the training set, and only then I change parameters to constrain the model. 209 00:13:28,050 --> 00:13:32,908 I have had situations when I could not reproduce one of my submissions. 210 00:13:32,908 --> 00:13:36,630 I accidentally changed something in the code and I could not remember what 211 00:13:36,630 --> 00:13:42,200 exactly, so nowadays I'm very careful about my code and scripts. 212 00:13:44,040 --> 00:13:45,114 Another problem: 213 00:13:45,114 --> 00:13:51,220 long execution history in notebooks leads to lots of defined global variables, 214 00:13:51,220 --> 00:13:54,340 and global variables surely lead to bugs. 215 00:13:54,340 --> 00:13:57,220 So remember to restart your notebooks from time to time. 216 00:13:58,280 --> 00:14:03,170 It's okay to have ugly code, as long as you do not use it to produce a submission. 217 00:14:04,390 --> 00:14:05,240 It would be easier for 218 00:14:05,240 --> 00:14:09,350 you to get back into this code later if it has descriptive variable names. 219 00:14:09,350 --> 00:14:13,720 I always use git and try to make the code for 220 00:14:13,720 --> 00:14:16,850 submissions as transparent as possible. 221 00:14:16,850 --> 00:14:18,840 I usually create a separate notebook for 222 00:14:18,840 --> 00:14:23,320 every submission, so I can always run a previous solution and compare. 223 00:14:24,420 --> 00:14:27,720 And I treat the submission notebooks as scripts: 224 00:14:27,720 --> 00:14:31,550 I restart the kernel and always run them from top to bottom. 225 00:14:32,808 --> 00:14:37,672 I have found a convenient way to validate models that allows me to reuse the validation 226 00:14:37,672 --> 00:14:41,937 code with minimal changes when retraining a model on the whole dataset. 227 00:14:41,937 --> 00:14:46,420 In a competition, we are provided with train and test CSV files. 228 00:14:46,420 --> 00:14:47,610 We load them in the first cell. 229 00:14:47,610 --> 00:14:52,820 In the second cell, we split the training set into actual training and 230 00:14:52,820 --> 00:14:57,367 validation sets, and save those to disk as CSV files with 231 00:14:57,367 --> 00:15:01,460 the same structure as the given train CSV and test CSV. 232 00:15:03,060 --> 00:15:08,240 Now, at the top of the notebook with my model, I define variables: 233 00:15:08,240 --> 00:15:11,320 paths to the train and test sets. 234 00:15:11,320 --> 00:15:12,834 I set them to the created training and 235 00:15:12,834 --> 00:15:15,970 validation sets while working with the model and validating it. 236 00:15:16,990 --> 00:15:22,050 And then it only takes switching those paths to the original train CSV and 237 00:15:22,050 --> 00:15:24,290 test CSV to produce a submission.
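A minimal sketch of this setup (the file names are hypothetical):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Preparation notebook: split the given training data once and save
    # the parts with the same structure as the original files.
    full = pd.read_csv('train.csv')
    tr, val = train_test_split(full, test_size=0.2, random_state=42)
    tr.to_csv('train_part.csv', index=False)
    val.to_csv('val_part.csv', index=False)

    # Top of the modelling notebook: point the paths at the split while
    # validating; switch them back to the original train.csv and test.csv
    # and rerun top to bottom to produce a submission.
    TRAIN_PATH = 'train_part.csv'  # later: 'train.csv'
    TEST_PATH = 'val_part.csv'     # later: 'test.csv'
    train = pd.read_csv(TRAIN_PATH)
    test = pd.read_csv(TEST_PATH)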
238 00:15:25,450 --> 00:15:27,700 I also use macros. 239 00:15:27,700 --> 00:15:32,230 At one point, I was really tired of typing import numpy as np every time. 240 00:15:33,840 --> 00:15:38,260 So I found that it's possible to define a macro which will load everything for me. 241 00:15:40,110 --> 00:15:42,610 In my case, it takes only five symbols to 242 00:15:44,020 --> 00:15:49,810 type the macro name, and this macro immediately loads everything for me. 243 00:15:49,810 --> 00:15:50,490 Very convenient. 244 00:15:52,540 --> 00:15:56,860 And finally, I have developed my own library with frequently used functions and 245 00:15:56,860 --> 00:15:58,870 training code for models. 246 00:15:58,870 --> 00:16:04,544 I personally find it useful, as the code now becomes much shorter, 247 00:16:04,544 --> 00:16:09,395 and I do not need to remember how to import a particular model. 248 00:16:09,395 --> 00:16:14,117 In my case, I just specify a model by its name, and 249 00:16:14,117 --> 00:16:21,795 as an output I get all the information about training that I would possibly need. 250 00:16:21,795 --> 00:16:24,514 [SOUND] 251 00:16:24,514 --> 00:16:31,699 [MUSIC]