[SOUND] Hi. Up to this moment, we have already discussed all the basic topics that build up to a big solution: feature generation, validation, mean encodings, and so on. We went through several competitions together and tried our best to unite everything we learned into one huge framework. But as with any other set of tools, there are a lot of heuristics which people often find only through trial and error, spending significant time learning how to use these tools efficiently. So to help you out here, in this video we'll share things we learned the hard way, by experience. These things may vary from one person to another, so we decided that each of us will present his own guidelines personally, to stress the possible diversity of approaches and to emphasize different points. Some notes might seem obvious to you, some may not, but be sure that following even some of them can save you a lot of time. So, let's start.

When you want to enter a competition, define your goals and try to estimate what you can get out of your participation. You may want to learn more about an interesting problem, you may want to get acquainted with new software tools and packages, or you may want to try to hunt for a medal. Each of these goals will influence what competition you choose to participate in.

If you want to learn more about an interesting problem, you may want the competition to have a wide discussion on the forums. For example, if you are interested in applications of data science to medicine, you can try to predict lung cancer in the Data Science Bowl 2017, or to predict seizures in long-term human EEG recordings in the Melbourne University Seizure Prediction competition.

If you want to get acquainted with new software tools, you may want the competition to have the required tutorials. For example, if you want to learn a neural networks library, you may choose any of the competitions with images, like The Nature Conservancy Fisheries Monitoring competition or the Planet: Understanding the Amazon from Space competition.

And if you want to try to hunt for a medal, you may want to check how many submissions participants have. If the top participants have over one hundred submissions, it can be a clear sign of a leaky problem or of difficulties in validation, including inconsistency of validation and leaderboard scores. On the other hand, if there are people with few submissions at the top, that usually means there is a non-trivial approach to this competition, or that it was discovered by only a few people. Besides that, you may want to pay attention to the size of the top teams. If the leaderboard mostly consists of teams with only one participant, you'll probably have enough chances if you gather a good team.

Now, let's move to the next step after you choose a competition. As soon as you get familiar with the data, start to write down your ideas about what you may want to try later. What things could work here? What approaches may you want to take? After you're done, read the forums and highlight interesting posts and topics. Remember, you can get a lot of information and meet new people on the forums, so I strongly encourage you to participate in these discussions. After the initial pipeline is ready and you have written down a few ideas, you may want to start improving your solution. Personally, I like to organize these ideas into some structure. So you may want to sort ideas in priority order: the most important and promising ideas need to be implemented first.
Or you may want to organize these ideas into topics: ideas about feature generation, validation, metric optimization, and so on. Now pick up an idea and implement it. Try to derive some insights on the way. Especially, try to understand why something does or doesn't work. For example, you have an idea about trying a deep gradient boosting decision tree model. To your joy, it works. Now, ask yourself why. Is there some hidden data structure we didn't notice before? Maybe you have categorical features with a lot of unique values. If this is the case, you can as well conclude that mean encodings may work great here. So in some sense, the ability to analyze your work and derive conclusions while you're trying out your ideas will get you on the right track to reveal hidden data patterns and leaks.

After we have checked out the most important ideas, you may want to switch to parameter tuning. I personally like the view that everything is a parameter: from the number of features, to the gradient boosting decision tree depth, from the number of layers in a convolutional neural network, to the coefficient your final submission is multiplied by. To understand what I should tune and change first, I like to sort all parameters by these principles. First, importance: arrange parameters from important to not useful at all, and tune them in this order. This may depend on the data structure, on the target, on the metric, and so on. Second, feasibility: rate parameters from "it is easy to tune" to "tuning this can take forever". Third, understanding: rate parameters from "I know what it's doing" to "I have no idea". Here it is important to understand what each parameter will change in the whole pipeline. For example, if you increase the number of features significantly, you may want to change the ratio of columns used to find the best split in a gradient boosting decision tree. Or, if you change the number of layers in a convolutional neural network, you will need more epochs to train it, and so on. So, these were some of my practical guidelines. I hope they will prove useful for you as well.

Every problem starts with data loading and preprocessing. I usually don't pay much attention to some suboptimal usage of computational resources, but this particular case is of crucial importance. Doing things right at the very beginning will make your life much simpler and will allow you to save a lot of time and computational resources. I usually start with basic data preprocessing like label encoding of categorical features and joining additional data. Then I dump the resulting data into HDF5 or NPY format: HDF5 for pandas dataframes, and NPY for numpy arrays. Running experiments often requires a lot of kernel restarts, which leads to reloading all the data. Loading large CSV files may take minutes, while loading data from HDF5 or NPY formats is performed in a matter of seconds. Another important matter is that by default, pandas is known to store data in 64-bit arrays, which is unnecessary in most situations. Downcasting everything to 32 bits will result in a two-fold memory saving. Also keep in mind that pandas supports reading data by chunks out of the box, via the chunksize parameter of the read_csv function, so most datasets may be processed without a lot of memory.

When it comes to performance evaluation, I am not a big fan of extensive validation. Even for medium-sized datasets like 50,000 or 100,000 rows, you can validate your models with a simple train-test split instead of a full cross-validation loop.
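To make this concrete, here is a minimal sketch of that loading and caching workflow. The file names (train.csv, train.h5, train_matrix.npy) and the chunk size are placeholders of my own choosing, and to_hdf assumes the PyTables package is installed:

```python
import numpy as np
import pandas as pd

# Read a large CSV by chunks so it fits in memory,
# downcasting 64-bit columns to 32 bits along the way.
chunks = []
for chunk in pd.read_csv('train.csv', chunksize=100_000):
    for col in chunk.select_dtypes(include=['float64']).columns:
        chunk[col] = chunk[col].astype(np.float32)
    for col in chunk.select_dtypes(include=['int64']).columns:
        chunk[col] = chunk[col].astype(np.int32)
    chunks.append(chunk)
train = pd.concat(chunks, ignore_index=True)

# Dump once, reload in seconds after every kernel restart.
train.to_hdf('train.h5', key='train', mode='w')  # pandas dataframe -> HDF5 (needs PyTables)
np.save('train_matrix.npy', train.values)        # fully numeric data -> NPY

# Later: fast reload instead of parsing the CSV again.
train = pd.read_hdf('train.h5', 'train')
matrix = np.load('train_matrix.npy')
```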
Switch to full CV only when it is really needed, for example, when you've already hit some limits and can move forward only with some marginal improvements. The same logic applies to initial model choice. I usually start with LightGBM, find some reasonably good parameters, and evaluate the performance of my features. I want to emphasize that I use early stopping, so I don't need to tune the number of boosting iterations. And God forbid you start with SVMs, random forests, or neural networks: you will waste too much time just waiting for them to fit. I switch to tuning the models, sampling, and stacking only when I am satisfied with feature engineering. In some ways, I describe my approach as "fast and dirty always better". Try focusing on what is really important: the data. Do EDA, try different features, google domain-specific knowledge. Your code is secondary. Creating unnecessary classes and personal frameworks may only make things harder to change and will result in wasting your time, so keep things simple and reasonable. Don't track every little change. By the end of a competition, I usually have only a couple of notebooks for model training and one notebook specifically for EDA purposes. Finally, if you feel really uncomfortable with the given computational resources, don't struggle for weeks, just rent a larger server.

I start every competition with a very simple basic solution that can even be primitive. The main purpose of such a solution is not to build a good model but to debug the full pipeline, from the very beginning where we read the data, to the very end where we write the submission file in the required format. I advise you to start with the construction of this initial pipeline. Often you can find it in baseline solutions provided by the organizers or in kernels. I encourage you to read them carefully and write your own. I also advise you to follow a from-simple-to-complex approach in other things. For example, I prefer to start with Random Forest rather than Gradient Boosted Decision Trees. At least Random Forest works quite fast and requires almost no tuning of hyperparameters.

Participation in a data science competition implies the analysis of data, generation of features, and manipulations with models. This process is very similar in spirit to the development of software, and there are many good practices that I advise you to follow. I will name just a few of them. First of all, use good variable names. No matter how ingenious you are, if your code is written badly, you will surely get confused in it and you'll have a problem sooner or later. Second, keep your research reproducible. Fix all random seeds. Write down exactly how a feature was generated, and store the code under a version control system like git. Very often there are situations when you need to go back to the model you built two weeks ago and add it to the ensemble. The last and probably the most important thing: reuse your code. It's really important to use the same code at the training and testing stages. For example, features should be prepared and transformed by the same code in order to guarantee that they're produced in a consistent manner. Errors in such places are very difficult to catch, so it's better to be very careful with it. I recommend moving reusable code into separate functions or even separate modules. In addition, I advise you to read scientific articles on the topic of the competition. They can provide you with information about machine learning related things, for example, how to better optimize a metric like AUC.
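As an illustration of these points, here is a minimal sketch of such a quick baseline: a fixed seed, a shared feature-preparation function, a simple train-test split, and LightGBM with early stopping. The dataframes train_df and test_df, the 'target' column, and the 'total_price'/'item_count' columns are hypothetical, and the callback-based early stopping assumes a reasonably recent LightGBM version:

```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split

SEED = 42  # fix random seeds to keep experiments reproducible

def prepare_features(df):
    # Reused for both train and test so features are built consistently.
    out = df.copy()
    out['price_per_item'] = out['total_price'] / (out['item_count'] + 1)  # hypothetical columns
    return out

X = prepare_features(train_df.drop(columns=['target']))
y = train_df['target']
X_test = prepare_features(test_df)

# A simple train-test split instead of a full cross-validation loop.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=SEED)

model = lgb.LGBMRegressor(n_estimators=10000, learning_rate=0.05, random_state=SEED)
model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(100)],  # early stopping: no need to tune n_estimators
)
predictions = model.predict(X_test, num_iteration=model.best_iteration_)
```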
Or provide domain knowledge about the problem. This is often very useful for feature generation. For example, during the Microsoft Malware competition, I read articles about malware detection and used ideas from them to generate new features.

>> I usually start a competition by monitoring the forums and kernels. It happens that a competition starts, someone finds a bug in the data, and the competition data is then completely changed, so I never join a competition at its very beginning. I usually start a competition with a quick EDA and a simple baseline. I try to check the data for various leakages. For me, the leaks are one of the most interesting parts of a competition. I then usually do several submissions to check if the validation score correlates with the public leaderboard score. Usually, I try to come up with a list of things to try in a competition, and I more or less try to follow it. But sometimes I just try to generate as many features as possible, put them into XGBoost, and study what helps and what does not. When tuning parameters, I first try to make the model overfit to the training set, and only then I change parameters to constrain the model.

I have had situations when I could not reproduce one of my submissions. I accidentally changed something in the code and I could not remember what exactly, so nowadays I'm very careful about my code and scripts. Another problem: a long execution history in notebooks leads to lots of defined global variables, and global variables surely lead to bugs. So remember to restart your notebooks from time to time. It's okay to have ugly code, as long as you do not use it to produce a submission. It will be easier for you to get back into this code later if it has descriptive variable names. I always use git and try to make the code for submissions as transparent as possible. I usually create a separate notebook for every submission, so I can always run a previous solution and compare. And I treat the submission notebooks as scripts: I restart the kernel and always run them from top to bottom.

I found a convenient way to validate models that allows me to reuse the validation code with minimal changes when retraining a model on the whole dataset. In a competition, we are provided with train and test CSV files; we load them in the first cell. In the second cell, we split the training set into actual training and validation sets, and save those to disk as CSV files with the same structure as the given train CSV and test CSV. Now, at the top of the notebook with my model, I define variables: the paths to the train and test sets. I set them to the created training and validation sets when working with the model and validating it. And then it only takes switching those paths to the original train CSV and test CSV to produce a submission.

I also use macros. At one point I was really tired of typing "import numpy as np" every time, so I found that it's possible to define a macro which will load everything for me. In my case, it takes only five symbols to type the macro name, and the macro immediately loads everything for me. Very convenient. And finally, I have developed my own library with frequently used functions and training code for models. I personally find it useful, as the code now becomes much shorter, and I do not need to remember how to import a particular model. In my case, I just specify a model by its name, and as an output I get all the information about training that I would possibly need. [SOUND] [MUSIC]
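Here is a minimal sketch of that path-switching setup, with hypothetical file names; the point is that only the two path variables change between a validation run and the final submission run:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Run once: split the original training data and save the parts
# with the same structure as the provided files.
full_train = pd.read_csv('train.csv')
tr, val = train_test_split(full_train, test_size=0.2, random_state=42)
tr.to_csv('train_part.csv', index=False)
val.to_csv('val_part.csv', index=False)

# Top of the modelling notebook: only these two paths ever change.
TRAIN_PATH = 'train_part.csv'  # switch to 'train.csv' for the final submission
TEST_PATH = 'val_part.csv'     # switch to 'test.csv' for the final submission

train = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)
# ...the rest of the notebook trains on `train` and predicts on `test`,
# regardless of whether this is a validation run or a submission run.
```

As for the macro trick, I do not know the speaker's exact setup, but one way to get this behaviour in IPython is to put the common imports in one cell, run something like %macro __imp <cell number> followed by %store __imp, and in later sessions restore it with %store -r __imp; typing the short macro name then re-executes all the imports.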