Hi, in this video I will tell you about the Microsoft Malware Classification Challenge. We participated in a team with other lecturers of this course, Mikhail Trofimov and Stanislav Semenov, and we got third place in this competition. The plan for the presentation is the following. We will start with the data description and a little bit of EDA. We'll then talk about feature extraction, then we will discuss how we did feature preprocessing and selection. We will see how we did the modeling, and finally we will explain several tricks that allowed us to get higher on the leaderboard.

So let's start with the problem description. In this competition, we were provided with about half a terabyte of malicious software executables. Each executable was represented in two forms. The first is a HEX dump. A HEX dump is just a representation of a file as a sequence of bytes; it is a raw format. The file on disk is stored as a sequence of bytes, and that is exactly what we see in the HEX dump. The second form is a listing generated by the Interactive Disassembler, or IDA for short. IDA tries to convert the low-level sequence of bytes from the HEX dump into a sequence of assembler commands. Take a look at the bottom screenshot: on the left, we see sequences of bytes from the HEX dump, and on the right, we see what IDA has converted them to.

The task was to classify malware files into nine families. We were provided with a train set of about 10,000 examples with known labels and a test set of similar size. The evaluation metric was multi-class logarithmic loss. Note that the final loss in this competition was very, very low: the corresponding accuracy was more than 99.7%, which is an enormously high accuracy.

Remember, it is sometimes beneficial to examine metadata. In this competition, we were given seven zip archives with files, and the filenames look like hash values. Actually, we did not find any metadata helpful for our models, but it was interesting how the train/test split was done by the organizers. The organizers used the first letter of the file names for the split. On this plot, on the x axis we see the first characters of the file names, and on the y axis we have the number of files whose names start with the given character. The plots actually look cool, but the only information we get from them is that the train/test split is in fact random, and we cannot use them to improve our models.

So in this competition we were given raw data, and we needed to extract the features ourselves. Let's see what features we extracted. Well, in fact, no one from our team had domain knowledge or previous experience in malware classification; we did not even know what an IDA disassembly is. So we started really simple. Remember, each executable in the dataset was represented by two files, so our first two features were the sizes of those files, or equivalently their lengths. And surprisingly, we got 88% accuracy just by using these two features. On the plot, the x axis corresponds to the index of an object in the train set (we actually sorted the objects by their class here), and the y axis shows the size of the HEX dump file for every object. We see this feature is quite discriminative.

The simplest features we could derive out of a sequence are the counts of its elements, right? This is basically what the function value_counts from pandas does, so that is what we did. We counted bytes in the HEX dump files, and that is how we got 257 features: 257 because there are 256 byte values, and there was one special symbol (see the sketch right below).
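To make this concrete, here is a minimal Python sketch of such byte-count features. This is not our original code: it assumes the HEX dump is a text file where each line starts with an address followed by byte tokens, with a token such as "??" standing for the one special symbol, and all names and details are illustrative.

```python
from collections import Counter

def byte_count_features(path):
    """257 count features from one HEX dump file (assumed line format: address + byte tokens)."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts.update(line.split()[1:])          # drop the leading address token
    vocab = ['{:02X}'.format(b) for b in range(256)] + ['??']  # 256 byte values + special symbol
    return [counts.get(token, 0) for token in vocab]

# the two file-size features mentioned above are even simpler:
# import os; size_hex = os.path.getsize('file.bytes'); size_asm = os.path.getsize('file.asm')
```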
We achieved almost 99% accuracy using those features. At the same time, we started to read about the disassembly format and about papers on malware classification, and so we got an idea of what features to generate. We looked into the disassembly files and used regular expressions to extract various things. Many of the features we extracted were thrown away immediately; some were really good.

What we actually did first was count system calls. You see them on the slide; they are also called API calls in Windows, as we read in a paper. And here is the error rate we got with this feature, it's pretty good. We counted assembler commands like mov, push, call, and so on, and they worked rather well. We also tried to extract common sequences, like common n-grams, and extract features out of them, but we didn't manage to get a good result. The best features we had were section counts: just count the number of lines in the disassembly file which start with .text or .data, or something like that. The classification accuracy with these features was more than 99%. By incorporating all of these, we were able to get an error rate of less than 0.5%.

From our text mining experience, we knew that n-grams could be a good discriminative feature. So we computed bigrams. We found a paper which proposed to use 4-grams; we computed them too, and we even went further and extracted 10-grams. Of course, we couldn't work with 10-grams directly, so we performed a nice feature selection, which was one of the most creative ideas in this competition. We will see later in this video how exactly we did it. Interestingly, just by applying the feature selection to 10-grams and fitting XGBoost, we could get a place in the top 30.

Another feature we found interesting is entropy. We computed the entropy of small segments of the byte sequence by moving a sliding window over the sequence and computing the entropy inside each window. So we got another sequence that could contain implicit information about the encrypted parts of the executable. You see, we expect the entropy to be high if the data is encrypted or compressed, and low if the data is structured. The motivation is that some malware is injected into normal executables and stored in an encrypted format. When the program starts, the malware first extracts itself in the background and then executes. So entropy features could help us detect encrypted trojans in the executables, and thus detect some classes of malware. But we got an entropy sequence of variable length, and we couldn't use those sequences as features directly, right? So we generated some statistics of the entropy distribution, like mean and median, and we also computed 20 percentiles and inverse percentiles and used them as features. The same statistics were extracted from entropy changes, that is, we first applied the diff function to the entropy sequence and then computed the statistics (see the entropy sketch below).

It was also possible to extract strings from the HEX dump by looking for printable sequences that end with a zero byte. We didn't use the strings themselves, but we computed the distribution of string lengths for each file and extracted similar statistics to the ones we extracted from the entropy sequences.

Okay, we have finished with feature extraction, so let's see how we preprocessed the features and performed feature selection. By that moment we had generated a lot of features. We wanted to incorporate all of them into the classifier, but we could not fit the classifier efficiently when we had, say, 20,000 features.
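Before moving on to feature selection, here is a minimal sketch of the sliding-window entropy features described above, assuming `data` is the raw byte content of one executable. The window size, step, and the exact set of percentiles are illustrative choices, not the values we actually used.

```python
import numpy as np

def window_entropy(data, window=4096, step=1024):
    """Shannon entropy of a sliding window over a byte string."""
    entropies = []
    for start in range(0, max(len(data) - window, 1), step):
        chunk = np.frombuffer(data[start:start + window], dtype=np.uint8)
        counts = np.bincount(chunk, minlength=256)
        p = counts[counts > 0] / counts.sum()
        entropies.append(-(p * np.log2(p)).sum())
    return np.array(entropies)

def entropy_stats(ent):
    """Fixed-length statistics of a variable-length entropy sequence."""
    percentiles = np.percentile(ent, np.arange(5, 100, 5))   # inner percentiles
    return np.concatenate([[ent.mean(), np.median(ent)], percentiles])

# the same statistics over entropy changes:
# diff_stats = entropy_stats(np.diff(window_entropy(data)))
```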
Most features were probably useless, so we tried different feature selection methods and transformation techniques. Let's consider the byte count features. There are 257 of them, not much, but probably there is still some redundancy. We tried non-negative matrix factorization, NMF, and principal component analysis, PCA, in order to remove this redundancy. I will remind you that both NMF and PCA try to factorize the object-feature matrix X into a product of two low-rank matrices. You can think of that as finding a small number of basis vectors in the feature space, so that every object can be approximated by a linear combination of those basis vectors, and the coefficients of that approximation can be treated as new features for each object. The only difference between NMF and PCA is that NMF requires all the components of the low-rank matrices to be non-negative. We set the number of basis vectors to 15, and here we see scatter plots of the first two extracted features, one for NMF and one for PCA. These are actually the coefficients for the most important basis vectors. We used tree-based models, and it is obvious that the NMF features were a lot better for trees. So NMF works well when the non-negativity of the data is essential, and in our case we worked with counts, which are non-negative by nature.

A simple trick to get another pack of features by doing almost nothing is to apply a log transform to the data and compute NMF on the transformed data again. Let's think about what happens here. NMF by default uses MSE loss to measure the quality of the approximation it builds. This trick implicitly changes the loss NMF optimizes from MSE to RMSLE; just recall that RMSLE is MSE in the log space. So the decomposition process now pays attention to different things due to the different loss, and thus produces different features (a small sketch of this trick appears below).

We used a small pipeline to select 4-gram features. We removed rare features and applied a linear SVM with L1 regularization, as such models tend to select features. After that, we thresholded Random Forest feature importances to get a final feature set of only 131 features (a sketch of such a pipeline appears below, too). The pipeline was a little bit more complicated for 10-grams. First, we used hashing to reduce the dimensionality of the original features; we actually did it online while computing the 10-grams. We then selected about 800 features based on an L1-regularized SVM and Random Forest importances. This is how we got about 800 features instead of 2 to the power of 28.

But we went even further, and this is actually the most interesting part for me. The main problem with the feature selection I've just described is that we did it for the 10-grams independently of the other features that we already had by that moment. After selection, we could end up with really good features, but those features could contain the same information that we already had in other features. And we actually wanted to improve our feature set with 10-grams. So here is what we did instead. We generated out-of-fold predictions for the train set using all the features that we had. We sorted the objects by the predicted probability of their true class and tried to find features that would separate the most error-prone objects from the others. So actually, we used another model for it: we created a new dataset with the error-prone objects having label 1 and the others having label 0, trained a Random Forest, and selected 14 10-grams, well, actually hashes to be precise. We had a nice performance increase on the leaderboard when incorporating those 14 features.
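Here is a small sketch of the NMF trick with the log transform, assuming `byte_counts` is an array with the 257 byte-count features; the number of components and other parameters are illustrative.

```python
import numpy as np
from sklearn.decomposition import NMF

nmf_raw = NMF(n_components=15, random_state=0, max_iter=500)
feats_raw = nmf_raw.fit_transform(byte_counts)             # NMF on raw counts (MSE loss)

nmf_log = NMF(n_components=15, random_state=0, max_iter=500)
feats_log = nmf_log.fit_transform(np.log1p(byte_counts))   # NMF in log space, i.e. roughly RMSLE

new_features = np.hstack([feats_raw, feats_log])           # two diverse packs of 15 features each
```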
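And here is a sketch of the kind of selection pipeline described above for n-gram features, assuming `X` is a dense numpy array of n-gram counts and `y` the class labels. Using VarianceThreshold as a stand-in for removing rare features, and all thresholds and parameters, are illustrative choices, not our exact original setup.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.ensemble import RandomForestClassifier

def select_ngram_features(X, y, importance_threshold=1e-3):
    # 1) drop rare (almost constant) columns as a cheap first filter
    X1 = VarianceThreshold(threshold=1e-5).fit_transform(X)

    # 2) an L1-regularized linear SVM zeroes out weights of useless features
    l1_svm = SelectFromModel(LinearSVC(C=0.1, penalty='l1', dual=False, max_iter=5000))
    X2 = l1_svm.fit_transform(X1, y)

    # 3) keep only features whose Random Forest importance passes a threshold
    rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
    rf.fit(X2, y)
    return X2[:, rf.feature_importances_ > importance_threshold]
```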
This method actually could have led to severe overfitting, but fortunately it worked out nicely. All right, let's get to modeling. We didn't use stacking in this competition as we usually do now; it became popular slightly after this competition. We first used Random Forest everywhere, but it turned out it needed to be calibrated for log loss, so we switched to XGBoost and used it for all our modeling. Every person in the team extracted his own features and trained his own models with his own parameters, so our final solution was a combination of diverse models. We found that bagging worked quite well on this dataset. We even did the bagging in a very peculiar way: we concatenated our train set with a seven times larger set of objects sampled from the train set with replacement, and we used the resulting dataset, which is eight times larger than the original train set, to train XGBoost. Of course, we averaged about 20 models to account for randomness.

And finally, let's talk about several more tricks we used in this competition. The accuracy of the models was quite high, and we all know that the more data we have, the better. So we decided to try to use the test data for training our models. We just need some labels for the test set, right? We either used the predicted class or sampled the label from the predicted class distribution. Generated test set labels are usually called pseudo labels.

So how do we perform cross validation with pseudo labels? We split our train set into folds as we usually do while performing cross validation; in this example, we split the data into two folds. But what's different now is that, before training the model on a particular fold, we concatenate this fold with the test data. Then we switch the folds and do the same thing. You see, we train on the objects denoted with green color and predict the ones shown with red. Okay, and to get predictions for the test set, we again do a kind of cross-validation thing, like on the previous slide, but now we divide the test set into folds and concatenate the train set to each fold of the test set. In the end, we get out-of-fold predictions for the test set, and that is our submission (a sketch of this scheme appears below).

One of the crucial things to understand about this method is that sometimes we need two different models for it. This is the case when one model is very confident in its predictions, that is, the predicted probabilities are very close to 0 and 1. In this case, if we train a model, predict the test set, sample labels or take the most probable label, and retrain the same model with the same features, we get no improvement at all: we just did not introduce any new information with pseudo labeling this way. But if I ask, say, Stanislav to predict the test set with his model, and then I use his labels for the test set to train my model, that is where I will get a nice improvement. This actually becomes just another way to incorporate knowledge from two diverse models.

And the last thing that helped us a lot is per-class mixing. At the time of the competition, people usually mixed models linearly. We went further and mixed models by choosing coefficients for each class separately (a sketch appears below as well). In fact, if you think about it, it's very similar to stacking with a linear model of a special kind at the second level. But stacking became popular about a month after this competition, and what we did here is the simplest manual form of it. We published our source code online, so if you are interested, you can check it out. All right, this was an interesting competition.
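Here is a minimal sketch of cross validation with pseudo labels, assuming `X_train`, `y_train`, and `X_test` are arrays and `test_proba` holds the test set class probabilities predicted by another, diverse model; the number of folds and the XGBoost parameters are illustrative.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold

y_pseudo = test_proba.argmax(axis=1)          # or sample labels from the predicted distribution
n_classes = test_proba.shape[1]

oof = np.zeros((len(X_train), n_classes))
for tr_idx, val_idx in KFold(n_splits=2, shuffle=True, random_state=0).split(X_train):
    # train on one train fold concatenated with the pseudo-labeled test set
    X_tr = np.vstack([X_train[tr_idx], X_test])
    y_tr = np.concatenate([y_train[tr_idx], y_pseudo])
    model = xgb.XGBClassifier(n_estimators=500, learning_rate=0.05)
    model.fit(X_tr, y_tr)
    oof[val_idx] = model.predict_proba(X_train[val_idx])    # predict the held-out fold
```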
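And here is a small sketch of per-class mixing for two models, assuming `proba_a` and `proba_b` are their predicted probability matrices for the nine classes; the renormalization step and the example weights are illustrative choices, with the actual weights tuned on validation log loss.

```python
import numpy as np

def mix_per_class(proba_a, proba_b, class_weights):
    """Mix two probability matrices with a separate weight per class."""
    w = np.asarray(class_weights)                     # shape (n_classes,)
    mixed = w * proba_a + (1.0 - w) * proba_b         # per-class linear combination
    return mixed / mixed.sum(axis=1, keepdims=True)   # renormalize rows to sum to 1

# example: trust model A more on classes 0-4 and model B more on classes 5-8
# mixed = mix_per_class(proba_a, proba_b, [0.7] * 5 + [0.3] * 4)
```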
It was challenging from a technical point of view, as we needed to manipulate more than half a terabyte of data, but it was very interesting. The data was given in a raw format, and the actual challenge was to come up with nice features, and that is where we could be creative. Thank you.