Hi, in this video I will tell you about the Microsoft Malware Classification Challenge. We participated in a team with other lecturers of this course, Mikhail Trofimov and Stanislav Semenov, and we got third place in this competition. The plan for the presentation is the following. We will start with the data description and a little bit of EDA. We'll then talk about feature extraction, then we will discuss how we did feature preprocessing and selection. We will see how we did the modeling, and finally we will explain several tricks that allowed us to get higher on the leaderboard.

So let's start with the problem description. In this competition, we were provided with about half a terabyte of malicious software executables. Each executable was represented in two forms. The first is a HEX dump. A HEX dump is just a representation of a file as a sequence of bytes; it is a raw format. The file on disk is stored as a sequence of bytes, and that is exactly what we see in the HEX dump. The second form is a listing generated by the Interactive Disassembler, or IDA for short. IDA tries to convert the low-level sequence of bytes from the HEX dump into a sequence of assembler commands. Take a look at the bottom screenshot: on the left, we see sequences of bytes from the HEX dump, and on the right, we see what IDA has converted them to.

The task was to classify malware files into nine families. We were provided with a train set of about 10,000 examples with known labels and a test set of similar size. The evaluation metric was multi-class logarithmic loss. Note that the final loss in this competition was very, very low: the corresponding accuracy was more than 99.7%, which is an enormously high accuracy.

Remember, it is sometimes beneficial to examine metadata. In this competition, we were given seven zip archives with files, and the filenames look like hash values. Actually, we did not find any metadata helpful for our models, but it was interesting how the train/test split was done by the organizers. The organizers used the first letter of the file names for the split. On this plot, on the x axis we see the first characters of the file names, and on the y axis we have the number of files whose names start with the given character. The plots actually look cool, but the only information we get from them is that the train/test split is in fact random, and we cannot use them to improve our models.

So in this competition we were given raw data, and we needed to extract the features ourselves. Let's see what features we extracted. Well, in fact, no one from our team had domain knowledge or previous experience in malware classification; we did not even know what an IDA disassembly is. So we started really simple. Remember, each executable in the dataset was represented by two files, so our first two features were the sizes of those files, or equivalently their lengths. And surprisingly, we got 88% accuracy just by using these two features. On the plot, the x axis corresponds to the index of an object in the train set (we actually sorted the objects by their class here), and the y axis shows the size of the HEX dump file for every object. We see this feature is quite discriminative.

The simplest features we could derive out of a sequence are the counts of its elements, right? This is basically what the function value_counts from pandas does, so that is what we did. We counted bytes in the HEX dump files, and that is how we got 257 features: 257 because there are 256 byte values, and there was one special symbol (see the sketch right below).
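To make this concrete, here is a minimal Python sketch of such byte-count features. This is not our original code: it assumes the HEX dump is a text file where each line starts with an address followed by byte tokens, with a token such as "??" standing for the one special symbol, and all names and details are illustrative.

```python
from collections import Counter

def byte_count_features(path):
    """257 count features from one HEX dump file (assumed line format: address + byte tokens)."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts.update(line.split()[1:])          # drop the leading address token
    vocab = ['{:02X}'.format(b) for b in range(256)] + ['??']  # 256 byte values + special symbol
    return [counts.get(token, 0) for token in vocab]

# the two file-size features mentioned above are even simpler:
# import os; size_hex = os.path.getsize('file.bytes'); size_asm = os.path.getsize('file.asm')
```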
We achieved almost 99% accuracy using those features. At the same time, we started to read about the disassembly format and about papers on malware classification, and so we got an idea of what features to generate. We looked into the disassembly files and used regular expressions to extract various things. Many of the features we extracted were thrown away immediately; some were really good.

What we actually did first was count system calls. You see them on the slide; they are also called API calls in Windows, as we read in a paper. And here is the error rate we got with this feature, it's pretty good. We counted assembler commands like mov, push, call, and so on, and they worked rather well. We also tried to extract common sequences, like common n-grams, and extract features out of them, but we didn't manage to get a good result. The best features we had were section counts: just count the number of lines in the disassembly file which start with .text or .data, or something like that. The classification accuracy with these features was more than 99%. By incorporating all of these, we were able to get an error rate of less than 0.5%.

From our text mining experience, we knew that n-grams could be a good discriminative feature. So we computed bigrams. We found a paper which proposed to use 4-grams; we computed them too, and we even went further and extracted 10-grams. Of course, we couldn't work with 10-grams directly, so we performed a nice feature selection, which was one of the most creative ideas in this competition. We will see later in this video how exactly we did it. Interestingly, just by applying the feature selection to 10-grams and fitting XGBoost, we could get a place in the top 30.

Another feature we found interesting is entropy. We computed the entropy of small segments of the byte sequence by moving a sliding window over the sequence and computing the entropy inside each window. So we got another sequence that could contain implicit information about the encrypted parts of the executable. You see, we expect the entropy to be high if the data is encrypted or compressed, and low if the data is structured. The motivation is that some malware is injected into normal executables and stored in an encrypted format. When the program starts, the malware first extracts itself in the background and then executes. So entropy features could help us detect encrypted trojans in the executables, and thus detect some classes of malware. But we got an entropy sequence of variable length, and we couldn't use those sequences as features directly, right? So we generated some statistics of the entropy distribution, like mean and median, and we also computed 20 percentiles and inverse percentiles and used them as features. The same statistics were extracted from entropy changes, that is, we first applied the diff function to the entropy sequence and then computed the statistics (see the entropy sketch below).

It was also possible to extract strings from the HEX dump by looking for printable sequences that end with a zero byte. We didn't use the strings themselves, but we computed the distribution of string lengths for each file and extracted similar statistics to the ones we extracted from the entropy sequences.

Okay, we have finished with feature extraction, so let's see how we preprocessed the features and performed feature selection. By that moment we had generated a lot of features. We wanted to incorporate all of them into the classifier, but we could not fit the classifier efficiently when we had, say, 20,000 features.
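Before moving on to feature selection, here is a minimal sketch of the sliding-window entropy features described above, assuming `data` is the raw byte content of one executable. The window size, step, and the exact set of percentiles are illustrative choices, not the values we actually used.

```python
import numpy as np

def window_entropy(data, window=4096, step=1024):
    """Shannon entropy of a sliding window over a byte string."""
    entropies = []
    for start in range(0, max(len(data) - window, 1), step):
        chunk = np.frombuffer(data[start:start + window], dtype=np.uint8)
        counts = np.bincount(chunk, minlength=256)
        p = counts[counts > 0] / counts.sum()
        entropies.append(-(p * np.log2(p)).sum())
    return np.array(entropies)

def entropy_stats(ent):
    """Fixed-length statistics of a variable-length entropy sequence."""
    percentiles = np.percentile(ent, np.arange(5, 100, 5))   # inner percentiles
    return np.concatenate([[ent.mean(), np.median(ent)], percentiles])

# the same statistics over entropy changes:
# diff_stats = entropy_stats(np.diff(window_entropy(data)))
```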
Most features were probably useless, so we tried different feature selection methods and transformation techniques. Let's consider the byte count features. There are 257 of them, not much, but probably there is still some redundancy. We tried non-negative matrix factorization, NMF, and principal component analysis, PCA, in order to remove this redundancy. I will remind you that both NMF and PCA try to factorize the object-feature matrix X into a product of two low-rank matrices. You can think of that as finding a small number of basis vectors in the feature space, so that every object can be approximated by a linear combination of those basis vectors, and the coefficients of that approximation can be treated as new features for each object. The only difference between NMF and PCA is that NMF requires all the components of the low-rank matrices to be non-negative. We set the number of basis vectors to 15, and here we see scatter plots of the first two extracted features, one for NMF and one for PCA. These are actually the coefficients for the most important basis vectors. We used tree-based models, and it is obvious that the NMF features were a lot better for trees. So NMF works well when the non-negativity of the data is essential, and in our case we worked with counts, which are non-negative by nature.

A simple trick to get another pack of features by doing almost nothing is to apply a log transform to the data and compute NMF on the transformed data again. Let's think about what happens here. NMF by default uses MSE loss to measure the quality of the approximation it builds. This trick implicitly changes the loss NMF optimizes from MSE to RMSLE; just recall that RMSLE is MSE in the log space. So the decomposition process now pays attention to different things due to the different loss, and thus produces different features (a small sketch of this trick appears below).

We used a small pipeline to select 4-gram features. We removed rare features and applied a linear SVM with L1 regularization, as such models tend to select features. After that, we thresholded Random Forest feature importances to get a final feature set of only 131 features (a sketch of such a pipeline appears below, too). The pipeline was a little bit more complicated for 10-grams. First, we used hashing to reduce the dimensionality of the original features; we actually did it online while computing the 10-grams. We then selected about 800 features based on an L1-regularized SVM and Random Forest importances. This is how we got about 800 features instead of 2 to the power of 28.

But we went even further, and this is actually the most interesting part for me. The main problem with the feature selection I've just described is that we did it for the 10-grams independently of the other features that we already had by that moment. After selection, we could end up with really good features, but those features could contain the same information that we already had in other features. And we actually wanted to improve our feature set with 10-grams. So here is what we did instead. We generated out-of-fold predictions for the train set using all the features that we had. We sorted the objects by the predicted probability of their true class and tried to find features that would separate the most error-prone objects from the others. So actually, we used another model for it: we created a new dataset with the error-prone objects having label 1 and the others having label 0, trained a Random Forest, and selected 14 10-grams, well, actually hashes to be precise. We had a nice performance increase on the leaderboard when incorporating those 14 features.
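Here is a small sketch of the NMF trick with the log transform, assuming `byte_counts` is an array with the 257 byte-count features; the number of components and other parameters are illustrative.

```python
import numpy as np
from sklearn.decomposition import NMF

nmf_raw = NMF(n_components=15, random_state=0, max_iter=500)
feats_raw = nmf_raw.fit_transform(byte_counts)             # NMF on raw counts (MSE loss)

nmf_log = NMF(n_components=15, random_state=0, max_iter=500)
feats_log = nmf_log.fit_transform(np.log1p(byte_counts))   # NMF in log space, i.e. roughly RMSLE

new_features = np.hstack([feats_raw, feats_log])           # two diverse packs of 15 features each
```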
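And here is a sketch of the kind of selection pipeline described above for n-gram features, assuming `X` is a dense numpy array of n-gram counts and `y` the class labels. Using VarianceThreshold as a stand-in for removing rare features, and all thresholds and parameters, are illustrative choices, not our exact original setup.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.ensemble import RandomForestClassifier

def select_ngram_features(X, y, importance_threshold=1e-3):
    # 1) drop rare (almost constant) columns as a cheap first filter
    X1 = VarianceThreshold(threshold=1e-5).fit_transform(X)

    # 2) an L1-regularized linear SVM zeroes out weights of useless features
    l1_svm = SelectFromModel(LinearSVC(C=0.1, penalty='l1', dual=False, max_iter=5000))
    X2 = l1_svm.fit_transform(X1, y)

    # 3) keep only features whose Random Forest importance passes a threshold
    rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
    rf.fit(X2, y)
    return X2[:, rf.feature_importances_ > importance_threshold]
```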
This method actually could have led to severe overfitting, but fortunately it worked out nicely. All right, let's get to modeling. We didn't use stacking in this competition as we usually do now; it became popular slightly after this competition. We first used Random Forest everywhere, but it turned out it needed to be calibrated for log loss, so we switched to XGBoost and used it for all our modeling. Every person in the team extracted his own features and trained his own models with his own parameters, so our final solution was a combination of diverse models. We found that bagging worked quite well on this dataset. We even did the bagging in a very peculiar way: we concatenated our train set with a seven times larger set of objects sampled from the train set with replacement, and we used the resulting dataset, which is eight times larger than the original train set, to train XGBoost. Of course, we averaged about 20 models to account for randomness.

And finally, let's talk about several more tricks we used in this competition. The accuracy of the models was quite high, and we all know that the more data we have, the better. So we decided to try to use the test data for training our models. We just need some labels for the test set, right? We either used the predicted class or sampled the label from the predicted class distribution. Generated test set labels are usually called pseudo labels.

So how do we perform cross validation with pseudo labels? We split our train set into folds as we usually do while performing cross validation; in this example, we split the data into two folds. But what's different now is that, before training the model on a particular fold, we concatenate this fold with the test data. Then we switch the folds and do the same thing. You see, we train on the objects denoted with green color and predict the ones shown with red. Okay, and to get predictions for the test set, we again do a kind of cross-validation thing, like on the previous slide, but now we divide the test set into folds and concatenate the train set to each fold of the test set. In the end, we get out-of-fold predictions for the test set, and that is our submission (a sketch of this scheme appears below).

One of the crucial things to understand about this method is that sometimes we need two different models for it. This is the case when one model is very confident in its predictions, that is, the predicted probabilities are very close to 0 and 1. In this case, if we train a model, predict the test set, sample labels or take the most probable label, and retrain the same model with the same features, we get no improvement at all: we just did not introduce any new information with pseudo labeling this way. But if I ask, say, Stanislav to predict the test set with his model, and then I use his labels for the test set to train my model, that is where I will get a nice improvement. This actually becomes just another way to incorporate knowledge from two diverse models.

And the last thing that helped us a lot is per-class mixing. At the time of the competition, people usually mixed models linearly. We went further and mixed models by choosing coefficients for each class separately (a sketch appears below as well). In fact, if you think about it, it's very similar to stacking with a linear model of a special kind at the second level. But stacking became popular about a month after this competition, and what we did here is the simplest manual form of it. We published our source code online, so if you are interested, you can check it out. All right, this was an interesting competition.
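Here is a minimal sketch of cross validation with pseudo labels, assuming `X_train`, `y_train`, and `X_test` are arrays and `test_proba` holds the test set class probabilities predicted by another, diverse model; the number of folds and the XGBoost parameters are illustrative.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold

y_pseudo = test_proba.argmax(axis=1)          # or sample labels from the predicted distribution
n_classes = test_proba.shape[1]

oof = np.zeros((len(X_train), n_classes))
for tr_idx, val_idx in KFold(n_splits=2, shuffle=True, random_state=0).split(X_train):
    # train on one train fold concatenated with the pseudo-labeled test set
    X_tr = np.vstack([X_train[tr_idx], X_test])
    y_tr = np.concatenate([y_train[tr_idx], y_pseudo])
    model = xgb.XGBClassifier(n_estimators=500, learning_rate=0.05)
    model.fit(X_tr, y_tr)
    oof[val_idx] = model.predict_proba(X_train[val_idx])    # predict the held-out fold
```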
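And here is a small sketch of per-class mixing for two models, assuming `proba_a` and `proba_b` are their predicted probability matrices for the nine classes; the renormalization step and the example weights are illustrative choices, with the actual weights tuned on validation log loss.

```python
import numpy as np

def mix_per_class(proba_a, proba_b, class_weights):
    """Mix two probability matrices with a separate weight per class."""
    w = np.asarray(class_weights)                     # shape (n_classes,)
    mixed = w * proba_a + (1.0 - w) * proba_b         # per-class linear combination
    return mixed / mixed.sum(axis=1, keepdims=True)   # renormalize rows to sum to 1

# example: trust model A more on classes 0-4 and model B more on classes 5-8
# mixed = mix_per_class(proba_a, proba_b, [0.7] * 5 + [0.3] * 4)
```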
It was challenging from a technical point of view, as we needed to manipulate more than half a terabyte of data, but it was very interesting. The data was given in a raw format, and the actual challenge was to come up with nice features, and that is where we could be creative. Thank you.