Hi. In this video, I will tell you about the Microsoft Malware Classification Challenge. We participated as a team with other lecturers of this course, Mikhail Trofimov and Stanislav Semenov, and we took third place in this competition.

The plan for the presentation is the following. We will start with the data description and a little bit of EDA. We will then talk about feature extraction, then discuss how we did feature preprocessing and selection. We will see how we did the modeling, and finally we will explain several tricks that allowed us to climb higher on the leaderboard.

So let's start with the problem description. In this competition, we were provided with about half a terabyte of malicious software executables. Each executable was represented in two forms.

The first is a HEX dump. A HEX dump is just a representation of a file as a sequence of bytes; it is the raw format. A file on disk is stored as a sequence of bytes, and that is exactly what we see in the HEX dump.

The second form is a listing generated by the Interactive Disassembler, or IDA for short. IDA tries to convert the low-level sequence of bytes from the HEX dump into a sequence of assembler commands. Take a look at the bottom screenshot: on the left we see sequences of bytes from the HEX dump, and on the right we see what IDA converted them to.

The task was to classify malware files into nine families. We were provided with a train set of about 10,000 examples with known labels and a test set of similar size, and the evaluation metric was multi-class logarithmic loss. Note that the final loss in this competition was very, very low: the corresponding accuracy was more than 99.7%, which is an enormously high accuracy.

Remember that it is sometimes beneficial to examine metadata. In this competition, we were given seven zip archives with files, and the filenames looked like hash values. We did not actually find any metadata that helped our models, but it was interesting to see how the train/test split was done by the organizers. They used the first letter of the filenames for the split. On this plot, the x axis shows the first character of the filenames,
and the y axis shows the number of files whose names start with that character. The plots actually look nice, but the only information we get from them is that the train/test split is in fact random, so we cannot use them to improve our models.

In this competition we were given raw data, so we needed to extract the features ourselves. Let's see what features we extracted.

In fact, no one on our team had domain knowledge or previous experience in malware classification; we did not even know what an IDA disassembly was. So we started really simple. Remember, each executable in the dataset was represented as two files, so our first two features were the sizes of those files, or equivalently their lengths. Surprisingly, we got 88% accuracy just by using these two features.

On the plot, the x axis corresponds to the index of an object in the train set (we sorted the objects by their class here), and the y axis shows the size of the HEX dump file for every object. We can see that this feature is quite discriminative.

The simplest features we could derive from a sequence are the counts of its elements, right? This is basically what the value_counts function from pandas does, so that is what we did. We counted bytes in the HEX dump files, and that is how we got 257 features: 257 because there are 256 possible byte values plus one special symbol. We achieved almost 99% accuracy using those features.

At the same time, we started to read about the disassembly format and papers on malware classification, and so we got an idea of what features to generate. We looked into the disassembly files and used regular expressions to extract various things. Many of the features we extracted were thrown away immediately; some were really good. The first thing we counted was system calls, which are also called API calls in Windows, as we read in a paper, and the error rate we got with these features was pretty good. We also counted assembler commands like mov, push, and call, and they worked rather well.
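To make this concrete, here is a minimal sketch of how such byte counts and command counts could be extracted. The assumed file layout (.bytes lines of "address byte byte ...", .asm lines starting with a section name and a colon), the command list, and the regular expression are illustrative guesses, not our actual code.

```python
import re
from collections import Counter

def byte_count_features(bytes_path):
    """Count each byte value 00-FF plus the special '??' token in a HEX dump file
    (257 count features per executable), assuming lines like '00401000 56 8D 44 ...'."""
    counts = Counter()
    with open(bytes_path) as f:
        for line in f:
            counts.update(line.split()[1:])  # skip the address token
    tokens = [format(i, '02X') for i in range(256)] + ['??']
    return [counts.get(t, 0) for t in tokens]

def asm_count_features(asm_path, commands=('mov', 'push', 'call', 'jmp', 'xor')):
    """Count a few assembler commands and section prefixes in an IDA listing."""
    cmd_re = re.compile(r'\b(' + '|'.join(commands) + r')\b')
    cmd_counts, section_counts = Counter(), Counter()
    with open(asm_path, errors='ignore') as f:
        for line in f:
            section_counts[line.split(':', 1)[0]] += 1  # e.g. '.text', '.data', '.rsrc'
            cmd_counts.update(cmd_re.findall(line))
    return cmd_counts, section_counts
```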
We also tried to extract command sequences, such as command n-grams, and build features out of them, but we didn't manage to get a good result from those.

The best features we had were the section counts: just count the number of lines in the disassembly file that start with .text, .data, and so on. The classification accuracy with these features alone was more than 99%, and by incorporating all of the features described so far, we were able to get an error rate below 0.5%.

From our text mining experience, we knew that n-grams could be good discriminative features, so we computed bigrams. We found a paper which proposed to use 4-grams, so we computed them too, and we even went further and extracted 10-grams. Of course, we couldn't work with 10-grams directly, so we performed a feature selection, which was one of the most creative ideas in this competition; we will see later in this video how exactly we did it. Interestingly, just by applying this feature selection to the 10-grams and feeding them into XGBoost, we could get a place in the top 30.

Another feature we found interesting is entropy. We computed the entropy of small segments of the byte sequence by moving a sliding window over the sequence and computing the entropy inside each window. This gave us another sequence, one that could carry information about the encrypted parts of the executable: we expect the entropy to be high if the data is encrypted or compressed, and low if the data is structured.

The motivation is that some malware is injected into normal executables and stored there in encrypted form. When the program starts, the malware first extracts itself in the background and then executes. So entropy features could help us detect encrypted trojans inside executables, and thus detect some classes of malware.

But we got entropy sequences of variable length, and we couldn't use those sequences as features directly, right? So we generated statistics of the entropy distribution, such as the mean and median, and we also computed 20 percentiles and inverse percentiles and used them as features.
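As an illustration, a sliding-window entropy profile and its fixed-length summary could be computed roughly like the sketch below; the window size, step, and percentile grid are arbitrary values chosen for the example, not the ones we actually used.

```python
import numpy as np

def entropy_profile(byte_values, window=1024, step=256):
    """Shannon entropy of a byte sequence inside a sliding window."""
    byte_values = np.asarray(byte_values, dtype=np.uint8)
    entropies = []
    for start in range(0, len(byte_values) - window + 1, step):
        counts = np.bincount(byte_values[start:start + window], minlength=256)
        p = counts[counts > 0] / window
        entropies.append(-np.sum(p * np.log2(p)))
    return np.array(entropies)

def entropy_stats(entropies, n_percentiles=20):
    """Turn a variable-length entropy sequence into a fixed-length feature vector."""
    qs = np.linspace(0, 100, n_percentiles)
    return np.concatenate([[entropies.mean(), np.median(entropies)],
                           np.percentile(entropies, qs)])
```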
The same statistics were extracted from the entropy changes: that is, we first applied the diff function to the entropy sequence and then computed the statistics.

It was also possible to extract strings from the HEX dump by looking for printable sequences that end with a zero byte. We didn't use the strings themselves, but we computed the distribution of string lengths for each file and extracted statistics similar to those we extracted from the entropy sequences.

Okay, we are done with feature extraction, so let's see how we preprocessed the features and performed feature selection. By the moment we had generated a lot of features, we wanted to incorporate all of them into the classifier, but we could not fit the classifier efficiently once we had, say, 20,000 features. Most features were probably useless, so we tried different feature selection methods and transformation techniques.

Let's consider the byte count features. There are 257 of them, not many, but probably there is still some redundancy. We tried non-negative matrix factorization, NMF, and principal component analysis, PCA, in order to remove this redundancy.

I will remind you that both NMF and PCA try to factorize the object-feature matrix X into a product of two low-rank matrices. You can think of it as finding a small number of basis vectors in the feature space, so that every object can be approximated by a linear combination of those basis vectors, and the coefficients of this approximation can be treated as new features for each object. The only difference between NMF and PCA is that NMF requires all components of the low-rank matrices to be non-negative.

We set the number of basis vectors to 15, and here we see scatter plots of the first two extracted features, one plot for NMF and one for PCA; these are the coefficients for the most important basis vectors. We used tree-based models, and it is obvious that the NMF features were a lot better for trees. NMF works well when the non-negativity of the data is essential, and in our case we worked with counts, which are non-negative by nature.
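For reference, a minimal sketch of producing such decomposition features with scikit-learn might look as follows; the 15 components match the talk, while the random placeholder data, the init choice, and the log1p variant at the end (the trick described next) are generic assumptions rather than our exact setup.

```python
import numpy as np
from sklearn.decomposition import NMF, PCA

# X: object-feature matrix of byte counts, shape (n_files, 257), non-negative.
X = np.random.randint(0, 1000, size=(500, 257)).astype(float)  # placeholder data

nmf_features = NMF(n_components=15, init='nndsvda', max_iter=500).fit_transform(X)
pca_features = PCA(n_components=15).fit_transform(X)

# Same factorization on log-transformed counts gives another, different set of features.
nmf_log_features = NMF(n_components=15, init='nndsvda', max_iter=500).fit_transform(np.log1p(X))
```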
157 00:11:31,470 --> 00:11:32,740 Let's think what happened here. 158 00:11:34,050 --> 00:11:36,970 NMF protest originally uses MSE laws to 159 00:11:36,970 --> 00:11:39,620 measure the quality of approximation it build. 160 00:11:40,870 --> 00:11:47,420 This trick implicitly changes the laws NMF optimizes from MSE to RMSLE. 161 00:11:48,660 --> 00:11:53,650 Just recall that RMSLE is MSE in the log space. 162 00:11:53,650 --> 00:11:58,550 So now the decomposition process pays attention to different things 163 00:11:58,550 --> 00:12:02,640 due to different loss, and thus produces different features. 164 00:12:03,650 --> 00:12:07,330 We used the small pipeline to select 4-grams features. 165 00:12:07,330 --> 00:12:12,714 We removed rare features, applied linear SVM to the L1 regularization, 166 00:12:12,714 --> 00:12:15,630 as such model tends to select features. 167 00:12:15,630 --> 00:12:20,472 And after that, we thresholded random forest feature 168 00:12:20,472 --> 00:12:25,953 importances to get final feature set of only 131 feature. 169 00:12:25,953 --> 00:12:30,210 The pipeline was a little bit more complicated with 10-grams. 170 00:12:30,210 --> 00:12:34,659 First, we used the hashing to reuse the dimensionality of original features. 171 00:12:35,700 --> 00:12:39,200 We actually did it online while computing 10-grams. 172 00:12:39,200 --> 00:12:44,281 We then selected about 800 features based on L1 regularize SVM and 173 00:12:44,281 --> 00:12:46,700 the RandomForest importances. 174 00:12:46,700 --> 00:12:53,070 This is how we got about 800 features instead of 2 to the power of 28. 175 00:12:53,070 --> 00:12:57,090 But we went even further. 176 00:12:58,290 --> 00:13:00,150 This is actually the most interesting part for me. 177 00:13:01,410 --> 00:13:03,360 The main problem with feature selection, 178 00:13:03,360 --> 00:13:06,855 that I've just described, is that we've done it for 179 00:13:06,855 --> 00:13:11,670 10-grams independently of other features that we already had to that moment. 180 00:13:12,880 --> 00:13:15,730 After selection, we could end up with really good features. 181 00:13:15,730 --> 00:13:19,590 But those features could contain the same information 182 00:13:19,590 --> 00:13:22,100 that we already had in other features. 183 00:13:22,100 --> 00:13:26,590 And we actually wanted to improve our features with 10-grams. 184 00:13:26,590 --> 00:13:28,650 So here is what we did instead. 185 00:13:29,840 --> 00:13:32,170 We generated out of fold prediction for 186 00:13:32,170 --> 00:13:35,910 the train set using all the features that we had. 187 00:13:35,910 --> 00:13:41,770 We sorted the object by their true class predicted probability and try to find 188 00:13:41,770 --> 00:13:47,270 the features that would separate the most error prone object from the others. 189 00:13:48,320 --> 00:13:52,580 So actually, we use another model for it. 190 00:13:52,580 --> 00:13:57,920 We've created a new data set with error prone objects having label 1, 191 00:13:57,920 --> 00:14:00,600 and others having label 0. 192 00:14:00,600 --> 00:14:02,820 We trained random forest and 193 00:14:02,820 --> 00:14:07,440 selected 14 10-grams, well, actually hashes to be precise. 194 00:14:08,730 --> 00:14:11,620 We had a nice performance increase on the leaderboard when 195 00:14:11,620 --> 00:14:14,690 incorporating those 14 features. 196 00:14:14,690 --> 00:14:18,350 This method actually could lead us to sever overfitting, 197 00:14:18,350 --> 00:14:20,940 but it fortunately worked out nicely. 
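A schematic version of this error-driven selection could look like the sketch below; the five-fold out-of-fold setup, the 0.5 threshold for "error-prone", and all names are illustrative assumptions rather than our exact code, and y is assumed to be integer class labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def select_error_prone_10grams(base_model, X_base, X_10grams, y, n_keep=14):
    """Pick the hashed 10-gram features that best separate the objects
    our existing model gets wrong from the ones it gets right."""
    # 1. Out-of-fold predicted probabilities using only the existing features.
    oof = cross_val_predict(base_model, X_base, y, cv=5, method='predict_proba')
    true_class_proba = oof[np.arange(len(y)), y]  # y encoded as 0..n_classes-1

    # 2. Label the objects the current model struggles with (illustrative threshold).
    is_error_prone = (true_class_proba < 0.5).astype(int)

    # 3. Train a random forest on the 10-gram hashes to separate those objects,
    #    and keep the most important hashes as new features.
    rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
    rf.fit(X_10grams, is_error_prone)
    return np.argsort(rf.feature_importances_)[::-1][:n_keep]
```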
All right, let's get to modeling. We didn't use stacking in this competition as we usually do now; it became popular slightly after this competition. We first used random forest everywhere, but it turned out it needed to be calibrated for log loss, so we switched to XGBoost and used it for all our modeling.

Every person on the team extracted his own features and trained his own models with his own parameters, so our final solution was a combination of diverse models.

We found that bagging worked quite well on this dataset, and we even did the bagging in a peculiar way: we concatenated our train set with a seven times larger set of objects sampled from the train set with replacement, and used the resulting dataset, eight times larger than the original train set, to train XGBoost. Of course, we averaged about 20 such models to account for the randomness.

And finally, let's talk about several more tricks we used in this competition. The accuracy of the models was quite high, and we all know that the more data we have, the better. So we decided to try to use the test data for training our models. We just need some labels for the test set, right? We either used the predicted class or sampled a label from the predicted class distribution. Test set labels generated this way are usually called pseudo labels.

So how do we perform cross-validation with pseudo labels? We split our train set into folds as we usually do for cross-validation; in this example, we split the data into two folds. What is different now is that, before training the model on a particular fold, we concatenate that fold with the pseudo-labeled test data. Then we switch the folds and do the same thing: we train on the objects shown in green and predict the ones shown in red.

To get predictions for the test set, we again do a kind of cross-validation, like on the previous slide, but now we divide the test set into folds and concatenate the train set to each fold of the test set. In the end, we get out-of-fold predictions for the test set, and that is our submission.
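Here is a rough sketch of this pseudo-labeling scheme with a generic scikit-learn-style model. It returns hard class predictions for simplicity, whereas the competition required class probabilities, and the array names, the two-fold split, and the assumption that pseudo labels are already computed are all illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def pseudo_label_cv(model, X_train, y_train, X_test, y_test_pseudo, n_folds=2):
    """Out-of-fold predictions where every training step also sees the 'other side'
    with pseudo labels: test data when predicting train folds, and vice versa."""
    oof_train = np.zeros(len(X_train))
    oof_test = np.zeros(len(X_test))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)

    # Predict each train fold with a model trained on the other folds plus the pseudo-labeled test set.
    for tr_idx, val_idx in kf.split(X_train):
        m = clone(model).fit(np.vstack([X_train[tr_idx], X_test]),
                             np.concatenate([y_train[tr_idx], y_test_pseudo]))
        oof_train[val_idx] = m.predict(X_train[val_idx])

    # Predict each test fold with a model trained on the full train set plus the other pseudo-labeled test folds.
    for tr_idx, val_idx in kf.split(X_test):
        m = clone(model).fit(np.vstack([X_train, X_test[tr_idx]]),
                             np.concatenate([y_train, y_test_pseudo[tr_idx]]))
        oof_test[val_idx] = m.predict(X_test[val_idx])

    return oof_train, oof_test
```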
One of the crucial things to understand about this method is that sometimes we need two different models for it. This is the case when one model is very confident in its predictions, that is, when the predicted probabilities are very close to 0 and 1. In that case, if we train a model, predict the test set, sample labels or take the most probable label, and then retrain the same model with the same features, we get no improvement at all: we simply did not introduce any new information with pseudo labeling done this way. But if I ask, say, Stanislav to predict the test set with his model, and I then use his labels to train my model, that is where I get a nice improvement. This actually becomes just another way to incorporate knowledge from two diverse models.

And the last thing that helped us a lot is per-class mixing. At the time of the competition, people usually mixed models linearly. We went further and mixed models choosing the mixing coefficients for each class separately. In fact, if you think about it, this is very similar to stacking with a linear model of a special kind at the second level. But stacking became popular about a month after this competition, and what we did here is its simplest, manual form.

We published our source code online, so if you are interested you can check it out.

All right, this was an interesting competition. It was challenging from a technical point of view, as we needed to manipulate more than half a terabyte of data, but it was very interesting. The data was given in a raw format, and the actual challenge was to come up with nice features; that is where we could be creative. Thank you.