Hi. In this video, I will tell you about the Microsoft Malware Classification Challenge. We participated as a team with other lecturers of this course, Mikhail Trofimov and Stanislav Semenov, and we took third place in this competition.

The plan for the presentation is the following. We will start with the data description and a little bit of EDA. We will then talk about feature extraction, then discuss how we did feature preprocessing and selection. We will see how we did the modeling, and finally we will explain several tricks that allowed us to climb higher on the leaderboard.

So let's start with the problem description. In this competition, we were provided with about half a terabyte of malicious software executables. Each executable was represented in two forms.

The first is a HEX dump. A HEX dump is just a representation of a file as a sequence of bytes; it is the raw format. A file on disk is stored as a sequence of bytes, and that is exactly what we see in the HEX dump.

The second form is a listing generated by the Interactive Disassembler, or IDA for short. IDA tries to convert the low-level sequence of bytes from the HEX dump into a sequence of assembler commands. Take a look at the bottom screenshot: on the left we see sequences of bytes from the HEX dump, and on the right we see what IDA converted them to.

The task was to classify malware files into nine families. We were provided with a train set of about 10,000 examples with known labels and a test set of similar size, and the evaluation metric was multi-class logarithmic loss. Note that the final loss in this competition was very, very low: the corresponding accuracy was more than 99.7%, which is an enormously high accuracy.

Remember that it is sometimes beneficial to examine metadata. In this competition, we were given seven zip archives with files, and the filenames looked like hash values. We did not actually find any metadata that helped our models, but it was interesting to see how the train/test split was done by the organizers. They used the first letter of the filenames for the split. On this plot, the x axis shows the first character of the filenames,
and the y axis shows the number of files whose names start with that character. The plots actually look nice, but the only information we get from them is that the train/test split is in fact random, so we cannot use them to improve our models.

In this competition we were given raw data, so we needed to extract the features ourselves. Let's see what features we extracted.

In fact, no one on our team had domain knowledge or previous experience in malware classification; we did not even know what an IDA disassembly was. So we started really simple. Remember, each executable in the dataset was represented as two files, so our first two features were the sizes of those files, or equivalently their lengths. Surprisingly, we got 88% accuracy just by using these two features.

On the plot, the x axis corresponds to the index of an object in the train set (we sorted the objects by their class here), and the y axis shows the size of the HEX dump file for every object. We can see that this feature is quite discriminative.

The simplest features we could derive from a sequence are the counts of its elements, right? This is basically what the value_counts function from pandas does, so that is what we did. We counted bytes in the HEX dump files, and that is how we got 257 features: 257 because there are 256 possible byte values plus one special symbol. We achieved almost 99% accuracy using those features.

At the same time, we started to read about the disassembly format and papers on malware classification, and so we got an idea of what features to generate. We looked into the disassembly files and used regular expressions to extract various things. Many of the features we extracted were thrown away immediately; some were really good. The first thing we counted was system calls, which are also called API calls in Windows, as we read in a paper, and the error rate we got with these features was pretty good. We also counted assembler commands like mov, push, and call, and they worked rather well.
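To make this concrete, here is a minimal sketch of how such byte counts and command counts could be extracted. The assumed file layout (.bytes lines of "address byte byte ...", .asm lines starting with a section name and a colon), the command list, and the regular expression are illustrative guesses, not our actual code.

```python
import re
from collections import Counter

def byte_count_features(bytes_path):
    """Count each byte value 00-FF plus the special '??' token in a HEX dump file
    (257 count features per executable), assuming lines like '00401000 56 8D 44 ...'."""
    counts = Counter()
    with open(bytes_path) as f:
        for line in f:
            counts.update(line.split()[1:])  # skip the address token
    tokens = [format(i, '02X') for i in range(256)] + ['??']
    return [counts.get(t, 0) for t in tokens]

def asm_count_features(asm_path, commands=('mov', 'push', 'call', 'jmp', 'xor')):
    """Count a few assembler commands and section prefixes in an IDA listing."""
    cmd_re = re.compile(r'\b(' + '|'.join(commands) + r')\b')
    cmd_counts, section_counts = Counter(), Counter()
    with open(asm_path, errors='ignore') as f:
        for line in f:
            section_counts[line.split(':', 1)[0]] += 1  # e.g. '.text', '.data', '.rsrc'
            cmd_counts.update(cmd_re.findall(line))
    return cmd_counts, section_counts
```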
We also tried to extract command sequences, such as command n-grams, and build features out of them, but we didn't manage to get a good result from those.

The best features we had were the section counts: just count the number of lines in the disassembly file that start with .text, .data, and so on. The classification accuracy with these features alone was more than 99%, and by incorporating all of the features described so far, we were able to get an error rate below 0.5%.

From our text mining experience, we knew that n-grams could be good discriminative features, so we computed bigrams. We found a paper which proposed to use 4-grams, so we computed them too, and we even went further and extracted 10-grams. Of course, we couldn't work with 10-grams directly, so we performed a feature selection, which was one of the most creative ideas in this competition; we will see later in this video how exactly we did it. Interestingly, just by applying this feature selection to the 10-grams and feeding them into XGBoost, we could get a place in the top 30.

Another feature we found interesting is entropy. We computed the entropy of small segments of the byte sequence by moving a sliding window over the sequence and computing the entropy inside each window. This gave us another sequence, one that could carry information about the encrypted parts of the executable: we expect the entropy to be high if the data is encrypted or compressed, and low if the data is structured.

The motivation is that some malware is injected into normal executables and stored there in encrypted form. When the program starts, the malware first extracts itself in the background and then executes. So entropy features could help us detect encrypted trojans inside executables, and thus detect some classes of malware.

But we got entropy sequences of variable length, and we couldn't use those sequences as features directly, right? So we generated statistics of the entropy distribution, such as the mean and median, and we also computed 20 percentiles and inverse percentiles and used them as features.
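As an illustration, a sliding-window entropy profile and its fixed-length summary could be computed roughly like the sketch below; the window size, step, and percentile grid are arbitrary values chosen for the example, not the ones we actually used.

```python
import numpy as np

def entropy_profile(byte_values, window=1024, step=256):
    """Shannon entropy of a byte sequence inside a sliding window."""
    byte_values = np.asarray(byte_values, dtype=np.uint8)
    entropies = []
    for start in range(0, len(byte_values) - window + 1, step):
        counts = np.bincount(byte_values[start:start + window], minlength=256)
        p = counts[counts > 0] / window
        entropies.append(-np.sum(p * np.log2(p)))
    return np.array(entropies)

def entropy_stats(entropies, n_percentiles=20):
    """Turn a variable-length entropy sequence into a fixed-length feature vector."""
    qs = np.linspace(0, 100, n_percentiles)
    return np.concatenate([[entropies.mean(), np.median(entropies)],
                           np.percentile(entropies, qs)])
```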
The same statistics were extracted from the entropy changes: that is, we first applied the diff function to the entropy sequence and then computed the statistics.

It was also possible to extract strings from the HEX dump by looking for printable sequences that end with a zero byte. We didn't use the strings themselves, but we computed the distribution of string lengths for each file and extracted statistics similar to those we extracted from the entropy sequences.

Okay, we are done with feature extraction, so let's see how we preprocessed the features and performed feature selection. By the moment we had generated a lot of features, we wanted to incorporate all of them into the classifier, but we could not fit the classifier efficiently once we had, say, 20,000 features. Most features were probably useless, so we tried different feature selection methods and transformation techniques.

Let's consider the byte count features. There are 257 of them, not many, but probably there is still some redundancy. We tried non-negative matrix factorization, NMF, and principal component analysis, PCA, in order to remove this redundancy.

I will remind you that both NMF and PCA try to factorize the object-feature matrix X into a product of two low-rank matrices. You can think of it as finding a small number of basis vectors in the feature space, so that every object can be approximated by a linear combination of those basis vectors, and the coefficients of this approximation can be treated as new features for each object. The only difference between NMF and PCA is that NMF requires all components of the low-rank matrices to be non-negative.

We set the number of basis vectors to 15, and here we see scatter plots of the first two extracted features, one plot for NMF and one for PCA; these are the coefficients for the most important basis vectors. We used tree-based models, and it is obvious that the NMF features were a lot better for trees. NMF works well when the non-negativity of the data is essential, and in our case we worked with counts, which are non-negative by nature.
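For reference, a minimal sketch of producing such decomposition features with scikit-learn might look as follows; the 15 components match the talk, while the random placeholder data, the init choice, and the log1p variant at the end (the trick described next) are generic assumptions rather than our exact setup.

```python
import numpy as np
from sklearn.decomposition import NMF, PCA

# X: object-feature matrix of byte counts, shape (n_files, 257), non-negative.
X = np.random.randint(0, 1000, size=(500, 257)).astype(float)  # placeholder data

nmf_features = NMF(n_components=15, init='nndsvda', max_iter=500).fit_transform(X)
pca_features = PCA(n_components=15).fit_transform(X)

# Same factorization on log-transformed counts gives another, different set of features.
nmf_log_features = NMF(n_components=15, init='nndsvda', max_iter=500).fit_transform(np.log1p(X))
```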
157 00:11:31,470 --> 00:11:32,740 Let's think what happened here. 158 00:11:34,050 --> 00:11:36,970 NMF protest originally uses MSE laws to 159 00:11:36,970 --> 00:11:39,620 measure the quality of approximation it build. 160 00:11:40,870 --> 00:11:47,420 This trick implicitly changes the laws NMF optimizes from MSE to RMSLE. 161 00:11:48,660 --> 00:11:53,650 Just recall that RMSLE is MSE in the log space. 162 00:11:53,650 --> 00:11:58,550 So now the decomposition process pays attention to different things 163 00:11:58,550 --> 00:12:02,640 due to different loss, and thus produces different features. 164 00:12:03,650 --> 00:12:07,330 We used the small pipeline to select 4-grams features. 165 00:12:07,330 --> 00:12:12,714 We removed rare features, applied linear SVM to the L1 regularization, 166 00:12:12,714 --> 00:12:15,630 as such model tends to select features. 167 00:12:15,630 --> 00:12:20,472 And after that, we thresholded random forest feature 168 00:12:20,472 --> 00:12:25,953 importances to get final feature set of only 131 feature. 169 00:12:25,953 --> 00:12:30,210 The pipeline was a little bit more complicated with 10-grams. 170 00:12:30,210 --> 00:12:34,659 First, we used the hashing to reuse the dimensionality of original features. 171 00:12:35,700 --> 00:12:39,200 We actually did it online while computing 10-grams. 172 00:12:39,200 --> 00:12:44,281 We then selected about 800 features based on L1 regularize SVM and 173 00:12:44,281 --> 00:12:46,700 the RandomForest importances. 174 00:12:46,700 --> 00:12:53,070 This is how we got about 800 features instead of 2 to the power of 28. 175 00:12:53,070 --> 00:12:57,090 But we went even further. 176 00:12:58,290 --> 00:13:00,150 This is actually the most interesting part for me. 177 00:13:01,410 --> 00:13:03,360 The main problem with feature selection, 178 00:13:03,360 --> 00:13:06,855 that I've just described, is that we've done it for 179 00:13:06,855 --> 00:13:11,670 10-grams independently of other features that we already had to that moment. 180 00:13:12,880 --> 00:13:15,730 After selection, we could end up with really good features. 181 00:13:15,730 --> 00:13:19,590 But those features could contain the same information 182 00:13:19,590 --> 00:13:22,100 that we already had in other features. 183 00:13:22,100 --> 00:13:26,590 And we actually wanted to improve our features with 10-grams. 184 00:13:26,590 --> 00:13:28,650 So here is what we did instead. 185 00:13:29,840 --> 00:13:32,170 We generated out of fold prediction for 186 00:13:32,170 --> 00:13:35,910 the train set using all the features that we had. 187 00:13:35,910 --> 00:13:41,770 We sorted the object by their true class predicted probability and try to find 188 00:13:41,770 --> 00:13:47,270 the features that would separate the most error prone object from the others. 189 00:13:48,320 --> 00:13:52,580 So actually, we use another model for it. 190 00:13:52,580 --> 00:13:57,920 We've created a new data set with error prone objects having label 1, 191 00:13:57,920 --> 00:14:00,600 and others having label 0. 192 00:14:00,600 --> 00:14:02,820 We trained random forest and 193 00:14:02,820 --> 00:14:07,440 selected 14 10-grams, well, actually hashes to be precise. 194 00:14:08,730 --> 00:14:11,620 We had a nice performance increase on the leaderboard when 195 00:14:11,620 --> 00:14:14,690 incorporating those 14 features. 196 00:14:14,690 --> 00:14:18,350 This method actually could lead us to sever overfitting, 197 00:14:18,350 --> 00:14:20,940 but it fortunately worked out nicely. 
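A schematic version of this error-driven selection could look like the sketch below; the five-fold out-of-fold setup, the 0.5 threshold for "error-prone", and all names are illustrative assumptions rather than our exact code, and y is assumed to be integer class labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def select_error_prone_10grams(base_model, X_base, X_10grams, y, n_keep=14):
    """Pick the hashed 10-gram features that best separate the objects
    our existing model gets wrong from the ones it gets right."""
    # 1. Out-of-fold predicted probabilities using only the existing features.
    oof = cross_val_predict(base_model, X_base, y, cv=5, method='predict_proba')
    true_class_proba = oof[np.arange(len(y)), y]  # y encoded as 0..n_classes-1

    # 2. Label the objects the current model struggles with (illustrative threshold).
    is_error_prone = (true_class_proba < 0.5).astype(int)

    # 3. Train a random forest on the 10-gram hashes to separate those objects,
    #    and keep the most important hashes as new features.
    rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
    rf.fit(X_10grams, is_error_prone)
    return np.argsort(rf.feature_importances_)[::-1][:n_keep]
```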
All right, let's get to modeling. We didn't use stacking in this competition as we usually do now; it became popular slightly after this competition. We first used random forest everywhere, but it turned out it needed to be calibrated for log loss, so we switched to XGBoost and used it for all our modeling.

Every person on the team extracted his own features and trained his own models with his own parameters, so our final solution was a combination of diverse models.

We found that bagging worked quite well on this dataset, and we even did the bagging in a peculiar way: we concatenated our train set with a seven times larger set of objects sampled from the train set with replacement, and used the resulting dataset, eight times larger than the original train set, to train XGBoost. Of course, we averaged about 20 such models to account for the randomness.

And finally, let's talk about several more tricks we used in this competition. The accuracy of the models was quite high, and we all know that the more data we have, the better. So we decided to try to use the test data for training our models. We just need some labels for the test set, right? We either used the predicted class or sampled a label from the predicted class distribution. Test set labels generated this way are usually called pseudo labels.

So how do we perform cross-validation with pseudo labels? We split our train set into folds as we usually do for cross-validation; in this example, we split the data into two folds. What is different now is that, before training the model on a particular fold, we concatenate that fold with the pseudo-labeled test data. Then we switch the folds and do the same thing: we train on the objects shown in green and predict the ones shown in red.

To get predictions for the test set, we again do a kind of cross-validation, like on the previous slide, but now we divide the test set into folds and concatenate the train set to each fold of the test set. In the end, we get out-of-fold predictions for the test set, and that is our submission.
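Here is a rough sketch of this pseudo-labeling scheme with a generic scikit-learn-style model. It returns hard class predictions for simplicity, whereas the competition required class probabilities, and the array names, the two-fold split, and the assumption that pseudo labels are already computed are all illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def pseudo_label_cv(model, X_train, y_train, X_test, y_test_pseudo, n_folds=2):
    """Out-of-fold predictions where every training step also sees the 'other side'
    with pseudo labels: test data when predicting train folds, and vice versa."""
    oof_train = np.zeros(len(X_train))
    oof_test = np.zeros(len(X_test))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)

    # Predict each train fold with a model trained on the other folds plus the pseudo-labeled test set.
    for tr_idx, val_idx in kf.split(X_train):
        m = clone(model).fit(np.vstack([X_train[tr_idx], X_test]),
                             np.concatenate([y_train[tr_idx], y_test_pseudo]))
        oof_train[val_idx] = m.predict(X_train[val_idx])

    # Predict each test fold with a model trained on the full train set plus the other pseudo-labeled test folds.
    for tr_idx, val_idx in kf.split(X_test):
        m = clone(model).fit(np.vstack([X_train, X_test[tr_idx]]),
                             np.concatenate([y_train, y_test_pseudo[tr_idx]]))
        oof_test[val_idx] = m.predict(X_test[val_idx])

    return oof_train, oof_test
```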
One of the crucial things to understand about this method is that sometimes we need two different models for it. This is the case when one model is very confident in its predictions, that is, when the predicted probabilities are very close to 0 and 1. In that case, if we train a model, predict the test set, sample labels or take the most probable label, and then retrain the same model with the same features, we get no improvement at all: we simply did not introduce any new information with pseudo labeling done this way. But if I ask, say, Stanislav to predict the test set with his model, and I then use his labels to train my model, that is where I get a nice improvement. This actually becomes just another way to incorporate knowledge from two diverse models.

And the last thing that helped us a lot is per-class mixing. At the time of the competition, people usually mixed models linearly. We went further and mixed models choosing the mixing coefficients for each class separately. In fact, if you think about it, this is very similar to stacking with a linear model of a special kind at the second level. But stacking became popular about a month after this competition, and what we did here is its simplest, manual form.

We published our source code online, so if you are interested you can check it out.

All right, this was an interesting competition. It was challenging from a technical point of view, as we needed to manipulate more than half a terabyte of data, but it was very interesting. The data was given in a raw format, and the actual challenge was to come up with nice features; that is where we could be creative. Thank you.