[MUSIC] Hello everyone. In this video we will analyze the CrowdFlower competition. We, I mean me, Stanislav Semenov, and Dmitry Altukhov, participated as a team and took second place. I will explain the most important parts of our solution along with some details.

The outline of the video is as follows. First, I will tell you about the contest itself: what kind of data and metric were provided. After that we will discuss our approach, features, and tricks. And then I will summarize what is worth taking away from this competition.

The competition was organized by the CrowdFlower company. Its goal is to measure the relevance of search results given the queries and the resulting product descriptions from leading e-commerce sites; the task is to evaluate the accuracy of their search algorithms. The challenge of the competition is to predict the relevance score of a given query and product description.

On the picture we see the assessor's user interface, which shows a search query and some information about an item. Assessors were asked to give each query-product pairing a score of 1, 2, 3, or 4, with 4 indicating the item completely satisfies the search query, and 1 indicating the item doesn't match it. As the training data, we have only the median and the variance of these scores.

The data set consists of three text fields, the search query, the product title, and the product description, and two columns related to the target, the median and the variance of the scores. To ensure that the algorithm is robust enough to handle any HTML snippets in the real world, the data provided in the product description field is raw and sometimes contains information that is irrelevant to the product, for example, stray strings of text or object IDs.

To discourage hand-labeling, the test set was augmented with extra data. This additional data was ignored when the public and private leaderboard scores were calculated, and of course, competitors had no idea which objects were actually used to calculate the scores. So we have 10,000 samples in train and 20,000 in test, but it's good data: validation works well, and local scores are close enough to leaderboard scores.

To evaluate solutions, a non-standard metric was used: quadratic weighted kappa.
Let's take a closer look at it. You can find a detailed description of the metric on the competition's evaluation page. We have already discussed this metric in our course, but let me recall it. Submissions are scored based on the quadratic weighted kappa, which measures the agreement between two ratings. This metric typically varies from 0 (random agreement between raters) to 1 (complete agreement between raters). In case there is less agreement between the raters than expected by chance, the metric may go below 0.

In order to understand the metric, let's consider an example of how to calculate it. First, we need an N-by-N confusion matrix C, constructed and normalized so that the vertical axis corresponds to actual ratings and the horizontal axis to predicted ratings. In our case, N equals 4, the number of possible ratings. Second, we need an N-by-N matrix of expected ratings E, calculated under the assumption that there is no correlation between the two sets of ratings; it is the outer product of the histogram vectors of the ratings. Also we need an N-by-N matrix of weights W, calculated from the positions of its elements: w_ij = (i - j)^2 / (N - 1)^2. This formula uses the squared distance between indexes i and j, which is why the metric is called quadratic weighted kappa. Finally, kappa is calculated as one minus a fraction with the weighted sum of the confusion matrix in the numerator and the weighted sum of the expectation matrix in the denominator:

kappa = 1 - sum_ij(w_ij * C_ij) / sum_ij(w_ij * E_ij)

I want to note that kappa has properties similar both to a classification loss and to a regression loss: the more distant the predicted and true ratings are, the larger the penalty. Remember this, we will use it later in the video.
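Below is a minimal sketch of this computation (my own illustration, not the organizers' or the team's code; the function name and the toy ratings are assumptions):

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_ratings=4, min_rating=1):
    # N-by-N confusion matrix C, normalized to sum to 1:
    # rows are actual ratings, columns are predicted ratings.
    C = np.zeros((n_ratings, n_ratings))
    for t, p in zip(y_true, y_pred):
        C[t - min_rating, p - min_rating] += 1
    C /= C.sum()

    # Expected matrix E: outer product of the two rating histograms,
    # i.e. what we would see if the ratings were uncorrelated.
    E = np.outer(C.sum(axis=1), C.sum(axis=0))

    # Weight matrix W with a quadratic penalty for distant ratings.
    i, j = np.indices((n_ratings, n_ratings))
    W = (i - j) ** 2 / (n_ratings - 1) ** 2

    return 1 - (W * C).sum() / (W * E).sum()

print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0, full agreement
print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 3]))  # below 1, one near-miss
```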
Okay, let's now talk about my team's solution. It turned out to be quite complex, consisting of several parts, like extending of queries, per-query models, bumper features, and so on. I will tell you about the main points.

So let's start with text features. We have three text fields: a query, a title, and a description. You can apply all the techniques that we discussed in our course and calculate various measures of similarity, and that's exactly what we did. That is, for the query-title and query-description pairs, we calculated the number of matching words, the cosine distance between TF-IDF representations, the distance between the average word2vec vectors, and the Levenshtein distance. In fact, this is a standard set of features that should be used for similar tasks; there is nothing outstanding here, and it should be considered as a baseline (see the first sketch below).

And now we turn to the interesting things. In addition, we found it useful to use character n-grams. You can work with them in the same way as with word-based ones, if each letter is interpreted as a separate word. We used character n-grams from one to five letters. After getting a large sparse matrix of n-grams, we applied SVD to it and took only about 300 components as features. You remember, we discussed this in our course; the second sketch below gives an example.

Looking at the data, we noticed three interesting properties: queries are very short, the number of unique queries is only 261, and the queries are the same in the train and test sets. We decided to use these observations to build extended versions of queries, as follows. For each query, we get the most relevant corresponding items, those with relevance equal to 4. We join all the words from the titles of these relevant items and take the ten most popular ones. This is what we called the extended query, and it is used to build all the text features that I mentioned earlier (the third sketch below). Notice that this trick is applicable only within the competition framework, due to the fact that the queries in test are exactly the same as in train. In real life we couldn't do so, because it's unrealistic to already know the relevant results for every search query.
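As a first sketch, here is a minimal version of the baseline similarity features (my own reconstruction; the toy data is an assumption, and the word2vec and Levenshtein distances would be added analogously, for example with gensim and python-Levenshtein):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

queries = ["led tv", "digital camera"]
titles = ["samsung 40 inch led tv", "canon powershot digital camera"]

def matching_words(q, t):
    # Number of words shared by the query and the title.
    return len(set(q.lower().split()) & set(t.lower().split()))

tfidf = TfidfVectorizer().fit(queries + titles)
Q, T = tfidf.transform(queries), tfidf.transform(titles)

features = np.column_stack([
    [matching_words(q, t) for q, t in zip(queries, titles)],  # word overlap
    cosine_similarity(Q, T).diagonal(),                       # TF-IDF cosine
])
print(features)
```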
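The second sketch shows character n-gram features compressed with SVD (the team kept about 300 components; here it is 3, to fit the toy data):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["samsung 40 inch led tv", "canon powershot digital camera",
         "led tv wall mount", "camera bag"]

# Treat the string as a sequence of symbols: n-grams of 1 to 5 characters.
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 5))
X = char_vec.fit_transform(texts)      # a large sparse matrix

svd = TruncatedSVD(n_components=3)     # close to 300 in the real solution
X_dense = svd.fit_transform(X)         # compact dense features
print(X_dense.shape)
```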
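And the third sketch reconstructs the extended-query trick (the DataFrame layout and column names are my assumptions):

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({
    "query": ["led tv", "led tv", "led tv"],
    "product_title": ["samsung led tv 40", "lg led tv smart", "hdmi cable"],
    "median_relevance": [4, 4, 1],
})

def extend_query(group, top=10):
    # Join all words from titles of perfectly relevant items (score 4)
    # and keep the most popular ones.
    words = Counter()
    for title in group.loc[group["median_relevance"] == 4, "product_title"]:
        words.update(title.lower().split())
    return " ".join(w for w, _ in words.most_common(top))

extended = df.groupby("query").apply(extend_query)
print(extended["led tv"])
```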
The fact that the sets of queries in train and test are the same also gives us an opportunity to split our problem into many small subtasks. Specifically, for each query you can build a separate model that will only predict the relevance of the corresponding items. Again, in real life such tricks can't be applied, but within the competition framework it's totally fine. So, for each unique query, we take the corresponding samples, calculate word-to-word similarities and bag-of-words representations, and fit a random forest classifier. Finally, we have 261 models whose predictions were used as features in the final ensemble (a sketch follows below).

Remember that every product item was rated by several people, and we are given the median and the variance of the ratings. The variance shows whether the ratings are consistent or not. Intuitively, if the variance is large, then such an object needs to be taken into account with a smaller weight. One way to do this is to assign each item a weight depending on its variance; we used the heuristic 1 / (1 + variance) as the weight. Another method is to restore the individual ratings using the median and variance statistics. We found that, assuming there were three assessors, we can almost always unambiguously restore the original labels. For example, if we have a median equal to 3 and a variance equal to 0.66, the ratings were, of course, 2, 3, and 4; with this approach, each source sample generates three new ones. However, using these as training data took longer to train and did not improve the score, while the simple heuristic worked quite well, so we used it in the final solution.

Because of the competition metric, we need to predict a class, but the penalty for an error is not uniform: the farther the predicted class is from the true one, the larger the penalty. We decided to take this into account by adding several artificially created binary delimiters between classes as features. In other words, we try to answer by classification the questions: is it true that the target class number is greater than 1, greater than 2, and so on. We call these features bumpers, since they are kind of separators between classes. We built them in a fashion similar to how we construct out-of-fold predictions in stacking (a sketch is given below). It was very useful for the final solution.

All the mentioned features were used in an ensemble. We built several models on different subsets of features. Some are relatively simple, like SVM, and some are quite complex. You can see a part of a complex model created by me; it uses bumper features, all sorts of similarities, and query features in different combinations. Also, there are a lot of first-stage models, which are mixed by a second-stage random forest. In fact, each participant of the team made his own model, and then we simply mixed our models linearly for the final submission.
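Here is a sketch of the per-query models described above (my reconstruction; the bag-of-words features stand in for the similarities the team actually used, and in practice the predictions fed into the ensemble would be produced out-of-fold):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

def fit_per_query_models(train):
    models = {}
    for query, group in train.groupby("query"):
        # A separate bag-of-words space and classifier per unique query.
        vec = CountVectorizer().fit(group["product_title"])
        X = vec.transform(group["product_title"])
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X, group["median_relevance"])
        models[query] = (vec, clf)  # 261 such models in the real solution
    return models

train = pd.DataFrame({
    "query": ["led tv", "led tv", "led tv"],
    "product_title": ["samsung led tv", "lg led tv", "hdmi cable"],
    "median_relevance": [4, 4, 1],
})
models = fit_per_query_models(train)
```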
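And a sketch of the bumper features combined with the 1 / (1 + variance) weight heuristic (my own illustration on synthetic data; the classifier choice and fold count are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(0)
X = rng.rand(100, 5)                      # stand-in feature matrix
y = rng.randint(1, 5, size=100)           # ratings 1..4
variance = rng.rand(100)                  # per-item rating variance

weights = 1.0 / (1.0 + variance)          # down-weight inconsistent ratings

bumpers = []
for k in (1, 2, 3):                       # "is target > k?" delimiters
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    proba = cross_val_predict(
        clf, X, (y > k).astype(int), cv=5, method="predict_proba",
        fit_params={"sample_weight": weights})[:, 1]
    bumpers.append(proba)                 # out-of-fold, as in stacking
bumpers = np.column_stack(bumpers)        # new features for the next stage
```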
Let's remember that the metric, on the one hand, has properties of classification: it is necessary to predict a class. On the other hand, as in regression, the farther the prediction is from the true rating, the higher the penalty. We built models for regression, but we had to somehow turn real-valued predictions into classes. A simple approach with rounding would work poorly, so we decided to pick the thresholds: we varied them and chose the best ones based on the cross-validation score (a sketch of this search closes the transcript). This procedure gave us a very significant increase in quality. In fact, it often happens in competitions with non-standard metrics that such direct metric optimization gives a significant improvement.

So let's sum up. In this competition it was really important to use the following ideas. First, character n-grams, since they gave a significant increase in the score, and it was not obvious that you should use them. Second, the expansion of queries led to a significant increase in the score. Also, the optimization of thresholds was a crucial part of our solution. I hope you will reuse some of these tricks in your competitions. Thank you for your attention. [MUSIC]
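As referenced above, here is a minimal sketch of the threshold search (my own reconstruction; it reuses the quadratic_weighted_kappa function from the metric sketch earlier, and in practice the thresholds would be chosen by cross-validation):

```python
import numpy as np

def apply_thresholds(pred, thresholds):
    # thresholds = (t1, t2, t3): pred < t1 -> 1, t1 <= pred < t2 -> 2, ...
    return np.digitize(pred, thresholds) + 1

def search_thresholds(pred, y_true, grid=np.linspace(1.0, 4.0, 31)):
    # Brute-force search over increasing triples of cut-offs,
    # scored by quadratic weighted kappa (defined earlier).
    best, best_kappa = None, -1.0
    for t1 in grid:
        for t2 in grid[grid > t1]:
            for t3 in grid[grid > t2]:
                classes = apply_thresholds(pred, (t1, t2, t3))
                kappa = quadratic_weighted_kappa(y_true, classes)
                if kappa > best_kappa:
                    best, best_kappa = (t1, t2, t3), kappa
    return best, best_kappa
```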