[MUSIC] Hello everyone. In this video we will analyze the CrowdFlower competition. We, and by "we" I mean myself, Stanislav Semenov, and Dmitry Altukhov, participated as a team and took second place. I will explain the most important parts of our solution along with some details. The outline of the video is as follows. First, I will tell you about the contest itself: what kind of data and metric were provided. After that, we will discuss our approach, features, and tricks. And then I will summarize what is worth paying attention to in this kind of competition.

The competition was organized by the CrowdFlower company. Its goal was to measure the relevance of search results given queries and resulting product descriptions from leading e-commerce sites, in order to evaluate the accuracy of their search algorithms. The challenge of the competition is to predict the relevance score of a given query and product description. On the picture we see the assessor's user interface, which shows a search query and some information about an item. Assessors were asked to give each query-product pair a score of 1, 2, 3, or 4, with 4 indicating the item completely satisfies the search query, and 1 indicating the item doesn't match. As the training data, we have only the median and variance of these scores.

The data set consists of three text fields, the search query, the product title, and the product description, and two columns related to the target, the median and variance of the scores. To ensure that the algorithm is robust enough to handle any HTML snippets in the real world, the data provided in the product description field is raw and sometimes contains information that is not relevant to the product, for example, strings of text or object IDs. To discourage hand-labeling, the test set was augmented with extra data. This additional data was ignored when the public and private leaderboard scores were calculated, and of course, participants have no idea which objects were actually used to calculate the scores. So we have 10,000 samples in train and 20,000 in test, but it's good data. I mean, validation works well and local scores are close enough to leaderboard scores.

To evaluate solutions, a non-standard metric was used, quadratic weighted kappa. Let's take a closer look at it. You can find a detailed description of the metric on the competition's evaluation page. We have already discussed this metric in our course, but let me recall it. Submissions are scored based on the quadratic weighted kappa, which measures the agreement between two ratings. This metric typically varies from 0, random agreement between raters, to 1, complete agreement between raters. In case there is less agreement between the raters than expected by chance, the metric may go below 0.

In order to understand the metric, let's consider an example of how to calculate it. First, we need an N by N confusion matrix C, which is constructed and normalized: true ratings go along the vertical axis, predicted ratings along the horizontal one. In our case, N is equal to 4, the number of possible ratings. Second, we need an N by N matrix of expected ratings E, which is calculated assuming that there is no correlation between the rating scores; it is computed as the outer product of the histogram vectors of the two sets of ratings. Also, we need an N by N matrix of weights W, which is calculated based on its elements' positions. This particular formula for weights uses the squared distance between indices i and j, and this is why the metric is called quadratic weighted kappa. Finally, kappa is calculated as one minus a fraction with the weighted sum of the confusion matrix in the numerator and the weighted sum of the expectation matrix in the denominator.
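To make this concrete, here is a minimal NumPy sketch of the metric, assuming integer ratings from 1 to 4 as in this competition; the function and variable names are illustrative, not taken from our actual code.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_ratings=4):
    # Confusion matrix C between true and predicted ratings,
    # normalized so that its elements sum to 1.
    C = np.zeros((n_ratings, n_ratings))
    for t, p in zip(y_true, y_pred):
        C[t - 1, p - 1] += 1
    C /= C.sum()

    # Expected matrix E: outer product of the two rating histograms,
    # i.e., what we would observe if the ratings were uncorrelated.
    E = np.outer(C.sum(axis=1), C.sum(axis=0))

    # Quadratic weights: squared distance between indices i and j.
    i, j = np.indices((n_ratings, n_ratings))
    W = (i - j) ** 2

    # Kappa: one minus weighted sum of C over weighted sum of E.
    return 1.0 - (W * C).sum() / (W * E).sum()
```

For instance, `quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4])` returns 1.0, complete agreement, and the value drops as predictions drift further from the true ratings.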
I want to note that kappa has properties similar both to a classification loss and to a regression loss: the more distant the predicted and true ratings are, the larger the penalty. Remember this, we will use it later in the video.

Okay, let's now talk about our team's solution. It turned out to be quite complex, consisting of several parts, such as query extension, per-query models, bumper features, and so on. I will tell you about the main points.

So let's start with text features. We have three text fields: a query, a title, and a description. You can apply all the techniques that we discussed in our course and calculate various measures of similarity. That's exactly what we did. That is, for the query-title and query-description pairs, we calculated the number of matching words, the cosine distance between TF-IDF representations, the distance between the average word2vec vectors, and the Levenshtein distance. In fact, this is a standard set of features that should be used for similar tasks, and there is nothing outstanding here. This set should be considered as a baseline. Rough sketches of these features, and of the other tricks from this part, follow below.

And now we turn to the interesting things. In addition, we found it useful to use symbolic n-grams. You can work with them in the same way as with word-based ones, if each letter is interpreted as a separate word. We used symbolic n-grams from one letter to five letters. After getting a large sparse matrix of n-grams, we applied SVD to it and took only close to 300 components as features. You remember, we discussed this in our course; there is an example.

Looking at the data, we noticed three interesting properties: queries are very short, the number of unique queries is 261, and the queries are the same in the train and test sets. We decided to use these observations to build extended versions of queries as follows. For each query, we get the most relevant corresponding items, those with relevance equal to 4. We join all words from the titles of the relevant items and take the ten most popular words. This is what we call an extended query, and it is used to build all the text features that I mentioned earlier. Notice that this trick is applicable only within the competition framework, due to the fact that the queries in test are exactly the same as in train. In real life we couldn't do so, because it's unrealistic to know the relevant results for every search query.

The fact that the sets of queries in train and test are the same gives us an opportunity to split our problem into many small subtasks. Specifically, for each query, you can build a separate model that will only predict the relevance of the corresponding items. Again, in real life such tricks can't be applied, but within the competition framework, it's totally fine. So, for each unique query, we take the corresponding samples, calculate word-to-word similarities and bag-of-words representations, and fit a random forest classifier. Finally, we have 261 models whose predictions were used as features in the final ensemble.

Do remember that every product item was rated by several people, and we are given the median and variance of the ratings. The variance shows whether the ratings are inconsistent or not. Intuitively, if the variance is large, then such an object needs to be taken into account with a smaller weight. One way to do it is to assign each item a weight depending on the variance; we used the heuristic 1 / (1 + variance) as the weight.
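Here are the promised sketches. First, the baseline similarity features; this is a rough illustration assuming plain Python strings, a pre-trained gensim word2vec model passed in as `w2v`, and the python-Levenshtein package. In a real pipeline the TF-IDF vectorizer would be fit on the whole corpus rather than on a single pair, and the word2vec averaging assumes at least one word of each text is in the vocabulary.

```python
import numpy as np
import Levenshtein
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

def similarity_features(query, title, w2v):
    q_words, t_words = query.lower().split(), title.lower().split()
    feats = {}
    # Number of matching words between query and title.
    feats['n_matching'] = len(set(q_words) & set(t_words))
    # Cosine distance between TF-IDF representations.
    M = TfidfVectorizer().fit_transform([query, title])
    feats['tfidf_cosine'] = cosine_distances(M[0], M[1])[0, 0]
    # Distance between averaged word2vec vectors.
    q_avg = np.mean([w2v[w] for w in q_words if w in w2v], axis=0)
    t_avg = np.mean([w2v[w] for w in t_words if w in w2v], axis=0)
    feats['w2v_distance'] = float(np.linalg.norm(q_avg - t_avg))
    # Levenshtein distance between the raw strings.
    feats['levenshtein'] = Levenshtein.distance(query, title)
    return feats
```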
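The symbolic n-gram features can be sketched with scikit-learn as TF-IDF over character n-grams of length one to five, compressed by truncated SVD; the 300 components match what was said above, the other parameters are just defaults.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def char_ngram_features(texts, n_components=300):
    # Each letter is treated like a word: char n-grams of length 1..5.
    vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(1, 5))
    X = vectorizer.fit_transform(texts)  # large sparse matrix
    # Compress the sparse matrix to ~300 dense components with SVD.
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    return svd.fit_transform(X)
```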
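The extended queries can be built in a few lines of pandas; the column names (`query`, `product_title`, `median_relevance`) follow the competition data, while the function itself is my simplification.

```python
from collections import Counter

def build_extended_queries(train, top_k=10):
    extended = {}
    for query, group in train.groupby('query'):
        # Items that assessors found perfectly relevant for this query.
        relevant = group[group['median_relevance'] == 4]
        words = ' '.join(relevant['product_title'].str.lower()).split()
        # The ten most popular title words extend the original query.
        top_words = [w for w, _ in Counter(words).most_common(top_k)]
        extended[query] = query + ' ' + ' '.join(top_words)
    return extended
```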
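The per-query models can be sketched as one random forest per unique query over bag-of-words title features. This is a simplification: in a real solution the predictions for the train set must be produced out-of-fold, as in stacking, to avoid leakage into the final ensemble.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

def fit_per_query_models(train):
    models = {}
    for query, group in train.groupby('query'):  # 261 unique queries
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(group['product_title'])
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X, group['median_relevance'])
        models[query] = (vectorizer, clf)
    return models
```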
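The weighting heuristic itself is a one-liner with any sklearn-style model that accepts `sample_weight`; the `relevance_variance` column name follows the competition data, and `X_train` is a placeholder for whatever feature matrix is used.

```python
from sklearn.ensemble import RandomForestRegressor

# Items whose assessors disagreed a lot get smaller weights.
weights = 1.0 / (1.0 + train['relevance_variance'])
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, train['median_relevance'], sample_weight=weights)
```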
Another method is to restore the individual ratings using the median and variance statistics. We found that, supposing there are three assessors, we can almost always unambiguously restore the original labels. For example, if we have a median equal to 3 and a variance equal to 0.66, the scores are certainly 2, 3, and 4. With this approach, for each source sample we generated three new ones. However, using them as training data took longer to train and did not improve the score, while the simple heuristic works quite well, so we used it in the final solution.

Due to the metric chosen in the competition, we need to predict a class, but the penalty for an error is not uniform: the farther the predicted class is from the true one, the larger the penalty. We decided to take this into account by adding several artificially created binary delimiters between classes as features. In other words, we train classifiers to answer the questions: is it true that the target class number is greater than 1, greater than 2, and so on. We call these features bumpers, since they are kind of separators between classes. We built them in a fashion similar to how we constructed predictions in stacking, and they were very useful for the final solution.

All the mentioned features were used in an ensemble. We built several models on different subsets of features. Some are relatively simple, like SVM, and some look quite complex. You can see the part of the complex model created by me. It uses bumper features, all sorts of similarities, and query features in different combinations. Also, there are a lot of first-stage models, which are mixed by a second-stage random forest. In fact, each participant of the team made his own model, and then we simply mixed our models linearly for the final submission.

Let's remember that the metric, on the one hand, has some properties of a classification loss, since it's necessary to predict a class. But on the other hand, it behaves like a regression loss, since more distant answers are penalized more. We built models for regression, but we had to somehow turn real-valued predictions into classes. A simple approach with rounding would work poorly, so we decided to pick up thresholds: we iterated over candidate thresholds and chose the best ones based on the cross-validation score. This procedure gave us a very significant increase in quality. In fact, it often happens in competitions with non-standard metrics that such straightforward optimization gives a significant improvement.

So let's sum up. In this competition it was really important to use the following ideas. First, symbolic n-grams, since they gave a significant increase in the score and it was not obvious that we should use them. Second, the expansion of queries led to a significant increase in the score. Also, the optimization of thresholds was a crucial part of our solution. I hope you will reuse some of these tricks in your competitions. Thank you for your attention. [MUSIC]