[MUSIC] Hello everyone. In this video we will analyze the CrowdFlower competition. We, I mean me, Stanislav Semenov, and Dmitry Altukhov, participated as a team and took second place. I will explain the most important parts of our solution along with some details.

The outline of the video is as follows. First, I will tell you about the contest itself: what kind of data and metric were provided. After that we will discuss our approach, features, and tricks. And then I will summarize what is worth taking away from this competition.

The competition was organized by the CrowdFlower company. Its goal is to measure the relevance of search results given the queries and the resulting product descriptions from leading e-commerce sites; the task is to evaluate the accuracy of their search algorithms. The challenge of the competition is to predict the relevance score of a given query and product description.

On the picture we see the assessor's user interface, which shows a search query and some information about an item. Assessors were asked to give each query-product pairing a score of 1, 2, 3, or 4, with 4 indicating the item completely satisfies the search query, and 1 indicating the item doesn't match it. As the training data, we have only the median and the variance of these scores.

The data set consists of three text fields, the search query, the product title, and the product description, and two columns related to the target, the median and the variance of the scores. To ensure that the algorithm is robust enough to handle any HTML snippets in the real world, the data provided in the product description field is raw and sometimes contains information that is irrelevant to the product, for example, stray strings of text or object IDs.

To discourage hand-labeling, the test set was augmented with extra data. This additional data was ignored when the public and private leaderboard scores were calculated, and of course, competitors had no idea which objects were actually used to calculate the scores. So we have 10,000 samples in train and 20,000 in test, but it's good data: validation works well, and local scores are close enough to leaderboard scores.

To evaluate solutions, a non-standard metric was used: quadratic weighted kappa.
Let's take a closer look at it. You can find a detailed description of the metric on the competition's evaluation page. We have already discussed this metric in our course, but let me recall it. Submissions are scored based on the quadratic weighted kappa, which measures the agreement between two ratings. This metric typically varies from 0 (random agreement between raters) to 1 (complete agreement between raters). In case there is less agreement between the raters than expected by chance, the metric may go below 0.

In order to understand the metric, let's consider an example of how to calculate it. First, we need an N-by-N confusion matrix C, constructed and normalized so that the vertical axis corresponds to actual ratings and the horizontal axis to predicted ratings. In our case, N equals 4, the number of possible ratings. Second, we need an N-by-N matrix of expected ratings E, calculated under the assumption that there is no correlation between the two sets of ratings; it is the outer product of the histogram vectors of the ratings. Also we need an N-by-N matrix of weights W, calculated from the positions of its elements: w_ij = (i - j)^2 / (N - 1)^2. This formula uses the squared distance between indexes i and j, which is why the metric is called quadratic weighted kappa. Finally, kappa is calculated as one minus a fraction with the weighted sum of the confusion matrix in the numerator and the weighted sum of the expectation matrix in the denominator:

kappa = 1 - sum_ij(w_ij * C_ij) / sum_ij(w_ij * E_ij)

I want to note that kappa has properties similar both to a classification loss and to a regression loss: the more distant the predicted and true ratings are, the larger the penalty. Remember this, we will use it later in the video.
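Below is a minimal sketch of this computation (my own illustration, not the organizers' or the team's code; the function name and the toy ratings are assumptions):

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_ratings=4, min_rating=1):
    # N-by-N confusion matrix C, normalized to sum to 1:
    # rows are actual ratings, columns are predicted ratings.
    C = np.zeros((n_ratings, n_ratings))
    for t, p in zip(y_true, y_pred):
        C[t - min_rating, p - min_rating] += 1
    C /= C.sum()

    # Expected matrix E: outer product of the two rating histograms,
    # i.e. what we would see if the ratings were uncorrelated.
    E = np.outer(C.sum(axis=1), C.sum(axis=0))

    # Weight matrix W with a quadratic penalty for distant ratings.
    i, j = np.indices((n_ratings, n_ratings))
    W = (i - j) ** 2 / (n_ratings - 1) ** 2

    return 1 - (W * C).sum() / (W * E).sum()

print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0, full agreement
print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 3]))  # below 1, one near-miss
```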
Okay, let's now talk about my team's solution. It turned out to be quite complex, consisting of several parts, like extending of queries, per-query models, bumper features, and so on. I will tell you about the main points.

So let's start with text features. We have three text fields: a query, a title, and a description. You can apply all the techniques that we discussed in our course and calculate various measures of similarity, and that's exactly what we did. That is, for the query-title and query-description pairs, we calculated the number of matching words, the cosine distance between TF-IDF representations, the distance between the average word2vec vectors, and the Levenshtein distance. In fact, this is a standard set of features that should be used for similar tasks; there is nothing outstanding here, and it should be considered as a baseline (see the first sketch below).

And now we turn to the interesting things. In addition, we found it useful to use character n-grams. You can work with them in the same way as with word-based ones, if each letter is interpreted as a separate word. We used character n-grams from one to five letters. After getting a large sparse matrix of n-grams, we applied SVD to it and took only about 300 components as features. You remember, we discussed this in our course; the second sketch below gives an example.

Looking at the data, we noticed three interesting properties: queries are very short, the number of unique queries is only 261, and the queries are the same in the train and test sets. We decided to use these observations to build extended versions of queries, as follows. For each query, we get the most relevant corresponding items, those with relevance equal to 4. We join all the words from the titles of these relevant items and take the ten most popular ones. This is what we called the extended query, and it is used to build all the text features that I mentioned earlier (the third sketch below). Notice that this trick is applicable only within the competition framework, due to the fact that the queries in test are exactly the same as in train. In real life we couldn't do so, because it's unrealistic to already know the relevant results for every search query.
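As a first sketch, here is a minimal version of the baseline similarity features (my own reconstruction; the toy data is an assumption, and the word2vec and Levenshtein distances would be added analogously, for example with gensim and python-Levenshtein):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

queries = ["led tv", "digital camera"]
titles = ["samsung 40 inch led tv", "canon powershot digital camera"]

def matching_words(q, t):
    # Number of words shared by the query and the title.
    return len(set(q.lower().split()) & set(t.lower().split()))

tfidf = TfidfVectorizer().fit(queries + titles)
Q, T = tfidf.transform(queries), tfidf.transform(titles)

features = np.column_stack([
    [matching_words(q, t) for q, t in zip(queries, titles)],  # word overlap
    cosine_similarity(Q, T).diagonal(),                       # TF-IDF cosine
])
print(features)
```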
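The second sketch shows character n-gram features compressed with SVD (the team kept about 300 components; here it is 3, to fit the toy data):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["samsung 40 inch led tv", "canon powershot digital camera",
         "led tv wall mount", "camera bag"]

# Treat the string as a sequence of symbols: n-grams of 1 to 5 characters.
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 5))
X = char_vec.fit_transform(texts)      # a large sparse matrix

svd = TruncatedSVD(n_components=3)     # close to 300 in the real solution
X_dense = svd.fit_transform(X)         # compact dense features
print(X_dense.shape)
```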
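And the third sketch reconstructs the extended-query trick (the DataFrame layout and column names are my assumptions):

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({
    "query": ["led tv", "led tv", "led tv"],
    "product_title": ["samsung led tv 40", "lg led tv smart", "hdmi cable"],
    "median_relevance": [4, 4, 1],
})

def extend_query(group, top=10):
    # Join all words from titles of perfectly relevant items (score 4)
    # and keep the most popular ones.
    words = Counter()
    for title in group.loc[group["median_relevance"] == 4, "product_title"]:
        words.update(title.lower().split())
    return " ".join(w for w, _ in words.most_common(top))

extended = df.groupby("query").apply(extend_query)
print(extended["led tv"])
```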
The fact that the sets of queries in train and test are the same also gives us an opportunity to split our problem into many small subtasks. Specifically, for each query you can build a separate model that will only predict the relevance of the corresponding items. Again, in real life such tricks can't be applied, but within the competition framework it's totally fine. So, for each unique query, we take the corresponding samples, calculate word-to-word similarities and bag-of-words representations, and fit a random forest classifier. Finally, we have 261 models whose predictions were used as features in the final ensemble (a sketch follows below).

Remember that every product item was rated by several people, and we are given the median and the variance of the ratings. The variance shows whether the ratings are consistent or not. Intuitively, if the variance is large, then such an object needs to be taken into account with a smaller weight. One way to do this is to assign each item a weight depending on its variance; we used the heuristic 1 / (1 + variance) as the weight. Another method is to restore the individual ratings using the median and variance statistics. We found that, assuming there were three assessors, we can almost always unambiguously restore the original labels. For example, if we have a median equal to 3 and a variance equal to 0.66, the ratings were, of course, 2, 3, and 4; with this approach, each source sample generates three new ones. However, using these as training data took longer to train and did not improve the score, while the simple heuristic worked quite well, so we used it in the final solution.

Because of the competition metric, we need to predict a class, but the penalty for an error is not uniform: the farther the predicted class is from the true one, the larger the penalty. We decided to take this into account by adding several artificially created binary delimiters between classes as features. In other words, we try to answer by classification the questions: is it true that the target class number is greater than 1, greater than 2, and so on. We call these features bumpers, since they are kind of separators between classes. We built them in a fashion similar to how we construct out-of-fold predictions in stacking (a sketch is given below). It was very useful for the final solution.

All the mentioned features were used in an ensemble. We built several models on different subsets of features. Some are relatively simple, like SVM, and some are quite complex. You can see a part of a complex model created by me; it uses bumper features, all sorts of similarities, and query features in different combinations. Also, there are a lot of first-stage models, which are mixed by a second-stage random forest. In fact, each participant of the team made his own model, and then we simply mixed our models linearly for the final submission.
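Here is a sketch of the per-query models described above (my reconstruction; the bag-of-words features stand in for the similarities the team actually used, and in practice the predictions fed into the ensemble would be produced out-of-fold):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

def fit_per_query_models(train):
    models = {}
    for query, group in train.groupby("query"):
        # A separate bag-of-words space and classifier per unique query.
        vec = CountVectorizer().fit(group["product_title"])
        X = vec.transform(group["product_title"])
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X, group["median_relevance"])
        models[query] = (vec, clf)  # 261 such models in the real solution
    return models

train = pd.DataFrame({
    "query": ["led tv", "led tv", "led tv"],
    "product_title": ["samsung led tv", "lg led tv", "hdmi cable"],
    "median_relevance": [4, 4, 1],
})
models = fit_per_query_models(train)
```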
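And a sketch of the bumper features combined with the 1 / (1 + variance) weight heuristic (my own illustration on synthetic data; the classifier choice and fold count are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(0)
X = rng.rand(100, 5)                      # stand-in feature matrix
y = rng.randint(1, 5, size=100)           # ratings 1..4
variance = rng.rand(100)                  # per-item rating variance

weights = 1.0 / (1.0 + variance)          # down-weight inconsistent ratings

bumpers = []
for k in (1, 2, 3):                       # "is target > k?" delimiters
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    proba = cross_val_predict(
        clf, X, (y > k).astype(int), cv=5, method="predict_proba",
        fit_params={"sample_weight": weights})[:, 1]
    bumpers.append(proba)                 # out-of-fold, as in stacking
bumpers = np.column_stack(bumpers)        # new features for the next stage
```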
Let's remember that the metric, on the one hand, has properties of classification: it is necessary to predict a class. On the other hand, as in regression, the farther the prediction is from the true rating, the higher the penalty. We built models for regression, but we had to somehow turn real-valued predictions into classes. A simple approach with rounding would work poorly, so we decided to pick the thresholds: we varied them and chose the best ones based on the cross-validation score (a sketch of this search closes the transcript). This procedure gave us a very significant increase in quality. In fact, it often happens in competitions with non-standard metrics that such direct metric optimization gives a significant improvement.

So let's sum up. In this competition it was really important to use the following ideas. First, character n-grams, since they gave a significant increase in the score, and it was not obvious that you should use them. Second, the expansion of queries led to a significant increase in the score. Also, the optimization of thresholds was a crucial part of our solution. I hope you will reuse some of these tricks in your competitions. Thank you for your attention. [MUSIC]
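As referenced above, here is a minimal sketch of the threshold search (my own reconstruction; it reuses the quadratic_weighted_kappa function from the metric sketch earlier, and in practice the thresholds would be chosen by cross-validation):

```python
import numpy as np

def apply_thresholds(pred, thresholds):
    # thresholds = (t1, t2, t3): pred < t1 -> 1, t1 <= pred < t2 -> 2, ...
    return np.digitize(pred, thresholds) + 1

def search_thresholds(pred, y_true, grid=np.linspace(1.0, 4.0, 31)):
    # Brute-force search over increasing triples of cut-offs,
    # scored by quadratic weighted kappa (defined earlier).
    best, best_kappa = None, -1.0
    for t1 in grid:
        for t2 in grid[grid > t1]:
            for t3 in grid[grid > t2]:
                classes = apply_thresholds(pred, (t1, t2, t3))
                kappa = quadratic_weighted_kappa(y_true, classes)
                if kappa > best_kappa:
                    best, best_kappa = (t1, t2, t3), kappa
    return best, best_kappa
```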