Hi. In this lecture we will transform tokens into features, and the simplest way to do that is bag of words: let's count the occurrences of each particular token in our text. The motivation is the following: we are looking for marker words like "excellent" or "disappointed", we want to detect those words, and we want to make decisions based on the absence or presence of a particular word.

How might it work? Let's take an example of three reviews: "good movie", "not a good movie", and "did not like". Let's take all the possible words, or tokens, that we have in our documents, and for each such token let's introduce a new feature, or column, that corresponds to that particular word. That gives a pretty huge matrix of numbers, and here is how we translate a text into a vector in that matrix, that is, into a row of that matrix. Take, for example, the "good movie" review. The word "good" is present in the text, so we put a one in the column that corresponds to that word. Then comes the word "movie", and we put a one in the second column just to show that this word is also seen in our text. We don't have any other words, so all the rest are zeros. The result is a really long vector that is sparse, in the sense that it has a lot of zeros. For "not a good movie" we will have four ones and all the rest zeros, and so forth.

This process is called text vectorization, because we replace the text with a huge vector of numbers, where each dimension of that vector corresponds to a certain token in our database. You can see that this representation has some problems. The first one is that we lose word order: we can shuffle the words and the representation on the right will stay the same. That's why it's called bag of words: the words are in a bag, they are not ordered, so they can come in any order. A different problem is that the counters are not normalized.
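As a minimal sketch of this counting step, assuming scikit-learn and the three toy reviews above (the relaxed token_pattern is only there to keep one-letter words such as "a", which the default pattern would drop):

```python
from sklearn.feature_extraction.text import CountVectorizer

# The three toy reviews from the slide.
reviews = ["good movie", "not a good movie", "did not like"]

# One column per token, one row per review, each cell counting occurrences.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
bag_of_words = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())  # ['a' 'did' 'good' 'like' 'movie' 'not']
print(bag_of_words.toarray())              # the counts, mostly zeros
```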
Let's solve these two problems, and let's start with preserving some ordering. How can we do that? You can easily come to the idea of looking at token pairs, triplets, or longer combinations. This approach is also called extracting n-grams: a 1-gram is a single token, a 2-gram is a token pair, and so forth. Let's look at how it might work. We have the same three reviews, and now we not only have columns that correspond to tokens, but also columns that correspond to, say, token pairs. Our "good movie" review now translates into a vector that has a one in the column corresponding to the token pair "good movie", a one for "movie", a one for "good", and so forth. This way we preserve some local word order, and we hope that this will help us analyze the text better.
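As a rough sketch, again assuming scikit-learn, ngram_range=(1, 2) is enough to add these token-pair columns next to the single-token ones:

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["good movie", "not a good movie", "did not like"]

# ngram_range=(1, 2) builds columns for 1-grams and 2-grams, so pairs such as
# "good movie" and "did not" become features alongside the single tokens,
# preserving some local word order.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b")
bigram_vectorizer.fit(reviews)

print(bigram_vectorizer.get_feature_names_out())
```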
The problems are obvious though. This representation can have too many features. Let's say you have 100,000 words in your database; if you take pairs of those words, you get a huge number of features, and that number grows exponentially with the number of consecutive words you want to analyze. So that is a problem, and to overcome it we can remove some n-grams. Let's remove n-grams from our features based on their occurrence frequency in the documents of our corpus. For high-frequency n-grams, as well as for low-frequency n-grams, we can argue why we don't need them.

High-frequency n-grams are the ones seen in almost all of the documents; for English those would be articles, prepositions, and similar words. They are there only for grammatical structure and don't carry much meaning. These are called stop words. They won't help us discriminate between texts, and we can pretty easily remove them.

Low-frequency n-grams are a different story: there you will find typos, because people type with mistakes, and rare n-grams that are usually not seen in any other reviews. Both of them are bad for our model, because if we don't remove these tokens we will very likely overfit. Such a token would look like a very good feature to our future classifier: it could notice that only two reviews contain a particular typo, and for those two it is pretty clear whether they are positive or negative. So the model can learn dependencies that are not actually there, and we don't need those.

The last group is medium-frequency n-grams, and those are the good ones: they are not stop words, they are not typos, and they are the n-grams we actually want to look at. The problem is that there are a lot of medium-frequency n-grams. Since it proved useful to look at n-gram frequency in our corpus for filtering out bad n-grams, what if we use the same frequency to rank the medium-frequency n-grams? Maybe we can decide which medium-frequency n-gram is better and which is worse based on that frequency. The idea is the following: an n-gram with a smaller frequency can be more discriminating, because it can capture a specific issue in a review. Say somebody is not happy with the Wi-Fi and writes "Wi-Fi breaks often". The n-gram "Wi-Fi breaks" may not be very frequent in our corpus of documents, but it can highlight a specific issue that we need to look at more closely.

To utilize that idea, we first have to introduce some notions, starting with term frequency. We denote it TF(t, d): the frequency of term t in document d, where a term can be an n-gram, a token, or anything like that. There are different options for how to count that term frequency. The first and easiest one is binary: you take zero or one depending on whether the token is absent from or present in the text. A different option is to take the raw count of how many times we have seen that term in the document; let's denote that by f.
Then you can take the normalized term frequency: you look at the counts of all the terms you have seen in your document and normalize those counters so that they sum to one, which gives a kind of probability distribution over the tokens. For that, you take f and divide it by the sum of f over all the tokens in your document. One more useful scheme is logarithmic normalization: you take the logarithm of those counts, which puts your counters on a logarithmic scale, and that might help you solve the task better. So that's it for term frequency; we will use it in the following slides.

Another notion is inverse document frequency. Let's denote by capital N the total number of documents in our corpus, and let the corpus itself be capital D, the set of all our documents. Now let's look at how many documents in that corpus contain a specific term. That is the second line on the slide: the size of the set of documents in which the term appears. If you wanted the document frequency, you would take the number of documents where the term appears and divide it by the total number of documents, giving the frequency of that term across our documents. For the inverse document frequency, you swap the numerator and the denominator of that ratio and take the logarithm, and that is what we call inverse document frequency. So it is just the logarithm of N over the number of documents where the term appears.

Using these two things, term frequency and IDF, we can come up with the TF-IDF value, which needs a term, a document, and a corpus to be calculated. It works like the following: you take the term frequency of term t in document d and multiply it by the inverse document frequency of that term over all our documents. Let's see why it makes sense to do something like this. A high TF-IDF weight is reached when we have a high term frequency in the given document and a low document frequency of the term in the whole collection of documents. That is precisely the idea we wanted to follow: we wanted to find issues that are frequent in a particular review but not so frequent in the whole dataset, that is, specific issues, and we want to highlight them.
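Written out with the notation above, where f_{t,d} is the raw count of term t in document d and N = |D|, the definitions look roughly as follows; the log-scaled variant shown is one common choice:

```latex
\mathrm{tf}(t,d) =
\begin{cases}
  \mathbf{1}\{t \in d\}                     & \text{binary} \\[4pt]
  f_{t,d}                                   & \text{raw count} \\[4pt]
  \dfrac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} & \text{term frequency (normalized)} \\[4pt]
  1 + \log f_{t,d}                          & \text{log normalization (one common variant)}
\end{cases}
```

```latex
\mathrm{idf}(t, D) = \log \frac{N}{\bigl|\{\, d \in D : t \in d \,\}\bigr|},
\qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
```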
Let's see how it might work. We can replace the counters in our bag-of-words representation with TF-IDF values. We can also normalize the result row-wise, so we normalize each row; we can do that, for example, by dividing by the L2 norm, or by the sum of the values, either way works. As a result we no longer have counters but real values. Let's look at this example. The 2-gram "good movie" appears in two documents, so in our collection it is a pretty frequent 2-gram. That's why its value, 0.17, is lower than the 0.47 we get for the 2-gram "did not", which occurs in only one review; that could be a specific issue that we want to highlight, so we want a bigger value for that feature.

Let's look at how it might work in Python. You can use the scikit-learn library and import the TF-IDF vectorizer. Let me remind you that vectorization means that we replace the text with a huge vector that has a lot of zeros, but some of the values are non-zero, and those are precisely the values that correspond to the tokens seen in our text. Now let's take an example of five different texts, like small movie reviews. What we do is instantiate the TfidfVectorizer, and it has some useful arguments you can pass to it. One is min_df, which stands for minimum document frequency and is essentially a cutoff threshold for low-frequency n-grams, because we want to throw them away. We can also set max_df, a threshold on the maximum number of documents in which a token may be seen, and this is done to strip away stop words; in scikit-learn we pass that argument as a proportion of documents rather than an absolute number of documents. The last argument is ngram_range, which tells the TF-IDF vectorizer which n-grams should be used in this bag-of-words representation; in this scenario we take 1-grams and 2-grams.
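Here is a minimal sketch of that setup, assuming scikit-learn's TfidfVectorizer; the five short reviews and the exact min_df and max_df values below are illustrative rather than the ones from the slide:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Five short toy reviews (illustrative, not the exact texts from the slide).
texts = [
    "good movie",
    "not a good movie",
    "did not like",
    "i like it",
    "good one",
]

# min_df=2          : drop n-grams seen in fewer than 2 documents (typos, rare n-grams)
# max_df=0.5        : drop n-grams seen in more than half of the documents (stop-word-like terms)
# ngram_range=(1, 2): use both 1-grams and 2-grams as features
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
features = tfidf.fit_transform(texts)

print(tfidf.get_feature_names_out())  # the surviving 1-grams and 2-grams
print(features.toarray().round(2))    # TF-IDF weights, one L2-normalized row per review
```

Note that scikit-learn uses a slightly smoothed IDF formula by default, so the exact numbers differ from the plain log(N / df) definition above, but the qualitative behaviour is the same.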
If we vectorize our texts we get something like this. Not all possible 1-grams and 2-grams are there, because some of them were filtered out; you can look at the reviews and see why that happened. You can also see that we now have real values in this matrix, because those are TF-IDF values, and each row is normalized to have a norm of one.

Let's summarize. We made simple counter features in a bag-of-words manner: we replaced each text with a huge vector of counters. You can add n-grams to try to preserve some local ordering, and we will see later that this actually improves the quality of text classification. You can replace the counters with TF-IDF values, and that usually gives you a performance boost as well. In the next video, we will train our first model on top of these features.