Hi! My name is Andre, and this week we will focus on the text classification problem. The methods we will overview can be applied to text regression as well, but it will be easier to keep the text classification problem in mind. As an example of such a problem, we can take sentiment analysis. That is the problem where you have the text of a review as input, and as output you have to produce the class of sentiment. For example, it could be two classes, like positive and negative, or it could be more fine-grained, like positive, somewhat positive, neutral, somewhat negative, and negative, and so forth.

An example of a positive review is the following: "The hotel is really beautiful. Very nice and helpful service at the front desk." We read that and we understand that it is a positive review. As for a negative review: "We had problems to get the Wi-Fi working. The pool area was occupied with young party animals, so the area wasn't fun for us." It is easy for us to read this text and understand whether it has positive or negative sentiment, but for a computer that is much more difficult.

We will first start with text preprocessing, and the first thing we have to ask ourselves is: what is text? You can think of text as a sequence, and it can be a sequence of different things. It can be a sequence of characters, which is a very low-level representation of text. You can think of it as a sequence of words, or of higher-level units such as phrases, like "I don't really like", or named entities, like "the history museum" or "the museum of history". And it could be bigger chunks, like sentences or paragraphs, and so forth.

Let's start with words and define what a word is. It seems natural to think of a text as a sequence of words, and you can think of a word as a meaningful sequence of characters. In English, for example, it is usually easy to find the boundaries of words, because we can split a sentence by spaces or punctuation, and all that is left are words.
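As a quick illustration, here is a minimal sketch (not from the lecture) of splitting an English sentence on spaces and punctuation with a regular expression:

```python
import re

# Minimal sketch: treat any run of whitespace or punctuation as a word boundary.
sentence = "The hotel is really beautiful. Very nice and helpful service at the front desk."
words = [w for w in re.split(r"[\s.,;:!?]+", sentence) if w]
print(words)
# ['The', 'hotel', 'is', 'really', 'beautiful', 'Very', 'nice', 'and', ...]
```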
Let's look at an example: "Friends, Romans, Countrymen, lend me your ears;" It has commas, it has a semicolon, and it has spaces. If we split on those, we get words that are ready for further analysis, like "Friends", "Romans", "Countrymen", and so forth.

It can be more difficult in German, because German has compound words that are written without any spaces. The longest such word still in use is the one you can see on the slide, and it stands for insurance companies which provide legal protection. For the analysis of this text, it could be beneficial to split that compound word into separate words, because every one of them actually makes sense; they are just written in a form that has no spaces.

The Japanese language is a different story. It doesn't have spaces at all, but people can still read it. And if you look at the example at the end of the slide, you can actually read that sentence in English even though it has no spaces; that is not a problem for a human being.

The process of splitting an input text into meaningful chunks is called tokenization, and each chunk is called a token. You can think of a token as a useful unit for further semantic processing. It can be a word, a sentence, a paragraph, or anything else.

Let's look at the example of a simple WhitespaceTokenizer. What it does is split the input sequence on whitespace, which could be a space or any other character that is not visible. You can find this WhitespaceTokenizer in the Python library NLTK. Let's take an example text which says, "This is Andrew's text, isn't it?", and split it on whitespace. What is the problem here? You can see the different tokens that are left after this tokenization. The problem is that the last token, "it?" with a question mark, actually has the same meaning as the token "it" without the question mark. But if we try to compare them, they are different tokens, and that might not be a desirable effect.
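Here is a minimal sketch of that behavior with NLTK's WhitespaceTokenizer (output shown as a comment):

```python
from nltk.tokenize import WhitespaceTokenizer

text = "This is Andrew's text, isn't it?"

# Split only on runs of whitespace; punctuation stays glued to the words.
tokens = WhitespaceTokenizer().tokenize(text)
print(tokens)
# ['This', 'is', "Andrew's", 'text,', "isn't", 'it?']
```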
We might want to merge these two tokens because they have essentially the same meaning, just as "text," with a comma is really the same token as simply "text". So let's also try splitting by punctuation, and for that purpose there is a ready-made tokenizer in the NLTK library as well, the WordPunctTokenizer. This time we get something like this. The problem now is that the apostrophes become separate tokens, and we also get "s", "isn", and "t" as separate tokens. These tokens don't carry much meaning, because it doesn't make sense to analyze a single letter "t" or "s" on its own; it only makes sense when combined with the apostrophe or the previous word.

So we can come up with a set of rules, or heuristics, which you can find in the TreebankWordTokenizer. It uses the grammar rules of the English language to produce a tokenization that actually makes sense for further analysis, and it is very close to the perfect tokenization we want for English. Now "Andrew" and "'s" are separate tokens, with the apostrophe-s kept as one token of its own, which makes much more sense, as does splitting "isn't" into "is" and "n't", because "n't" means "not": it negates the previous token.

Let's look at a Python example (see the sketch below). You just import NLTK, you have some text, you instantiate a tokenizer such as the WhitespaceTokenizer, call tokenize, and you get the list of tokens. You can use the TreebankWordTokenizer or the WordPunctTokenizer that we reviewed previously in the same way. So it's pretty easy to do tokenization in Python.
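A minimal sketch of the comparison just described, using the three NLTK tokenizers (outputs shown as comments for orientation):

```python
from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, TreebankWordTokenizer

text = "This is Andrew's text, isn't it?"

# Split on whitespace only: punctuation stays attached to words.
print(WhitespaceTokenizer().tokenize(text))
# ['This', 'is', "Andrew's", 'text,', "isn't", 'it?']

# Split on punctuation as well: apostrophes and single letters become tokens.
print(WordPunctTokenizer().tokenize(text))
# ['This', 'is', 'Andrew', "'", 's', 'text', ',', 'isn', "'", 't', 'it', '?']

# Grammar-aware heuristics: "'s" and "n't" are kept as meaningful units.
print(TreebankWordTokenizer().tokenize(text))
# ['This', 'is', 'Andrew', "'s", 'text', ',', 'is', "n't", 'it', '?']
```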
The next thing you might want to do is token normalization. We may want the same token for different forms of a word: we have "wolf" and "wolves", and this is actually the same thing, so we want to merge these forms into a single token, "wolf". There are other examples, like "talk", "talks", or "talked"; maybe it is all about the talk, and we don't really care what ending the word has. The process of normalizing words is called stemming or lemmatization.

Stemming is the process of removing and replacing suffixes to get to the root form of the word, which is called the stem. It usually refers to heuristics that chop off suffixes or replace them. Lemmatization is another story: when people talk about lemmatization, they usually mean doing things properly, with the use of vocabularies and morphological analysis. This time we return the base or dictionary form of a word, which is known as the lemma.

Let's see examples of how this works. For stemming, there is the well-known Porter stemmer, which is the oldest stemmer for the English language. It has five heuristic phases of word reductions, applied sequentially. Let me show you examples of the phase 1 rules. They are pretty simple; you can think of them as regular expressions. When you see the combination of characters SSES at the end of a word, you replace it with SS, that is, you strip the ES at the end. It works for a word like "caresses", which is successfully reduced to "caress". Another rule is to replace IES with I. For "ponies" the rule applies as well, but what you get as a result is not a valid word: the stem "poni" ends with I, whereas the word should end with Y. So that is a problem, but the stemmer still works well in practice; it is a well-known stemmer, and you can find it in the NLTK library.

Let's see other examples of how it might work. For "feet", it produces "feet", so it doesn't know anything about irregular forms. For "wolves", it produces "wolv", which is not a valid word, but it can still be useful for analysis. "Cats" becomes "cat", and "talked" becomes "talk". So the problems are obvious: it fails on irregular forms, and it produces non-words. But in practice, that may not be much of a problem.
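To make those phase 1 rules concrete, here is a toy sketch of just the two rules mentioned above; it is not the full algorithm (the real Porter stemmer has five phases and many more conditions; in practice you would use nltk.stem.PorterStemmer):

```python
def porter_phase1_toy(word):
    """Toy illustration of two Porter phase 1 rules, not the full algorithm."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS: caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I:  ponies   -> poni (not a valid word)
    return word

print(porter_phase1_toy("caresses"))  # caress
print(porter_phase1_toy("ponies"))    # poni
```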
Another approach is lemmatization. For that purpose, you can use the WordNet lemmatizer, which uses the WordNet database to look up lemmas; it can also be found in the NLTK library. The examples are the following. This time the word "feet" is successfully reduced to its normalized form, "foot", because that form is in the database: it knows the words of the English language and all their irregular forms. "Wolves" becomes "wolf", "cats" becomes "cat", and "talked" stays "talked", so nothing changes. The problem is that the lemmatizer doesn't really handle all the forms: for nouns, the normal form, or lemma, is typically the singular form of the noun, but for verbs it is a different story, and that might actually prevent you from merging tokens that have the same meaning. The takeaway is the following: we need to try both stemming and lemmatization and choose what works best for our task.

Let's look at the Python example (see the sketch below). We just import the NLTK library, take some text, and the first thing we need to do is tokenize it; for that purpose, let's use the Treebank tokenizer, which produces a list of tokens. Now we can instantiate the Porter stemmer or the WordNet lemmatizer, call stem or lemmatize on each token of our text, and get the results that we reviewed on the previous slides. So it is pretty easy in Python and NLTK, too.
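A minimal sketch of that pipeline (the WordNet data may need a one-time download, as shown):

```python
import nltk
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data

text = "feet wolves cats talked"
tokens = TreebankWordTokenizer().tokenize(text)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming: heuristic suffix stripping, may produce non-words.
print([stemmer.stem(t) for t in tokens])
# ['feet', 'wolv', 'cat', 'talk']

# Lemmatization: dictionary lookup; by default each token is treated as a noun,
# which is why the verb "talked" is left unchanged.
print([lemmatizer.lemmatize(t) for t in tokens])
# ['foot', 'wolf', 'cat', 'talked']
```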
What you can do next is further normalize those tokens, and there are a bunch of different problems here. Let's review some of them. The first problem is capital letters: you can have "us" and "US" written in different forms. If both of these words are the pronoun, then it is safe to reduce them to the lowercase word "us". But it is a different story when you have the pronoun "us" and the country "US" in capital letters; those we need to distinguish somehow. The tricky part is that, remembering we are doing text classification, say sentiment analysis, it is easy to imagine a review written entirely in Caps Lock, where "US" could actually mean the pronoun "us" and not the country.

Luckily, we can use heuristics for the English language. We can lowercase the word at the beginning of a sentence, because every sentence starts with a capital letter, so it is very likely that it should be lowercased. We can also lowercase words that appear in titles, because in English titles are written with every word capitalized, so we can strip that. And we can leave mid-sentence words as they are, because if a word is capitalized somewhere inside the sentence, that may mean it is a name or a named entity, and we should leave it as it is. Or we can go a much harder way and use machine learning to recover the true casing, but that is out of the scope of this lecture, and it might be a harder problem than the original problem of sentiment analysis.

Another type of normalization you can apply to your tokens is normalizing acronyms, like "eta", "e.t.a.", or "ETA" written in capital letters. These are all the same thing: the acronym ETA, which stands for estimated time of arrival, and people frequently use it in their reviews, chats, and elsewhere. For this, we can write a bunch of regular expressions that capture those different representations of the same acronym and normalize them (a small sketch follows at the end of this section). But this is a pretty hard thing to do, because you must think in advance about all the possible forms and all the acronyms that you want to normalize.

So let's summarize. We can think of text as a sequence of tokens, and tokenization is the process of extracting those tokens. A token is a meaningful part, a meaningful chunk, of our text: it could be a word, a sentence, or something bigger. We can normalize those tokens using either stemming or lemmatization, and you actually have to try both to decide which works best. We can also normalize casing, acronyms, and a bunch of other things. In the next video, we will transform the extracted tokens into features for our model.
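As referenced above, here is a minimal sketch of the acronym-normalization idea; the pattern and helper name are hypothetical and cover only a few spellings of ETA:

```python
import re

# Hypothetical pattern: collapse a few spellings of "ETA" (ETA, eta, E.T.A., e.t.a.)
# into the single token "ETA". Every acronym you care about needs its own rule.
ETA_RE = re.compile(r"\be\.?t\.?a\.?(?=\W|$)", re.IGNORECASE)

def normalize_eta(text):
    return ETA_RE.sub("ETA", text)

print(normalize_eta("Their e.t.a. is 5 pm, and our eta is unknown."))
# Their ETA is 5 pm, and our ETA is unknown.
```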