Hi! My name is Andre, and this week we will focus on the text classification problem. The methods we will overview can be applied to text regression as well, but it will be easier to keep the text classification problem in mind. As an example of such a problem, we can take sentiment analysis. That is the problem where you have the text of a review as input, and as output you have to produce the class of sentiment. For example, it could be two classes, like positive and negative, or it could be more fine-grained, like positive, somewhat positive, neutral, somewhat negative, and negative, and so forth.

An example of a positive review is the following: "The hotel is really beautiful. Very nice and helpful service at the front desk." We read that and we understand that it is a positive review. As for a negative review: "We had problems to get the Wi-Fi working. The pool area was occupied with young party animals, so the area wasn't fun for us." It is easy for us to read this text and understand whether it has positive or negative sentiment, but for a computer that is much more difficult.

We will first start with text preprocessing, and the first thing we have to ask ourselves is: what is text? You can think of text as a sequence, and it can be a sequence of different things. It can be a sequence of characters, which is a very low-level representation of text. You can think of it as a sequence of words, or of higher-level units such as phrases, like "I don't really like", or named entities, like "the history museum" or "the museum of history". And it could be bigger chunks, like sentences or paragraphs, and so forth.

Let's start with words and define what a word is. It seems natural to think of a text as a sequence of words, and you can think of a word as a meaningful sequence of characters. In English, for example, it is usually easy to find the boundaries of words, because we can split a sentence by spaces or punctuation, and all that is left are words.
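As a quick illustration, here is a minimal sketch (not from the lecture) of splitting an English sentence on spaces and punctuation with a regular expression:

```python
import re

# Minimal sketch: treat any run of whitespace or punctuation as a word boundary.
sentence = "The hotel is really beautiful. Very nice and helpful service at the front desk."
words = [w for w in re.split(r"[\s.,;:!?]+", sentence) if w]
print(words)
# ['The', 'hotel', 'is', 'really', 'beautiful', 'Very', 'nice', 'and', ...]
```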
Let's look at an example: "Friends, Romans, Countrymen, lend me your ears;" It has commas, it has a semicolon, and it has spaces. If we split on those, we get words that are ready for further analysis, like "Friends", "Romans", "Countrymen", and so forth.

It can be more difficult in German, because German has compound words that are written without any spaces. The longest such word still in use is the one you can see on the slide, and it stands for insurance companies which provide legal protection. For the analysis of this text, it could be beneficial to split that compound word into separate words, because every one of them actually makes sense; they are just written in a form that has no spaces.

The Japanese language is a different story. It doesn't have spaces at all, but people can still read it. And if you look at the example at the end of the slide, you can actually read that sentence in English even though it has no spaces; that is not a problem for a human being.

The process of splitting an input text into meaningful chunks is called tokenization, and each chunk is called a token. You can think of a token as a useful unit for further semantic processing. It can be a word, a sentence, a paragraph, or anything else.

Let's look at the example of a simple WhitespaceTokenizer. What it does is split the input sequence on whitespace, which could be a space or any other character that is not visible. You can find this WhitespaceTokenizer in the Python library NLTK. Let's take an example text which says, "This is Andrew's text, isn't it?", and split it on whitespace. What is the problem here? You can see the different tokens that are left after this tokenization. The problem is that the last token, "it?" with a question mark, actually has the same meaning as the token "it" without the question mark. But if we try to compare them, they are different tokens, and that might not be a desirable effect.
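Here is a minimal sketch of that behavior with NLTK's WhitespaceTokenizer (output shown as a comment):

```python
from nltk.tokenize import WhitespaceTokenizer

text = "This is Andrew's text, isn't it?"

# Split only on runs of whitespace; punctuation stays glued to the words.
tokens = WhitespaceTokenizer().tokenize(text)
print(tokens)
# ['This', 'is', "Andrew's", 'text,', "isn't", 'it?']
```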
We might want to merge these two tokens because they have essentially the same meaning, just as "text," with a comma is really the same token as simply "text". So let's also try splitting by punctuation, and for that purpose there is a ready-made tokenizer in the NLTK library as well, the WordPunctTokenizer. This time we get something like this. The problem now is that the apostrophes become separate tokens, and we also get "s", "isn", and "t" as separate tokens. These tokens don't carry much meaning, because it doesn't make sense to analyze a single letter "t" or "s" on its own; it only makes sense when combined with the apostrophe or the previous word.

So we can come up with a set of rules, or heuristics, which you can find in the TreebankWordTokenizer. It uses the grammar rules of the English language to produce a tokenization that actually makes sense for further analysis, and it is very close to the perfect tokenization we want for English. Now "Andrew" and "'s" are separate tokens, with the apostrophe-s kept as one token of its own, which makes much more sense, as does splitting "isn't" into "is" and "n't", because "n't" means "not": it negates the previous token.

Let's look at a Python example (see the sketch below). You just import NLTK, you have some text, you instantiate a tokenizer such as the WhitespaceTokenizer, call tokenize, and you get the list of tokens. You can use the TreebankWordTokenizer or the WordPunctTokenizer that we reviewed previously in the same way. So it's pretty easy to do tokenization in Python.
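A minimal sketch of the comparison just described, using the three NLTK tokenizers (outputs shown as comments for orientation):

```python
from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, TreebankWordTokenizer

text = "This is Andrew's text, isn't it?"

# Split on whitespace only: punctuation stays attached to words.
print(WhitespaceTokenizer().tokenize(text))
# ['This', 'is', "Andrew's", 'text,', "isn't", 'it?']

# Split on punctuation as well: apostrophes and single letters become tokens.
print(WordPunctTokenizer().tokenize(text))
# ['This', 'is', 'Andrew', "'", 's', 'text', ',', 'isn', "'", 't', 'it', '?']

# Grammar-aware heuristics: "'s" and "n't" are kept as meaningful units.
print(TreebankWordTokenizer().tokenize(text))
# ['This', 'is', 'Andrew', "'s", 'text', ',', 'is', "n't", 'it', '?']
```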
The next thing you might want to do is token normalization. We may want the same token for different forms of a word: we have "wolf" and "wolves", and this is actually the same thing, so we want to merge these forms into a single token, "wolf". There are other examples, like "talk", "talks", or "talked"; maybe it is all about the talk, and we don't really care what ending the word has. The process of normalizing words is called stemming or lemmatization.

Stemming is the process of removing and replacing suffixes to get to the root form of the word, which is called the stem. It usually refers to heuristics that chop off suffixes or replace them. Lemmatization is another story: when people talk about lemmatization, they usually mean doing things properly, with the use of vocabularies and morphological analysis. This time we return the base or dictionary form of a word, which is known as the lemma.

Let's see examples of how this works. For stemming, there is the well-known Porter stemmer, which is the oldest stemmer for the English language. It has five heuristic phases of word reductions, applied sequentially. Let me show you examples of the phase 1 rules. They are pretty simple; you can think of them as regular expressions. When you see the combination of characters SSES at the end of a word, you replace it with SS, that is, you strip the ES at the end. It works for a word like "caresses", which is successfully reduced to "caress". Another rule is to replace IES with I. For "ponies" the rule applies as well, but what you get as a result is not a valid word: the stem "poni" ends with I, whereas the word should end with Y. So that is a problem, but the stemmer still works well in practice; it is a well-known stemmer, and you can find it in the NLTK library.

Let's see other examples of how it might work. For "feet", it produces "feet", so it doesn't know anything about irregular forms. For "wolves", it produces "wolv", which is not a valid word, but it can still be useful for analysis. "Cats" becomes "cat", and "talked" becomes "talk". So the problems are obvious: it fails on irregular forms, and it produces non-words. But in practice, that may not be much of a problem.
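To make those phase 1 rules concrete, here is a toy sketch of just the two rules mentioned above; it is not the full algorithm (the real Porter stemmer has five phases and many more conditions; in practice you would use nltk.stem.PorterStemmer):

```python
def porter_phase1_toy(word):
    """Toy illustration of two Porter phase 1 rules, not the full algorithm."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS: caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I:  ponies   -> poni (not a valid word)
    return word

print(porter_phase1_toy("caresses"))  # caress
print(porter_phase1_toy("ponies"))    # poni
```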
Another approach is lemmatization. For that purpose, you can use the WordNet lemmatizer, which uses the WordNet database to look up lemmas; it can also be found in the NLTK library. The examples are the following. This time the word "feet" is successfully reduced to its normalized form, "foot", because that form is in the database: it knows the words of the English language and all their irregular forms. "Wolves" becomes "wolf", "cats" becomes "cat", and "talked" stays "talked", so nothing changes. The problem is that the lemmatizer doesn't really handle all the forms: for nouns, the normal form, or lemma, is typically the singular form of the noun, but for verbs it is a different story, and that might actually prevent you from merging tokens that have the same meaning. The takeaway is the following: we need to try both stemming and lemmatization and choose what works best for our task.

Let's look at the Python example (see the sketch below). We just import the NLTK library, take some text, and the first thing we need to do is tokenize it; for that purpose, let's use the Treebank tokenizer, which produces a list of tokens. Now we can instantiate the Porter stemmer or the WordNet lemmatizer, call stem or lemmatize on each token of our text, and get the results that we reviewed on the previous slides. So it is pretty easy in Python and NLTK, too.
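A minimal sketch of that pipeline (the WordNet data may need a one-time download, as shown):

```python
import nltk
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data

text = "feet wolves cats talked"
tokens = TreebankWordTokenizer().tokenize(text)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming: heuristic suffix stripping, may produce non-words.
print([stemmer.stem(t) for t in tokens])
# ['feet', 'wolv', 'cat', 'talk']

# Lemmatization: dictionary lookup; by default each token is treated as a noun,
# which is why the verb "talked" is left unchanged.
print([lemmatizer.lemmatize(t) for t in tokens])
# ['foot', 'wolf', 'cat', 'talked']
```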
What you can do next is further normalize those tokens, and there are a bunch of different problems here. Let's review some of them. The first problem is capital letters: you can have "us" and "US" written in different forms. If both of these words are the pronoun, then it is safe to reduce them to the lowercase word "us". But it is a different story when you have the pronoun "us" and the country "US" in capital letters; those we need to distinguish somehow. The tricky part is that, remembering we are doing text classification, say sentiment analysis, it is easy to imagine a review written entirely in Caps Lock, where "US" could actually mean the pronoun "us" and not the country.

Luckily, we can use heuristics for the English language. We can lowercase the word at the beginning of a sentence, because every sentence starts with a capital letter, so it is very likely that it should be lowercased. We can also lowercase words that appear in titles, because in English titles are written with every word capitalized, so we can strip that. And we can leave mid-sentence words as they are, because if a word is capitalized somewhere inside the sentence, that may mean it is a name or a named entity, and we should leave it as it is. Or we can go a much harder way and use machine learning to recover the true casing, but that is out of the scope of this lecture, and it might be a harder problem than the original problem of sentiment analysis.

Another type of normalization you can apply to your tokens is normalizing acronyms, like "eta", "e.t.a.", or "ETA" written in capital letters. These are all the same thing: the acronym ETA, which stands for estimated time of arrival, and people frequently use it in their reviews, chats, and elsewhere. For this, we can write a bunch of regular expressions that capture those different representations of the same acronym and normalize them (a small sketch follows at the end of this section). But this is a pretty hard thing to do, because you must think in advance about all the possible forms and all the acronyms that you want to normalize.

So let's summarize. We can think of text as a sequence of tokens, and tokenization is the process of extracting those tokens. A token is a meaningful part, a meaningful chunk, of our text: it could be a word, a sentence, or something bigger. We can normalize those tokens using either stemming or lemmatization, and you actually have to try both to decide which works best. We can also normalize casing, acronyms, and a bunch of other things. In the next video, we will transform the extracted tokens into features for our model.
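As referenced above, here is a minimal sketch of the acronym-normalization idea; the pattern and helper name are hypothetical and cover only a few spellings of ETA:

```python
import re

# Hypothetical pattern: collapse a few spellings of "ETA" (ETA, eta, E.T.A., e.t.a.)
# into the single token "ETA". Every acronym you care about needs its own rule.
ETA_RE = re.compile(r"\be\.?t\.?a\.?(?=\W|$)", re.IGNORECASE)

def normalize_eta(text):
    return ETA_RE.sub("ETA", text)

print(normalize_eta("Their e.t.a. is 5 pm, and our eta is unknown."))
# Their ETA is 5 pm, and our ETA is unknown.
```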