Hi! My name is Andre, and this week we will focus on the text classification problem. Although the methods we will overview can be applied to text regression as well, it will be easier to keep the text classification problem in mind. As an example of such a problem, we can take sentiment analysis. That is the problem where you have the text of a review as the input, and as the output you have to produce the class of sentiment. For example, it could be two classes, like positive and negative; it could be more fine-grained, like positive, somewhat positive, neutral, somewhat negative, and negative; and so forth. An example of a positive review is the following: "The hotel is really beautiful. Very nice and helpful service at the front desk." We read that and we understand that it is a positive review. As for a negative review: "We had problems to get the Wi-Fi working. The pool area was occupied with young party animals, so the area wasn't fun for us." It is easy for us to read this text and understand whether it has positive or negative sentiment, but for a computer that is much more difficult.

We will first start with text preprocessing, and the first thing we have to ask ourselves is: what is text? You can think of text as a sequence, and it can be a sequence of different things. It can be a sequence of characters, which is a very low-level representation of text. You can think of it as a sequence of words, or of higher-level units: phrases like "I don't really like", or named entities like "the history museum" or "the museum of history". It could also be bigger chunks like sentences or paragraphs, and so forth.

Let's start with words and define what a word is. It seems natural to think of a text as a sequence of words, and you can think of a word as a meaningful sequence of characters. So it has some meaning, and, if we take the English language for example, it is usually easy to find the boundaries of words, because in English we can split a sentence by spaces or punctuation, and all that is left are words. Let's look at the example "Friends, Romans, Countrymen, lend me your ears;" It has commas, it has a semicolon, and it has spaces, and if we split on those, we get words that are ready for further analysis: Friends, Romans, Countrymen, and so forth.

It can be more difficult in German, because German has compound words which are written without spaces at all. The longest such word that is still in use, which you can see on the slide, stands for insurance companies which provide legal protection. For the analysis of such text, it could be beneficial to split that compound word into separate words, because every one of them actually makes sense; they are just written in a form that has no spaces. The Japanese language is a different story: it doesn't have spaces at all, but people can still read it just fine. And even the example at the end of the slide, an English sentence with the spaces removed, is still readable; that is not a problem for a human being.

The process of splitting an input text into meaningful chunks is called tokenization, and each chunk is called a token. You can think of a token as a useful unit for further semantic processing: it can be a word, a sentence, a paragraph, or anything else. Let's look at the example of a simple WhitespaceTokenizer.
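Here is a minimal sketch of that in Python, assuming the NLTK library is installed; it simply runs the tokenizer on the example sentence we will look at next.

```python
# A minimal sketch, assuming the NLTK library is installed (pip install nltk).
from nltk.tokenize import WhitespaceTokenizer

text = "This is Andrew's text, isn't it?"

# Split only on runs of whitespace; punctuation stays glued to the words.
print(WhitespaceTokenizer().tokenize(text))
# ['This', 'is', "Andrew's", 'text,', "isn't", 'it?']
```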
What it does is split the input sequence on whitespace, which could be a space or any other character that is not visible. You can find this WhitespaceTokenizer in the Python library NLTK. Let's take the example text "This is Andrew's text, isn't it?" and split it on whitespace. What is the problem here? You can see the different tokens that are left after this tokenization. The problem is that the last token, "it?", actually has the same meaning as the token "it" without the question mark, but if we try to compare them, they are different tokens, and that might not be a desirable effect. We might want to merge these two tokens because they have essentially the same meaning, and likewise "text," with a comma is really the same token as simply "text".

So let's try to also split by punctuation; for that purpose there is a tokenizer ready for you in the NLTK library as well, the WordPunctTokenizer. This time we get something like this. The problem now is that the apostrophes become separate tokens, and we also get "s", "isn", and "t" as separate tokens. These tokens don't carry much meaning, because it doesn't make sense to analyze the single letter "t" or "s" on its own; it only makes sense in combination with the apostrophe or the previous word.

So we can come up with a set of rules, or heuristics, which you can find in the TreebankWordTokenizer; it uses the grammar rules of the English language to produce a tokenization that actually makes sense for further analysis. This is very close to the perfect tokenization that we want for English. "Andrew" and "text" are now proper separate tokens, "'s" is left untouched as its own token, which makes much more sense, and the same goes for "is" and "n't", because "n't" actually means "not": it negates the previous token.

Let's look at the Python example. You just import NLTK, you take a piece of text, you instantiate a tokenizer such as the WhitespaceTokenizer, and you call tokenize to get the list of tokens, just like in the sketch above. You can use the TreebankWordTokenizer or the WordPunctTokenizer that we reviewed previously in the same way. So it is pretty easy to do tokenization in Python.

The next thing you might want to do is token normalization. We may want the same token for different forms of a word. For example, we have the words "wolf" and "wolves", and this is really the same thing, right? We want to merge these tokens into a single one, "wolf". Other examples are "talk", "talks", and "talked": maybe it is all about the talk, and we don't really care what ending the word has. The process of normalizing words is called stemming or lemmatization. Stemming is a process of removing or replacing suffixes to get to the root form of the word, which is called the stem; it usually refers to heuristics that chop off suffixes or replace them. Lemmatization is another story: when people talk about lemmatization, they usually mean doing things properly, with the use of vocabularies and morphological analysis. This time we return the base or dictionary form of a word, which is known as the lemma.

Let's see examples of how this works. For stemming, there is the well-known Porter stemmer, which is essentially the oldest stemmer for the English language. It has five heuristic phases of word reductions, applied sequentially. Let me show you an example of the phase one rules. They are pretty simple rules; you can think of them as regular expressions, as in the toy sketch below.
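As a toy illustration only (the real implementation is available in NLTK as nltk.stem.PorterStemmer), a couple of phase-one-style rules could be written as Python regular expressions like this:

```python
import re

# A toy sketch of two Porter-style phase 1 rules (illustration only,
# not the real stemmer):
#   SSES -> SS   e.g. caresses -> caress
#   IES  -> I    e.g. ponies   -> poni
rules = [
    (re.compile(r"sses$"), "ss"),
    (re.compile(r"ies$"), "i"),
]

def toy_stem(word):
    # Apply the first rule whose pattern matches the end of the word.
    for pattern, replacement in rules:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word

print(toy_stem("caresses"))  # caress
print(toy_stem("ponies"))    # poni
```

The real stemmer has many more rules and extra conditions (for example, some rules only fire when the remaining stem is long enough), which this sketch ignores.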
So when you see the combination of characters SSES at the end of a word, you replace it with SS, that is, you strip the ES at the end. That works for a word like "caresses", which is successfully reduced to "caress". Another rule is to replace IES with I. For "ponies" it works, in a way, but what you get as a result is not a valid word: the stem "poni" should end with Y, not I. So that is a problem. But it actually works well in practice, it is a well-known stemmer, and you can find it in the NLTK library as well.

Let's see more examples of how it behaves. For "feet", it produces "feet", so it doesn't know anything about irregular forms. For "wolves", it produces "wolv", which is not a valid word, but it can still be useful for analysis. "Cats" becomes "cat", and "talked" becomes "talk". So the problems are obvious: it fails on irregular forms, and it produces non-words. But that might actually not be much of a problem.

The other option is lemmatization. For that purpose, you can use the WordNet lemmatizer, which uses the WordNet database to look up lemmas. It can also be found in the NLTK library, and the examples are the following. This time the word "feet" is successfully reduced to the normalized form "foot", because we have that in the database: it knows the words of the English language and all the irregular forms. "Wolves" becomes "wolf", "cats" becomes "cat", and "talked" stays "talked", so nothing changes. The problem is that the lemmatizer doesn't really handle all forms: for nouns, the normal form, or lemma, is the singular form of that noun, but for verbs it is a different story (the NLTK lemmatizer treats every token as a noun unless you tell it the part of speech), and that might prevent you from merging tokens that have the same meaning. The takeaway is the following: we need to try both stemming and lemmatization and choose what works best for our task.

Let's look at the Python example. Here we just import the NLTK library, take a piece of text, and the first thing we need to do is tokenize it; for that purpose, let's use the TreebankWordTokenizer, which produces a list of tokens. Now we can instantiate the Porter stemmer or the WordNet lemmatizer, call stem or lemmatize on each token of our text, and get the results we reviewed on the previous slides. So it is pretty easy in Python and NLTK too; a minimal sketch is shown below.
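Here is a minimal sketch of that pipeline, assuming NLTK is installed and the WordNet data has been downloaded; the example words are the ones from the slides.

```python
import nltk
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

# The WordNet lemmatizer needs the WordNet data; fetch it once if missing.
nltk.download("wordnet", quiet=True)

tokens = TreebankWordTokenizer().tokenize("feet wolves cats talked")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops suffixes heuristically and may produce non-words.
print([stemmer.stem(t) for t in tokens])
# ['feet', 'wolv', 'cat', 'talk']

# Lemmatization looks lemmas up in WordNet (nouns by default).
print([lemmatizer.lemmatize(t) for t in tokens])
# ['foot', 'wolf', 'cat', 'talked']
```

If you pass the part of speech, for example lemmatize("talked", pos="v"), the lemmatizer will reduce the verb to "talk" as well.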
So what you can do next is further normalize those tokens, and there are a bunch of different problems here. Let's review some of them. The first problem is capital letters. You can have "us" and "US" written in different forms. If both of them are the pronoun, then it is safe to reduce them to the lowercase word "us". It is another story when you have the pronoun "us" and the country "US" in capital letters: these could be a pronoun and a country, and we need to distinguish them somehow. The tricky part is that, if you remember that we are doing text classification, say sentiment analysis, it is easy to imagine a review written entirely in Caps Lock, where "US" could actually mean the pronoun "us" and not the country. Luckily, we can use heuristics for the English language. We can lowercase the word at the beginning of a sentence, because every sentence starts with a capital letter, so it is very likely that we need to lowercase it. We can also lowercase words that appear in titles, because in English, titles are written with every word capitalized, so we can strip that.

What else we can do is leave mid-sentence words as they are: if a word is capitalized somewhere inside the sentence, maybe that means it is a name or a named entity, and we should leave it as it is. Or we can go a much harder way and use machine learning to recover the true casing, but that is out of the scope of this lecture, and it might be a harder problem than the original problem of sentiment analysis.

Another type of normalization that you can use for your tokens is normalizing acronyms like "eta", "e.t.a.", or "ETA" written in capital letters. These are all the same thing: the acronym ETA, which stands for estimated time of arrival, and people might frequently use it in their reviews or chats or anywhere else. For this, we can write a bunch of regular expressions that capture those different representations of the same acronym and normalize them. But that is a pretty hard thing to do, because you must think in advance about all the possible forms and all the acronyms that you want to normalize.

So let's summarize. We can think of text as a sequence of tokens, and tokenization is the process of extracting those tokens. A token is a meaningful part, a meaningful chunk, of our text: it could be a word, a sentence, or something bigger. We can normalize those tokens using either stemming or lemmatization, and you actually have to try both to decide which works best. We can also normalize casing, acronyms, and a bunch of other things. In the next video, we will transform the extracted tokens into features for our model.