Hi. In this lecture we will transform tokens into features, and the simplest way to do that is bag of words: we count the occurrences of each particular token in our text. The motivation is the following. We are looking for marker words like "excellent" or "disappointed", and we want to detect those words and make decisions based on the absence or presence of a particular word. Here is how it might work. Let's take an example of three reviews: "good movie", "not a good movie", "did not like". Let's take all the possible tokens that appear in our documents, and for each such token let's introduce a new feature, a column, that corresponds to that particular word. That gives us a pretty huge matrix of numbers, and each text is translated into a vector, a row in that matrix. Take, for example, the "good movie" review. The word "good" is present in the text, so we put a one in the column that corresponds to that word; then comes the word "movie", and we put a one in the second column to show that this word is also seen in the text. We don't have any other words, so all the rest are zeros. The result is a really long vector which is sparse, in the sense that it has a lot of zeros. For "not a good movie" we will have four ones and all the rest zeros, and so forth. This process is called text vectorization, because we replace the text with a huge vector of numbers, and each dimension of that vector corresponds to a certain token in our vocabulary.

You can see that this representation has some problems. The first one is that we lose word order: we can shuffle the words, and the representation on the right stays the same. That is why it is called a bag of words: the words are not ordered, so they can come up in any order. The second problem is that the counters are not normalized. Let's solve these two problems, starting with preserving some ordering. How can we do that? You can easily come to the idea of looking at token pairs, triplets, or longer combinations. This approach is also called extracting n-grams: a 1-gram stands for a single token, a 2-gram for a token pair, and so forth. Let's see how it might work. We have the same three reviews, but now we don't only have columns that correspond to tokens, we also have columns that correspond to, let's say, token pairs. Our "good movie" review now translates into a vector which has a one in the column corresponding to the token pair "good movie", a one for "movie", a one for "good", and so forth. This way we preserve some local word order, and we hope that it will help us analyze the text better. The problem is obvious though: this representation can have too many features. If you have 100,000 words in your vocabulary and you take all pairs of those words, you end up with a huge number of features that grows exponentially with the length of the n-grams you want to analyze. To overcome that problem, we can remove some n-grams from the feature set based on their occurrence frequency in the documents of our corpus.
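To make the counting concrete, here is a minimal sketch of the bag-of-words representation with 1-grams and 2-grams for the three toy reviews above, using scikit-learn's CountVectorizer. The exact columns and their ordering depend on the vectorizer's defaults, so treat this as an illustration rather than a reproduction of the slide.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The three toy reviews from the example above.
texts = ["good movie", "not a good movie", "did not like"]

# ngram_range=(1, 2) asks for both single tokens (1-grams) and token pairs (2-grams).
# Note: CountVectorizer's default tokenizer drops single-character tokens such as "a".
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(texts)   # sparse matrix: one row per review

print(vectorizer.get_feature_names_out())  # the columns: all 1-grams and 2-grams seen (scikit-learn >= 1.0)
print(counts.toarray())                    # raw counts, one row per review
```

Each row of `counts` is exactly the long, sparse vector described above: mostly zeros, with counts in the columns of the n-grams that occur in that review.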
Why don't we need high-frequency and low-frequency n-grams? High-frequency n-grams are seen in almost all of the documents; for English those would be articles, prepositions, and the like. They are there for grammatical structure and don't carry much meaning. These are called stop words; they won't help us discriminate between texts, and we can pretty easily remove them. Low-frequency n-grams are a different story: if you look at them, you find typos, because people type with mistakes, and rare n-grams that are usually not seen in any other review. Both of these are bad for our model, because if we don't remove such tokens we will very likely overfit: such an n-gram would look like a very good feature to our future classifier, which can learn that, okay, we had only two reviews with this particular typo, and it is pretty clear whether they were positive or negative. So the classifier can learn dependencies that are not actually there, and we don't need that. Finally, there are medium-frequency n-grams, and those are the really good ones: they are not stop words and not typos, and those are the ones we want to look at. The problem is that there are a lot of medium-frequency n-grams.

It proved useful to look at n-gram frequency in our corpus for filtering out bad n-grams. What if we use the same frequency for ranking the medium-frequency n-grams? Maybe we can decide which medium-frequency n-gram is better and which is worse based on that frequency. The idea is the following: an n-gram with a smaller frequency can be more discriminating, because it can capture a specific issue in a review. Say somebody is not happy with the Wi-Fi and writes "Wi-Fi breaks often". The n-gram "Wi-Fi breaks" may not be very frequent in our corpus of documents, but it can highlight a specific issue that we need to look at more closely.

To utilize that idea, we first have to introduce some notions, starting with term frequency. We will denote it TF(t, d): the frequency of term t, where a term is a token, an n-gram, or anything like that, in a document d. There are different options for how you can compute that term frequency. The first and easiest one is binary: you take zero or one depending on whether the token is absent from or present in the text. A different option is to take just the raw count of how many times we have seen the term in the document; let's denote that count by f. Then you can take the normalized term frequency: you look at the counts of all the terms seen in your document and normalize those counters to sum to one, so you get a kind of probability distribution over the tokens. For that, you take f and divide it by the sum of the f's for all the tokens in your document. One more useful scheme is logarithmic normalization: you take the logarithm of the counts, which puts your counters on a logarithmic scale and might help you solve the task better. That's it for term frequency; we will use it in the following slides.
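As a quick illustration, here is a minimal sketch of the term-frequency weighting schemes just described, computed for one toy document. The logarithmic variant shown here, 1 + log f for nonzero counts, is one common choice and is an assumption; the slide may use a slightly different formula.

```python
import math
from collections import Counter

# A toy document, already tokenized.
doc = ["good", "movie", "good", "plot"]

counts = Counter(doc)                  # raw counts f(t, d)
total = sum(counts.values())

for term, f in counts.items():
    binary = 1 if f > 0 else 0         # binary: present / absent
    tf = f / total                     # counts normalized to sum to one over the document
    log_tf = 1 + math.log(f)           # logarithmic normalization (one common variant)
    print(term, binary, f, round(tf, 3), round(log_tf, 3))
```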
Another notion we need is inverse document frequency. Let's denote by capital N the total number of documents in our corpus, and by capital D the corpus itself, the set of all our documents. Now let's look at how many documents in that corpus contain a specific term; that is the size of the set of documents where the term appears. If you think about document frequency, you would take that number and divide it by the total number of documents, and you get the frequency of the term across our documents. For the inverse document frequency, you swap the numerator and the denominator of that ratio and take a logarithm, and that is what we call inverse document frequency: IDF(t, D) is the logarithm of N divided by the number of documents where the term appears.

Using these two things, term frequency and inverse document frequency, we can come up with the TF-IDF value, which needs a term, a document, and a corpus to be calculated. It works like the following: you take the term frequency of the term t in the document d and multiply it by the inverse document frequency of that term over all our documents, so TF-IDF(t, d, D) = TF(t, d) times IDF(t, D). Let's see why it makes sense to do something like this. A high TF-IDF weight is reached when we have a high term frequency in the given document and a low document frequency of the term in the whole collection of documents. That is precisely the idea we wanted to follow: we want to find issues that are frequent in a review but not so frequent in the whole dataset, specific issues, and we want to highlight them.

Let's see how it might work. We can replace the counters in our bag-of-words representation with TF-IDF values. We can also normalize the result row-wise, so we normalize each row; we can do that, for example, by dividing by the L2 norm or by the sum of the values, either way works. What we get as a result is not counters but real values. Let's look at this example. The 2-gram "good movie" appears in two documents, so in our collection it is a pretty frequent 2-gram; that is why its value, 0.17, is lower than the 0.47 we get for the "did not" 2-gram. That 2-gram appears in only one review, so it could point to a specific issue, and we want to highlight it with a bigger value for that feature.

Let's look at how this might work in Python. You can use the scikit-learn library and import the TfidfVectorizer. Let me remind you that vectorization means we replace the text with a huge vector that has a lot of zeros, but some of the values are non-zero, and those are precisely the values that correspond to the tokens seen in our text. Now let's take an example of five small movie reviews. We instantiate the TfidfVectorizer, and it has some useful arguments that you can pass to it, like min_df, which stands for minimum document frequency and is essentially a cutoff threshold for low-frequency n-grams, because we want to throw them away. We can also threshold on the maximum number of documents where we have seen a token, with max_df, and this is done for stripping away stop words; in scikit-learn we pass that argument as a ratio of documents rather than an absolute number of documents. The last argument is ngram_range, which tells the TfidfVectorizer which n-grams should be used in the bag-of-words representation; in this scenario we take 1-grams and 2-grams. If we vectorize our texts, we get something like this. Not all possible 1-grams and 2-grams are there, because some of them are filtered out.
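Here is a minimal sketch of that TfidfVectorizer usage. The five reviews and the specific min_df and max_df values are stand-ins chosen for illustration, since the exact texts and numbers from the slide are not given in the transcript; min_df, max_df, and ngram_range are the real TfidfVectorizer arguments described above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Five small movie reviews (placeholder texts, not necessarily those on the slide).
texts = [
    "good movie",
    "not a good movie",
    "did not like",
    "i like it",
    "good one",
]

# min_df=2: drop n-grams seen in fewer than 2 documents (low-frequency cutoff).
# max_df=0.5: drop n-grams seen in more than half of the documents (stop-word-like).
# ngram_range=(1, 2): use 1-grams and 2-grams.
# These particular threshold values are illustrative choices.
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
features = tfidf.fit_transform(texts)

# Each row is L2-normalized by default (norm='l2').
print(pd.DataFrame(features.toarray(), columns=tfidf.get_feature_names_out()))
```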
You can look at the reviews and see why that happened. You can also see that we have real values in this matrix, because those are actually TF-IDF values, and each row is normalized to have a norm of one. Let's summarize. We have made simple counter features in a bag-of-words manner, replacing each text with a huge vector of counters. You can add n-grams to try to preserve some local ordering, and we will see further on that it actually improves the quality of text classification. You can replace the counters with TF-IDF values, and that usually gives you a performance boost as well. In the next video, we will train our first model on top of these features.