Hi. In this lecture we will transform tokens into features, and the simplest way to do that is bag of words: let's count the occurrences of each particular token in our text. The motivation is the following: we are looking for marker words like "excellent" or "disappointed", we want to detect those words, and we want to make decisions based on the absence or presence of a particular word.

How might it work? Let's take an example of three reviews: "good movie", "not a good movie", and "did not like". Let's take all the possible words, or tokens, that we have in our documents, and for each such token let's introduce a new feature, or column, that corresponds to that particular word. That gives a pretty huge matrix of numbers, and here is how we translate a text into a vector in that matrix, that is, into a row of that matrix. Take, for example, the "good movie" review. The word "good" is present in the text, so we put a one in the column that corresponds to that word. Then comes the word "movie", and we put a one in the second column just to show that this word is also seen in our text. We don't have any other words, so all the rest are zeros. The result is a really long vector that is sparse, in the sense that it has a lot of zeros. For "not a good movie" we will have four ones and all the rest zeros, and so forth.

This process is called text vectorization, because we replace the text with a huge vector of numbers, where each dimension of that vector corresponds to a certain token in our database. You can see that this representation has some problems. The first one is that we lose word order: we can shuffle the words and the representation on the right will stay the same. That's why it's called bag of words: the words are in a bag, they are not ordered, so they can come in any order. A different problem is that the counters are not normalized.
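As a minimal sketch of this counting step, assuming scikit-learn and the three toy reviews above (the relaxed token_pattern is only there to keep one-letter words such as "a", which the default pattern would drop):

```python
from sklearn.feature_extraction.text import CountVectorizer

# The three toy reviews from the slide.
reviews = ["good movie", "not a good movie", "did not like"]

# One column per token, one row per review, each cell counting occurrences.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
bag_of_words = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())  # ['a' 'did' 'good' 'like' 'movie' 'not']
print(bag_of_words.toarray())              # the counts, mostly zeros
```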
Let's solve these two problems, and let's start with preserving some ordering. How can we do that? You can easily come to the idea of looking at token pairs, triplets, or longer combinations. This approach is also called extracting n-grams: a 1-gram is a single token, a 2-gram is a token pair, and so forth. Let's look at how it might work. We have the same three reviews, and now we not only have columns that correspond to tokens, but also columns that correspond to, say, token pairs. Our "good movie" review now translates into a vector that has a one in the column corresponding to the token pair "good movie", a one for "movie", a one for "good", and so forth. This way we preserve some local word order, and we hope that this will help us analyze the text better.
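As a rough sketch, again assuming scikit-learn, ngram_range=(1, 2) is enough to add these token-pair columns next to the single-token ones:

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["good movie", "not a good movie", "did not like"]

# ngram_range=(1, 2) builds columns for 1-grams and 2-grams, so pairs such as
# "good movie" and "did not" become features alongside the single tokens,
# preserving some local word order.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b")
bigram_vectorizer.fit(reviews)

print(bigram_vectorizer.get_feature_names_out())
```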
The problems are obvious though. This representation can have too many features. Let's say you have 100,000 words in your database; if you take pairs of those words, you get a huge number of features, and that number grows exponentially with the number of consecutive words you want to analyze. So that is a problem, and to overcome it we can remove some n-grams. Let's remove n-grams from our features based on their occurrence frequency in the documents of our corpus. For high-frequency n-grams, as well as for low-frequency n-grams, we can argue why we don't need them.

High-frequency n-grams are the ones seen in almost all of the documents; for English those would be articles, prepositions, and similar words. They are there only for grammatical structure and don't carry much meaning. These are called stop words. They won't help us discriminate between texts, and we can pretty easily remove them.

Low-frequency n-grams are a different story: there you will find typos, because people type with mistakes, and rare n-grams that are usually not seen in any other reviews. Both of them are bad for our model, because if we don't remove these tokens we will very likely overfit. Such a token would look like a very good feature to our future classifier: it could notice that only two reviews contain a particular typo, and for those two it is pretty clear whether they are positive or negative. So the model can learn dependencies that are not actually there, and we don't need those.

The last group is medium-frequency n-grams, and those are the good ones: they are not stop words, they are not typos, and they are the n-grams we actually want to look at. The problem is that there are a lot of medium-frequency n-grams. Since it proved useful to look at n-gram frequency in our corpus for filtering out bad n-grams, what if we use the same frequency to rank the medium-frequency n-grams? Maybe we can decide which medium-frequency n-gram is better and which is worse based on that frequency. The idea is the following: an n-gram with a smaller frequency can be more discriminating, because it can capture a specific issue in a review. Say somebody is not happy with the Wi-Fi and writes "Wi-Fi breaks often". The n-gram "Wi-Fi breaks" may not be very frequent in our corpus of documents, but it can highlight a specific issue that we need to look at more closely.

To utilize that idea, we first have to introduce some notions, starting with term frequency. We denote it TF(t, d): the frequency of term t in document d, where a term can be an n-gram, a token, or anything like that. There are different options for how to count that term frequency. The first and easiest one is binary: you take zero or one depending on whether the token is absent from or present in the text. A different option is to take the raw count of how many times we have seen that term in the document; let's denote that by f.
Then you can take the normalized term frequency: you look at the counts of all the terms you have seen in your document and normalize those counters so that they sum to one, which gives a kind of probability distribution over the tokens. For that, you take f and divide it by the sum of f over all the tokens in your document. One more useful scheme is logarithmic normalization: you take the logarithm of those counts, which puts your counters on a logarithmic scale, and that might help you solve the task better. So that's it for term frequency; we will use it in the following slides.

Another notion is inverse document frequency. Let's denote by capital N the total number of documents in our corpus, and let the corpus itself be capital D, the set of all our documents. Now let's look at how many documents in that corpus contain a specific term. That is the second line on the slide: the size of the set of documents in which the term appears. If you wanted the document frequency, you would take the number of documents where the term appears and divide it by the total number of documents, giving the frequency of that term across our documents. For the inverse document frequency, you swap the numerator and the denominator of that ratio and take the logarithm, and that is what we call inverse document frequency. So it is just the logarithm of N over the number of documents where the term appears.

Using these two things, term frequency and IDF, we can come up with the TF-IDF value, which needs a term, a document, and a corpus to be calculated. It works like the following: you take the term frequency of term t in document d and multiply it by the inverse document frequency of that term over all our documents. Let's see why it makes sense to do something like this. A high TF-IDF weight is reached when we have a high term frequency in the given document and a low document frequency of the term in the whole collection of documents. That is precisely the idea we wanted to follow: we wanted to find issues that are frequent in a particular review but not so frequent in the whole dataset, that is, specific issues, and we want to highlight them.
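Written out with the notation above, where f_{t,d} is the raw count of term t in document d and N = |D|, the definitions look roughly as follows; the log-scaled variant shown is one common choice:

```latex
\mathrm{tf}(t,d) =
\begin{cases}
  \mathbf{1}\{t \in d\}                     & \text{binary} \\[4pt]
  f_{t,d}                                   & \text{raw count} \\[4pt]
  \dfrac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} & \text{term frequency (normalized)} \\[4pt]
  1 + \log f_{t,d}                          & \text{log normalization (one common variant)}
\end{cases}
```

```latex
\mathrm{idf}(t, D) = \log \frac{N}{\bigl|\{\, d \in D : t \in d \,\}\bigr|},
\qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
```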
Let's see how it might work. We can replace the counters in our bag-of-words representation with TF-IDF values. We can also normalize the result row-wise, so we normalize each row; we can do that, for example, by dividing by the L2 norm, or by the sum of the values, either way works. As a result we no longer have counters but real values. Let's look at this example. The 2-gram "good movie" appears in two documents, so in our collection it is a pretty frequent 2-gram. That's why its value, 0.17, is lower than the 0.47 we get for the 2-gram "did not", which occurs in only one review; that could be a specific issue that we want to highlight, so we want a bigger value for that feature.

Let's look at how it might work in Python. You can use the scikit-learn library and import the TF-IDF vectorizer. Let me remind you that vectorization means that we replace the text with a huge vector that has a lot of zeros, but some of the values are non-zero, and those are precisely the values that correspond to the tokens seen in our text. Now let's take an example of five different texts, like small movie reviews. What we do is instantiate the TfidfVectorizer, and it has some useful arguments you can pass to it. One is min_df, which stands for minimum document frequency and is essentially a cutoff threshold for low-frequency n-grams, because we want to throw them away. We can also set max_df, a threshold on the maximum number of documents in which a token may be seen, and this is done to strip away stop words; in scikit-learn we pass that argument as a proportion of documents rather than an absolute number of documents. The last argument is ngram_range, which tells the TF-IDF vectorizer which n-grams should be used in this bag-of-words representation; in this scenario we take 1-grams and 2-grams.
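Here is a minimal sketch of that setup, assuming scikit-learn's TfidfVectorizer; the five short reviews and the exact min_df and max_df values below are illustrative rather than the ones from the slide:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Five short toy reviews (illustrative, not the exact texts from the slide).
texts = [
    "good movie",
    "not a good movie",
    "did not like",
    "i like it",
    "good one",
]

# min_df=2          : drop n-grams seen in fewer than 2 documents (typos, rare n-grams)
# max_df=0.5        : drop n-grams seen in more than half of the documents (stop-word-like terms)
# ngram_range=(1, 2): use both 1-grams and 2-grams as features
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
features = tfidf.fit_transform(texts)

print(tfidf.get_feature_names_out())  # the surviving 1-grams and 2-grams
print(features.toarray().round(2))    # TF-IDF weights, one L2-normalized row per review
```

Note that scikit-learn uses a slightly smoothed IDF formula by default, so the exact numbers differ from the plain log(N / df) definition above, but the qualitative behaviour is the same.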
If we vectorize our texts we get something like this. Not all possible 1-grams and 2-grams are there, because some of them were filtered out; you can look at the reviews and see why that happened. You can also see that we now have real values in this matrix, because those are TF-IDF values, and each row is normalized to have a norm of one.

Let's summarize. We made simple counter features in a bag-of-words manner: we replaced each text with a huge vector of counters. You can add n-grams to try to preserve some local ordering, and we will see later that this actually improves the quality of text classification. You can replace the counters with TF-IDF values, and that usually gives you a performance boost as well. In the next video, we will train our first model on top of these features.