1 00:00:00,000 --> 00:00:03,537 [MUSIC] 2 00:00:03,537 --> 00:00:07,543 In this video, we will talk about our first text classification model on top of 3 00:00:07,543 --> 00:00:09,400 the features that we have described. 4 00:00:10,600 --> 00:00:13,960 And let's continue with sentiment classification. 5 00:00:13,960 --> 00:00:17,690 We can actually take the IMDB movie reviews dataset, 6 00:00:17,690 --> 00:00:20,343 which you can download; it is freely available. 7 00:00:20,343 --> 00:00:25,680 It contains 25,000 positive and 25,000 negative reviews. 8 00:00:25,680 --> 00:00:29,050 And how did that dataset come about? 9 00:00:29,050 --> 00:00:34,040 You can look at the IMDB website and see that people write reviews 10 00:00:34,040 --> 00:00:41,610 there, and they also provide a rating, from one star to ten stars. 11 00:00:41,610 --> 00:00:44,460 They rate the movie and write the review. 12 00:00:44,460 --> 00:00:50,310 And if you take all those reviews from the IMDB website, 13 00:00:50,310 --> 00:00:54,660 you can actually use that as a dataset for 14 00:00:54,660 --> 00:00:59,860 text classification, because you have a text and a number of stars, 15 00:00:59,860 --> 00:01:03,730 and you can think of the stars as sentiment. 16 00:01:03,730 --> 00:01:08,290 If a review has at least seven stars, you can label it as positive sentiment. 17 00:01:08,290 --> 00:01:13,055 If it has at most four stars, that means it is a bad movie for 18 00:01:13,055 --> 00:01:17,364 that particular person, and that is a negative sentiment. 19 00:01:17,364 --> 00:01:23,590 And that's how you get a dataset for sentiment classification for free. 20 00:01:23,590 --> 00:01:28,630 It contains at most 30 reviews per movie, just to make it less biased towards 21 00:01:28,630 --> 00:01:30,310 any particular movie. 
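The labeling rule just described (at least seven stars is positive, at most four is negative) can be sketched in a few lines; `label_from_stars` is an invented name for illustration, not something from the course materials:

```python
def label_from_stars(stars):
    """Label an IMDB review by its star rating, per the rule above."""
    if stars >= 7:
        return "positive"  # seven or more stars -> positive sentiment
    if stars <= 4:
        return "negative"  # four or fewer stars -> negative sentiment
    return None            # 5-6 star reviews match neither rule

print(label_from_stars(9))  # -> positive
```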
22 00:01:31,380 --> 00:01:36,050 This dataset also provides a 50/50 train/test split so 23 00:01:36,050 --> 00:01:39,893 that future researchers can use the same split, 24 00:01:39,893 --> 00:01:44,430 reproduce the results, and improve on the model. 25 00:01:44,430 --> 00:01:49,420 For evaluation, you can use accuracy, and that works 26 00:01:49,420 --> 00:01:54,030 because we have the same number of positive and negative reviews. 27 00:01:54,030 --> 00:01:59,550 So our dataset is balanced in terms of class sizes, so 28 00:01:59,550 --> 00:02:02,430 we can evaluate accuracy here. 29 00:02:04,250 --> 00:02:07,169 Okay, so let's start with the first model. 30 00:02:07,169 --> 00:02:13,240 Let's take features: a bag of 1-grams with TF-IDF values. 31 00:02:13,240 --> 00:02:17,626 As a result, we will have a feature matrix with 32 00:02:17,626 --> 00:02:22,920 25,000 rows and 75,000 columns, and that is a pretty huge feature matrix. 33 00:02:22,920 --> 00:02:26,140 And what is more, it is extremely sparse. 34 00:02:26,140 --> 00:02:30,520 If you look at how many 0s there are, then you 35 00:02:30,520 --> 00:02:34,688 will see that 99.8% of all values in that matrix are 0s. 36 00:02:34,688 --> 00:02:38,919 So that actually imposes some restrictions on 37 00:02:38,919 --> 00:02:43,589 the models that we can use on top of these features. 38 00:02:45,220 --> 00:02:46,590 And the model that is usable for 39 00:02:46,590 --> 00:02:51,750 these features is logistic regression, which works as follows. 40 00:02:51,750 --> 00:02:57,370 It tries to predict the probability of a review being a positive one, 41 00:02:57,370 --> 00:03:03,380 given the features that we gave the model for that particular review. 42 00:03:03,380 --> 00:03:05,030 And the features that we use, 43 00:03:05,030 --> 00:03:10,190 let me remind you, are the vector of TF-IDF values. 
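The setup described so far can be sketched with scikit-learn; the tiny made-up reviews below stand in for the 25,000-review IMDB training set, so this is only an illustration of the pipeline shape, not the course's actual code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up stand-ins for IMDB reviews: 1 = positive, 0 = negative.
texts = ["a great wonderful movie", "an awful boring film",
         "excellent perfect acting", "the worst waste of time"]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()        # bag of 1-grams with TF-IDF values
X = vectorizer.fit_transform(texts)   # sparse feature matrix, one row per review
model = LogisticRegression().fit(X, labels)

# Probability of a new review being positive.
p = model.predict_proba(vectorizer.transform(["a great film"]))[0, 1]
```

On the real data, `X` would be the 25,000 x 75,000 sparse matrix mentioned above; the sparse format is what makes this feasible.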
44 00:03:10,190 --> 00:03:14,430 And what you can actually do is find a weight for 45 00:03:14,430 --> 00:03:17,730 every feature of that bag-of-words representation. 46 00:03:17,730 --> 00:03:22,440 You can multiply each value, each TF-IDF value, by its weight, 47 00:03:22,440 --> 00:03:28,260 sum all of those terms, and pass the sum through a sigmoid activation function, and 48 00:03:28,260 --> 00:03:30,190 that's how you get the logistic regression model. 49 00:03:31,496 --> 00:03:34,530 And it's actually a linear classification model, and 50 00:03:34,530 --> 00:03:40,690 what's good about that is that, since it's linear, it can handle sparse data. 51 00:03:40,690 --> 00:03:43,470 It's really fast to train, and what's more, 52 00:03:43,470 --> 00:03:46,960 the weights that we get after training can be interpreted. 53 00:03:48,930 --> 00:03:53,970 Let's look at the sigmoid graph at the bottom of the slide. 54 00:03:53,970 --> 00:03:57,340 If you have a linear combination that is close to 0, 55 00:03:57,340 --> 00:04:00,890 that means that the sigmoid will output 0.5. 56 00:04:00,890 --> 00:04:04,520 So the probability of the review being positive is 0.5, 57 00:04:04,520 --> 00:04:08,270 and we really don't know whether it's positive or negative. 58 00:04:08,270 --> 00:04:13,660 But if that linear combination in the argument of our sigmoid function starts 59 00:04:13,660 --> 00:04:19,310 to become more and more positive, so it goes further away from zero, 60 00:04:19,310 --> 00:04:23,340 then you see that the probability of the review being 61 00:04:23,340 --> 00:04:26,470 positive grows really fast. 62 00:04:26,470 --> 00:04:33,710 And that means that if we take the features whose weights are positive, 63 00:04:33,710 --> 00:04:38,190 then those weights will likely correspond to words that are positive. 
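The weighted sum plus sigmoid just described can be written out directly; the per-word weights and TF-IDF values below are invented purely for illustration:

```python
import math

def sigmoid(z):
    # Squashes any real-valued score into a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learnt weights and TF-IDF values for a single review.
weights = {"great": 2.1, "boring": -1.8}
tfidf = {"great": 0.6, "boring": 0.0}

# Linear combination: sum of weight * TF-IDF value over all features.
z = sum(weights[w] * tfidf[w] for w in weights)
p_positive = sigmoid(z)  # probability that the review is positive

print(sigmoid(0.0))  # a score of exactly zero gives 0.5
```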
64 00:04:38,190 --> 00:04:39,870 And if you take negative weights, 65 00:04:39,870 --> 00:04:45,210 they will correspond to words that are negative, like disgusting or awful. 66 00:04:46,900 --> 00:04:51,490 Okay, so logistic regression can work on these features, and we can interpret it. 67 00:04:53,200 --> 00:04:58,990 Let's train logistic regression over the bag of 1-grams with TF-IDF values. 68 00:04:58,990 --> 00:05:05,230 What you can actually see is that accuracy on the test set is 88.5%. 69 00:05:05,230 --> 00:05:10,274 And that is a huge jump from a random classifier, which gives 50% accuracy. 70 00:05:10,274 --> 00:05:16,460 Let's look at the learnt features, because linear models can be interpreted. 71 00:05:16,460 --> 00:05:20,801 If we look at the top positive weights, then we will see such words as great, 72 00:05:20,801 --> 00:05:23,446 excellent, perfect, best, wonderful. 73 00:05:23,446 --> 00:05:28,045 So it's really cool, because the model captured the sentiment 74 00:05:28,045 --> 00:05:33,481 of those words, and it knows nothing about the English language; 75 00:05:33,481 --> 00:05:37,190 it knows only the examples that we provided it with. 76 00:05:37,190 --> 00:05:39,580 And if we take the top negative weights, 77 00:05:39,580 --> 00:05:44,010 then you will see words like worst, awful, bad, waste, boring, and so forth. 78 00:05:45,270 --> 00:05:49,223 So these words clearly have negative sentiment, and 79 00:05:49,223 --> 00:05:52,470 the model has learnt it from the examples. 80 00:05:52,470 --> 00:05:53,170 That is pretty cool. 81 00:05:54,310 --> 00:05:58,418 Let's try to make this model a little bit better; we know how to do that. 82 00:05:58,418 --> 00:06:01,000 Let's introduce 2-grams to our model. 83 00:06:02,630 --> 00:06:07,910 But before we can move further, we should throw away some n-grams that 84 00:06:07,910 --> 00:06:12,935 are not frequent, that are seen, let's say, fewer than 5 times. 
85 00:06:12,935 --> 00:06:18,075 Because those n-grams are most likely either typos or very 86 00:06:18,075 --> 00:06:22,892 rare combinations that people almost never use, and 87 00:06:22,892 --> 00:06:27,174 it actually doesn't make sense to look at those 88 00:06:27,174 --> 00:06:31,610 features because we will most likely overfit. 89 00:06:31,610 --> 00:06:34,120 So we want to throw them away. 90 00:06:34,120 --> 00:06:39,280 And if you introduce 2-grams and that thresholding on minimum frequency, 91 00:06:39,280 --> 00:06:44,602 you will actually get the dimensions of our feature matrix 92 00:06:44,602 --> 00:06:49,850 as the following: 25,000 by 150,000. 93 00:06:49,850 --> 00:06:53,670 So that is a pretty huge matrix, but we can still use linear models, and 94 00:06:53,670 --> 00:06:54,439 it just works. 95 00:06:55,450 --> 00:06:58,913 Let's train logistic regression over this bag of 1- and 96 00:06:58,913 --> 00:07:01,430 2-grams with TF-IDF values. 97 00:07:01,430 --> 00:07:05,800 And what we actually observe is that accuracy on the test set gets a bump. 98 00:07:05,800 --> 00:07:09,620 It gets a 1.5% accuracy boost. 99 00:07:09,620 --> 00:07:13,140 And now we are very close to 90% accuracy. 100 00:07:13,140 --> 00:07:14,851 Let's look at the learnt weights. 101 00:07:14,851 --> 00:07:20,177 If you look at the top positive weights, then you will see that our 2-grams 102 00:07:20,177 --> 00:07:26,115 are actually used by our model, because now it looks at 2-grams like well worth or 103 00:07:26,115 --> 00:07:32,390 better than, and it thinks that those 2-grams have positive sentiment. 104 00:07:32,390 --> 00:07:36,020 If, on the contrary, you look at the top negative weights, 105 00:07:36,020 --> 00:07:38,060 then you will see the worst. 106 00:07:38,060 --> 00:07:45,080 That is another 2-gram that is now used by our model to predict the final sentiment. 107 00:07:45,080 --> 00:07:48,410 You might think that, okay, it doesn't make any sense. 
108 00:07:48,410 --> 00:07:54,810 So the worst is just the same thing as worst, just like well worth is the same as just worth. 109 00:07:54,810 --> 00:07:58,140 And maybe it is, but 110 00:07:58,140 --> 00:08:03,250 that 1.5% improvement in accuracy was 111 00:08:03,250 --> 00:08:08,430 actually provided by the addition of those 2-grams to our model. 112 00:08:08,430 --> 00:08:13,110 So you can either believe it or not, but it actually increases performance. 114 00:08:14,490 --> 00:08:16,170 How to make it even better? 115 00:08:16,170 --> 00:08:19,760 You can play around with tokenization, because in reviews, 116 00:08:19,760 --> 00:08:22,510 people use different stuff like emojis. 117 00:08:22,510 --> 00:08:26,410 They use smiley faces written with text. 118 00:08:26,410 --> 00:08:31,500 They can use a bunch of exclamation marks, 119 00:08:31,500 --> 00:08:34,740 a lot of exclamation marks. 120 00:08:34,740 --> 00:08:39,560 And you can actually treat those sequences 121 00:08:39,560 --> 00:08:41,616 as different tokens. 122 00:08:41,616 --> 00:08:45,521 You can introduce them to your model, and 123 00:08:45,521 --> 00:08:49,693 maybe you will get better sentiment classification, 124 00:08:49,693 --> 00:08:55,392 because a smiling face is better than an angry face, and the model can use that. 125 00:08:55,392 --> 00:09:01,510 You could also try to normalize tokens by applying stemming or lemmatization. 126 00:09:01,510 --> 00:09:05,736 You can try different models, like SVM or Naive Bayes, or 127 00:09:05,736 --> 00:09:09,369 any other model that can handle sparse features. 128 00:09:09,369 --> 00:09:12,942 Or you can throw bag of words away and use deep learning techniques to squeeze the maximum accuracy from that dataset. 
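One possible sketch of such a tokenizer, with a hand-rolled regular expression that keeps text smileys and runs of exclamation marks as tokens of their own (the pattern is illustrative, not a standard one):

```python
import re

# Matches text emoticons like :) ;-( :D, runs of exclamation marks, or words.
TOKEN_RE = re.compile(r"[:;]-?[()DPdp]|!+|[a-z']+", re.IGNORECASE)

def tokenize(text):
    # Lowercase word tokens; leave emoticons and punctuation untouched.
    return [t.lower() if t[0].isalpha() else t
            for t in TOKEN_RE.findall(text)]

print(tokenize("Great movie :) loved it!!!"))
# -> ['great', 'movie', ':)', 'loved', 'it', '!!!']
```

A tokenizer like this can be passed to a vectorizer so that ":)" and "!!!" become ordinary features with their own learnt weights.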
129 00:09:19,580 --> 00:09:25,030 And as of 2016, accuracy on this particular dataset 130 00:09:25,030 --> 00:09:29,970 is close to 92%, and that is a 2.5% improvement over the best 131 00:09:29,970 --> 00:09:35,090 model that we can get with bag of words and 2-grams. 132 00:09:35,090 --> 00:09:40,830 So that might not seem like a very big improvement, 133 00:09:40,830 --> 00:09:46,580 but it can actually matter in some tasks where you can get a lot 134 00:09:46,580 --> 00:09:51,430 of money even for a 1% improvement, like ad click prediction or anything like that. 135 00:09:52,620 --> 00:09:54,507 So let's summarize. 136 00:09:54,507 --> 00:10:00,250 Bag of words and simple linear models over those features actually work. 137 00:10:00,250 --> 00:10:06,890 And you can add 2-grams, which comes almost for free, and you get a better model. 138 00:10:06,890 --> 00:10:10,610 The accuracy gain from deep learning models is not mind-blowing, but 139 00:10:10,610 --> 00:10:14,380 it is still there, and you might consider using deep learning 140 00:10:14,380 --> 00:10:18,850 techniques to solve sentiment classification problems. 141 00:10:18,850 --> 00:10:22,571 In the next video, we will look at the spam filtering task, 142 00:10:22,571 --> 00:10:27,886 another example of text classification that can be handled in a different way. 143 00:10:27,886 --> 00:10:37,886 [MUSIC]