[MUSIC] In this video, we will build our first text classification model on top of the features we have described, continuing with sentiment classification. We can take the IMDB movie reviews dataset, which is freely available for download. It contains 25,000 positive and 25,000 negative reviews. How did that dataset come about? On the IMDB website, people write reviews and also rate the movie, from one star to ten stars. If you take all those reviews, you can use them as a dataset for text classification, because you have a text and a number of stars, and you can think of the stars as sentiment: a review with at least seven stars is labeled as positive, and a review with at most four stars means the movie was bad for that particular person, so it is labeled as negative. That is how you get a dataset for sentiment classification for free. The dataset contains at most 30 reviews per movie, to keep it from being biased towards any particular movie. It also comes with a 50/50 train/test split, so that future researchers can use the same split, reproduce results, and improve on the model. For evaluation we can use accuracy, and that works because we have the same number of positive and negative reviews: the dataset is balanced in terms of class sizes, so accuracy is a reasonable metric.

Okay, let's start with the first model. Let's take bag-of-1-grams features with TF-IDF values. The result is a feature matrix with 25,000 rows and roughly 75,000 columns, which is pretty huge. What is more, it is extremely sparse: if you count the zeros, you will find that 99.8% of all values in that matrix are zeros. That places some restrictions on the models we can use on top of these features. A model that works well for such features is logistic regression. It predicts the probability that a review is positive, given the features we extracted for that review, which, let me remind you, are the vector of TF-IDF values. You learn a weight for every feature of the bag-of-words representation, multiply each TF-IDF value by its weight, sum all of those products, and pass the sum through a sigmoid function, sigmoid(z) = 1 / (1 + e^(-z)). That is the logistic regression model. It is a linear classification model, and that is good for us: since it is linear, it can handle sparse data, it is really fast to train, and, what is more, the weights we get after training can be interpreted.

Look at the sigmoid graph at the bottom of the slide. If the linear combination is close to zero, the sigmoid outputs 0.5, so the probability of the review being positive is 0.5, and we really don't know whether it is positive or negative. But as the linear combination in the argument of the sigmoid becomes more and more positive, moving further away from zero, the probability of the review being positive grows really fast. That means features with large positive weights will likely correspond to positive words, and features with large negative weights to negative words like disgusting or awful.
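To make this concrete, here is a minimal sketch of that first model. The video does not name a library, so scikit-learn is an assumption here, and the two-review corpus is a toy stand-in for the 25,000 IMDB training reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the IMDB training reviews and their labels
# (1 = positive sentiment, 0 = negative sentiment).
train_texts = ["a great and excellent movie, simply wonderful",
               "an awful, boring waste of time"]
train_labels = [1, 0]

# Bag of 1-grams with TF-IDF values: each review becomes one sparse row
# (on the full IMDB training set, ~75,000 columns with 99.8% zeros).
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Logistic regression: a weighted sum of the TF-IDF values is passed
# through a sigmoid, giving P(review is positive | its features).
clf = LogisticRegression()
clf.fit(X_train, train_labels)
print(clf.predict_proba(vectorizer.transform(["a wonderful movie"]))[0, 1])
```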
Okay, so logistic regression can work on these features and we can interpret it. Let's train logistic regression over the bag of 1-grams with TF-IDF values. The accuracy on the test set turns out to be 88.5%, which is a huge jump from a random classifier, which gives 50% accuracy. Let's look at the learnt features, since linear models can be interpreted. Among the top positive weights we see words like great, excellent, perfect, best, and wonderful. That is really cool: the model captured the sentiment of those words, and it knows nothing about the English language, only the examples we provided it with. Among the top negative weights we see words like worst, awful, bad, waste, boring, and so forth. These words clearly carry negative sentiment, and the model has learnt that from the examples. That is pretty cool.

Let's try to make this model a little bit better; we know how to do that. Let's introduce 2-grams into our model. Before we move further, we should throw away n-grams that are not frequent, say those seen fewer than five times, because such n-grams are most likely typos or phrasings that hardly anyone uses, and keeping those features does not make sense: we would most likely overfit. If you introduce 2-grams together with that minimum-frequency threshold, you get a feature matrix of 25,000 by roughly 150,000. That is a pretty huge matrix, but we can still use linear models, and it just works. Let's train logistic regression over this bag of 1- and 2-grams with TF-IDF values (a code sketch follows at the end of this section). What we observe is that accuracy on the test set gets a boost of about 1.5%, and now we are very close to 90% accuracy. Let's look at the learnt weights. Among the top positive weights you will see that the 2-grams are actually used: the model looks at 2-grams like "well worth" or "better than" and treats them as having positive sentiment. Among the top negative weights you will see "the worst", another 2-gram the model now uses to predict the final sentiment. You might think this does not make any sense: "the worst" carries the same information as plain "worst", and "well worth" the same as "worth". Maybe so, but that 1.5% improvement in accuracy came precisely from adding those 2-grams to the model. Believe it or not, it actually increases performance.

How can we make it even better? You can play around with tokenization, because in reviews people use all kinds of things: emoji, smiley faces written with punctuation, long runs of exclamation marks. You can treat those sequences as separate tokens and introduce them to your model, and maybe you will get better sentiment classification, because a smiling face signals something different from an angry face, and the model can use that. You can also try to normalize tokens by applying stemming or lemmatization. And you can try different models, like SVM or Naive Bayes, or any other model that can handle sparse features.
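Here is the promised sketch of the 2-gram model and the weight inspection. It continues the snippet above, but now assumes `train_texts` and `train_labels` hold the full 25,000-review IMDB training set (with only a couple of toy reviews, the `min_df=5` threshold would prune every feature); scikit-learn is again an assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Bag of 1- and 2-grams; min_df=5 drops any n-gram seen fewer than
# 5 times (likely typos or very rare phrasings that invite overfitting).
# On the full IMDB training set this yields roughly 150,000 columns.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
X_train = vectorizer.fit_transform(train_texts)

clf = LogisticRegression()
clf.fit(X_train, train_labels)

# The model is linear, so we can interpret it by ranking features
# by their learnt weights.
names = np.array(vectorizer.get_feature_names_out())
order = np.argsort(clf.coef_[0])
print("top negative:", names[order[:5]])   # e.g. worst, awful, "the worst"
print("top positive:", names[order[-5:]])  # e.g. great, "well worth"
```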
Or you can throw bag of words away altogether and use deep learning techniques to squeeze the maximum accuracy out of this dataset. As of 2016, the best accuracy on this particular dataset is close to 92%, a 2.5% improvement over the best model we can get with bag of words and 2-grams. That might not seem like a big improvement, but it can matter a lot in tasks where even a 1% improvement brings in a lot of money, like ad click prediction. Let's summarize. Bag of words with simple linear models on top of those features actually works. You can add 2-grams essentially for free and get a better model. The accuracy gained from deep learning models is not mind-blowing, but it is still there, and you might consider using deep learning techniques for sentiment classification. In the next video, we will look at the spam filtering task, another example of text classification that can be handled in a different way. [MUSIC]