1 00:00:00,000 --> 00:00:03,537 [MUSIC] 2 00:00:03,537 --> 00:00:07,543 In this video, we will talk about our first text classification model on top of 3 00:00:07,543 --> 00:00:09,400 the features that we have described. 4 00:00:10,600 --> 00:00:13,960 And let's continue with sentiment classification. 5 00:00:13,960 --> 00:00:17,690 We can actually take the IMDB movie reviews dataset, 6 00:00:17,690 --> 00:00:20,343 which you can download; it is freely available. 7 00:00:20,343 --> 00:00:25,680 It contains 25,000 positive and 25,000 negative reviews. 8 00:00:25,680 --> 00:00:29,050 And how did that dataset come about? 9 00:00:29,050 --> 00:00:34,040 You can look at the IMDB website and see that people write reviews 10 00:00:34,040 --> 00:00:41,610 there, and they also provide a rating, from one star to ten stars. 11 00:00:41,610 --> 00:00:44,460 They rate the movie and write the review. 12 00:00:44,460 --> 00:00:50,310 And if you take all those reviews from the IMDB website, 13 00:00:50,310 --> 00:00:54,660 you can actually use that as a dataset for 14 00:00:54,660 --> 00:00:59,860 text classification, because you have a text and a number of stars, 15 00:00:59,860 --> 00:01:03,730 and you can think of the stars as sentiment. 16 00:01:03,730 --> 00:01:08,290 If a review has at least seven stars, you can label it as positive sentiment. 17 00:01:08,290 --> 00:01:13,055 If it has at most four stars, that means it is a bad movie for 18 00:01:13,055 --> 00:01:17,364 that particular person, and that is a negative sentiment. 19 00:01:17,364 --> 00:01:23,590 And that's how you get a dataset for sentiment classification for free. 20 00:01:23,590 --> 00:01:28,630 It contains at most 30 reviews per movie, just to make it less biased towards 21 00:01:28,630 --> 00:01:30,310 any particular movie. 
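The labeling rule just described (at least seven stars is positive, at most four is negative) can be sketched in a few lines; `label_from_stars` is an invented name for illustration, not something from the course materials:

```python
def label_from_stars(stars):
    """Label an IMDB review by its star rating, per the rule above."""
    if stars >= 7:
        return "positive"  # seven or more stars -> positive sentiment
    if stars <= 4:
        return "negative"  # four or fewer stars -> negative sentiment
    return None            # 5-6 star reviews match neither rule

print(label_from_stars(9))  # -> positive
```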
22 00:01:31,380 --> 00:01:36,050 This dataset also provides a 50/50 train/test split so 23 00:01:36,050 --> 00:01:39,893 that future researchers can use the same split, 24 00:01:39,893 --> 00:01:44,430 reproduce the results, and improve on the model. 25 00:01:44,430 --> 00:01:49,420 For evaluation, you can use accuracy, and that works 26 00:01:49,420 --> 00:01:54,030 because we have the same number of positive and negative reviews. 27 00:01:54,030 --> 00:01:59,550 So our dataset is balanced in terms of class sizes, so 28 00:01:59,550 --> 00:02:02,430 we can evaluate accuracy here. 29 00:02:04,250 --> 00:02:07,169 Okay, so let's start with the first model. 30 00:02:07,169 --> 00:02:13,240 Let's take features: a bag of 1-grams with TF-IDF values. 31 00:02:13,240 --> 00:02:17,626 As a result, we will have a feature matrix with 32 00:02:17,626 --> 00:02:22,920 25,000 rows and 75,000 columns, and that is a pretty huge feature matrix. 33 00:02:22,920 --> 00:02:26,140 And what is more, it is extremely sparse. 34 00:02:26,140 --> 00:02:30,520 If you look at how many 0s there are, then you 35 00:02:30,520 --> 00:02:34,688 will see that 99.8% of all values in that matrix are 0s. 36 00:02:34,688 --> 00:02:38,919 So that actually imposes some restrictions on 37 00:02:38,919 --> 00:02:43,589 the models that we can use on top of these features. 38 00:02:45,220 --> 00:02:46,590 And the model that is usable for 39 00:02:46,590 --> 00:02:51,750 these features is logistic regression, which works as follows. 40 00:02:51,750 --> 00:02:57,370 It tries to predict the probability of a review being a positive one, 41 00:02:57,370 --> 00:03:03,380 given the features that we gave the model for that particular review. 42 00:03:03,380 --> 00:03:05,030 And the features that we use, 43 00:03:05,030 --> 00:03:10,190 let me remind you, are the vector of TF-IDF values. 
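The setup described so far can be sketched with scikit-learn; the tiny made-up reviews below stand in for the 25,000-review IMDB training set, so this is only an illustration of the pipeline shape, not the course's actual code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up stand-ins for IMDB reviews: 1 = positive, 0 = negative.
texts = ["a great wonderful movie", "an awful boring film",
         "excellent perfect acting", "the worst waste of time"]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()        # bag of 1-grams with TF-IDF values
X = vectorizer.fit_transform(texts)   # sparse feature matrix, one row per review
model = LogisticRegression().fit(X, labels)

# Probability of a new review being positive.
p = model.predict_proba(vectorizer.transform(["a great film"]))[0, 1]
```

On the real data, `X` would be the 25,000 x 75,000 sparse matrix mentioned above; the sparse format is what makes this feasible.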
44 00:03:10,190 --> 00:03:14,430 And what you can actually do is find a weight for 45 00:03:14,430 --> 00:03:17,730 every feature of that bag-of-words representation. 46 00:03:17,730 --> 00:03:22,440 You can multiply each value, each TF-IDF value, by its weight, 47 00:03:22,440 --> 00:03:28,260 sum all of those terms, and pass the sum through a sigmoid activation function, and 48 00:03:28,260 --> 00:03:30,190 that's how you get the logistic regression model. 49 00:03:31,496 --> 00:03:34,530 And it's actually a linear classification model, and 50 00:03:34,530 --> 00:03:40,690 what's good about that is that, since it's linear, it can handle sparse data. 51 00:03:40,690 --> 00:03:43,470 It's really fast to train, and what's more, 52 00:03:43,470 --> 00:03:46,960 the weights that we get after training can be interpreted. 53 00:03:48,930 --> 00:03:53,970 Let's look at the sigmoid graph at the bottom of the slide. 54 00:03:53,970 --> 00:03:57,340 If you have a linear combination that is close to 0, 55 00:03:57,340 --> 00:04:00,890 that means that the sigmoid will output 0.5. 56 00:04:00,890 --> 00:04:04,520 So the probability of the review being positive is 0.5, 57 00:04:04,520 --> 00:04:08,270 and we really don't know whether it's positive or negative. 58 00:04:08,270 --> 00:04:13,660 But if that linear combination in the argument of our sigmoid function starts 59 00:04:13,660 --> 00:04:19,310 to become more and more positive, so it goes further away from zero, 60 00:04:19,310 --> 00:04:23,340 then you see that the probability of the review being 61 00:04:23,340 --> 00:04:26,470 positive grows really fast. 62 00:04:26,470 --> 00:04:33,710 And that means that if we take the features whose weights are positive, 63 00:04:33,710 --> 00:04:38,190 then those weights will likely correspond to words that are positive. 
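The weighted sum plus sigmoid just described can be written out directly; the per-word weights and TF-IDF values below are invented purely for illustration:

```python
import math

def sigmoid(z):
    # Squashes any real-valued score into a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learnt weights and TF-IDF values for a single review.
weights = {"great": 2.1, "boring": -1.8}
tfidf = {"great": 0.6, "boring": 0.0}

# Linear combination: sum of weight * TF-IDF value over all features.
z = sum(weights[w] * tfidf[w] for w in weights)
p_positive = sigmoid(z)  # probability that the review is positive

print(sigmoid(0.0))  # a score of exactly zero gives 0.5
```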
64 00:04:38,190 --> 00:04:39,870 And if you take negative weights, 65 00:04:39,870 --> 00:04:45,210 they will correspond to words that are negative, like disgusting or awful. 66 00:04:46,900 --> 00:04:51,490 Okay, so logistic regression can work on these features, and we can interpret it. 67 00:04:53,200 --> 00:04:58,990 Let's train logistic regression over the bag of 1-grams with TF-IDF values. 68 00:04:58,990 --> 00:05:05,230 What you can actually see is that accuracy on the test set is 88.5%. 69 00:05:05,230 --> 00:05:10,274 And that is a huge jump from a random classifier, which gives 50% accuracy. 70 00:05:10,274 --> 00:05:16,460 Let's look at the learnt features, because linear models can be interpreted. 71 00:05:16,460 --> 00:05:20,801 If we look at the top positive weights, then we will see such words as great, 72 00:05:20,801 --> 00:05:23,446 excellent, perfect, best, wonderful. 73 00:05:23,446 --> 00:05:28,045 So it's really cool, because the model captured the sentiment 74 00:05:28,045 --> 00:05:33,481 of those words, and it knows nothing about the English language; 75 00:05:33,481 --> 00:05:37,190 it knows only the examples that we provided it with. 76 00:05:37,190 --> 00:05:39,580 And if we take the top negative weights, 77 00:05:39,580 --> 00:05:44,010 then you will see words like worst, awful, bad, waste, boring, and so forth. 78 00:05:45,270 --> 00:05:49,223 So these words clearly have negative sentiment, and 79 00:05:49,223 --> 00:05:52,470 the model has learnt it from the examples. 80 00:05:52,470 --> 00:05:53,170 That is pretty cool. 81 00:05:54,310 --> 00:05:58,418 Let's try to make this model a little bit better; we know how to do that. 82 00:05:58,418 --> 00:06:01,000 Let's introduce 2-grams to our model. 83 00:06:02,630 --> 00:06:07,910 But before we can move further, we should throw away some n-grams that 84 00:06:07,910 --> 00:06:12,935 are not frequent, that are seen, let's say, fewer than 5 times. 
85 00:06:12,935 --> 00:06:18,075 Because those n-grams are most likely either typos or very 86 00:06:18,075 --> 00:06:22,892 rare combinations that people almost never use, and 87 00:06:22,892 --> 00:06:27,174 it actually doesn't make sense to look at those 88 00:06:27,174 --> 00:06:31,610 features because we will most likely overfit. 89 00:06:31,610 --> 00:06:34,120 So we want to throw them away. 90 00:06:34,120 --> 00:06:39,280 And if you introduce 2-grams and that thresholding on minimum frequency, 91 00:06:39,280 --> 00:06:44,602 you will actually get the dimensions of our feature matrix 92 00:06:44,602 --> 00:06:49,850 as the following: 25,000 by 150,000. 93 00:06:49,850 --> 00:06:53,670 So that is a pretty huge matrix, but we can still use linear models, and 94 00:06:53,670 --> 00:06:54,439 it just works. 95 00:06:55,450 --> 00:06:58,913 Let's train logistic regression over this bag of 1- and 96 00:06:58,913 --> 00:07:01,430 2-grams with TF-IDF values. 97 00:07:01,430 --> 00:07:05,800 And what we actually observe is that accuracy on the test set gets a bump. 98 00:07:05,800 --> 00:07:09,620 It gets a 1.5% accuracy boost. 99 00:07:09,620 --> 00:07:13,140 And now we are very close to 90% accuracy. 100 00:07:13,140 --> 00:07:14,851 Let's look at the learnt weights. 101 00:07:14,851 --> 00:07:20,177 If you look at the top positive weights, then you will see that our 2-grams 102 00:07:20,177 --> 00:07:26,115 are actually used by our model, because now it looks at 2-grams like well worth or 103 00:07:26,115 --> 00:07:32,390 better than, and it thinks that those 2-grams have positive sentiment. 104 00:07:32,390 --> 00:07:36,020 If, on the contrary, you look at the top negative weights, 105 00:07:36,020 --> 00:07:38,060 then you will see the worst. 106 00:07:38,060 --> 00:07:45,080 That is another 2-gram that is now used by our model to predict the final sentiment. 107 00:07:45,080 --> 00:07:48,410 You might think that, okay, it doesn't make any sense. 
108 00:07:48,410 --> 00:07:54,810 So the worst is just the same thing as worst, just like well worth is the same as just worth. 109 00:07:54,810 --> 00:07:58,140 And maybe it is, but 110 00:07:58,140 --> 00:08:03,250 that 1.5% improvement in accuracy was 111 00:08:03,250 --> 00:08:08,430 actually provided by the addition of those 2-grams to our model. 112 00:08:08,430 --> 00:08:13,110 So you can either believe it or not, but it actually increases performance. 114 00:08:14,490 --> 00:08:16,170 How to make it even better? 115 00:08:16,170 --> 00:08:19,760 You can play around with tokenization, because in reviews, 116 00:08:19,760 --> 00:08:22,510 people use different stuff like emojis. 117 00:08:22,510 --> 00:08:26,410 They use smiley faces written with text. 118 00:08:26,410 --> 00:08:31,500 They can use a bunch of exclamation marks, 119 00:08:31,500 --> 00:08:34,740 a lot of exclamation marks. 120 00:08:34,740 --> 00:08:39,560 And you can actually treat those sequences 121 00:08:39,560 --> 00:08:41,616 as different tokens. 122 00:08:41,616 --> 00:08:45,521 You can introduce them to your model, and 123 00:08:45,521 --> 00:08:49,693 maybe you will get better sentiment classification, 124 00:08:49,693 --> 00:08:55,392 because a smiling face is better than an angry face, and the model can use that. 125 00:08:55,392 --> 00:09:01,510 You could also try to normalize tokens by applying stemming or lemmatization. 126 00:09:01,510 --> 00:09:05,736 You can try different models, like SVM or Naive Bayes, or 127 00:09:05,736 --> 00:09:09,369 any other model that can handle sparse features. 128 00:09:09,369 --> 00:09:12,942 Or you can throw bag of words away and use deep learning techniques to squeeze the maximum accuracy from that dataset. 
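One possible sketch of such a tokenizer, with a hand-rolled regular expression that keeps text smileys and runs of exclamation marks as tokens of their own (the pattern is illustrative, not a standard one):

```python
import re

# Matches text emoticons like :) ;-( :D, runs of exclamation marks, or words.
TOKEN_RE = re.compile(r"[:;]-?[()DPdp]|!+|[a-z']+", re.IGNORECASE)

def tokenize(text):
    # Lowercase word tokens; leave emoticons and punctuation untouched.
    return [t.lower() if t[0].isalpha() else t
            for t in TOKEN_RE.findall(text)]

print(tokenize("Great movie :) loved it!!!"))
# -> ['great', 'movie', ':)', 'loved', 'it', '!!!']
```

A tokenizer like this can be passed to a vectorizer so that ":)" and "!!!" become ordinary features with their own learnt weights.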
129 00:09:19,580 --> 00:09:25,030 And as of 2016, accuracy on this particular dataset 130 00:09:25,030 --> 00:09:29,970 is close to 92%, and that is a 2.5% improvement over the best 131 00:09:29,970 --> 00:09:35,090 model that we can get with bag of words and 2-grams. 132 00:09:35,090 --> 00:09:40,830 So that might not seem like a very big improvement, 133 00:09:40,830 --> 00:09:46,580 but it can actually matter in some tasks where you can get a lot 134 00:09:46,580 --> 00:09:51,430 of money even for a 1% improvement, like ad click prediction or anything like that. 135 00:09:52,620 --> 00:09:54,507 So let's summarize. 136 00:09:54,507 --> 00:10:00,250 Bag of words and simple linear models over those features actually work. 137 00:10:00,250 --> 00:10:06,890 And you can add 2-grams, which comes almost for free, and you get a better model. 138 00:10:06,890 --> 00:10:10,610 The accuracy gain from deep learning models is not mind-blowing, but 139 00:10:10,610 --> 00:10:14,380 it is still there, and you might consider using deep learning 140 00:10:14,380 --> 00:10:18,850 techniques to solve sentiment classification problems. 141 00:10:18,850 --> 00:10:22,571 In the next video, we will look at the spam filtering task, 142 00:10:22,571 --> 00:10:27,886 another example of text classification that can be handled in a different way. 143 00:10:27,886 --> 00:10:37,886 [MUSIC]