Sentiment analysis has become a very popular big data application in recent years. Consider, for example, the hundreds of millions of tweets posted every day. As a result, organizations can listen to the voice of their consumers like never before. Manufacturers of consumer goods, from electronics to food products, are able to measure the popularity of their brands versus those of their competitors by simply counting the number of positive versus negative comments that they find on forums like Twitter, Facebook, email, wherever. They even convert the voice recordings from their call centers to text, and then measure the amount of positive sentiment versus negative. Well, most often the sentiment is negative, because people don't say many positive things about brands; but when they do, that probably means the brand is doing very well. So you can build that bias in, but this has become an extremely popular application in the big data world, and let's see how it works using Bayesian machine learning. Think about a few comments that all of you might have been posting on the forum. Now, I haven't used real comments; I made these up, obviously. There aren't that many comments on the forum; I wish there were. Be that as it may, think about comments like this one; it's a positive comment.
There may be lots of these, but there may be some which are very negative, and so on and so forth. Suppose we were able to manually label comments as being positive or negative. For this we obviously use human intuition, and not some automated technique. This is called the training phase. After that, could we figure out, using Naive Bayes, whether a new comment is positive or negative? Let's see how to do that. First, we need to compute the a priori probability, that is, the overall chance that a comment is positive, which is simply the number of positive comments divided by the total number of comments, which is 8,000 in this case, 6,000 of them being positive. Similarly, the a priori probability of a comment being negative is the number of negative comments divided by the total. Then we need to compute the likelihoods. For example, the probability that the word "like" occurs within the positive comments: of the 6,000 positive comments, "like" occurs in only 2,000 of them, so we get this likelihood. The probability that "enjoy" occurs amongst the positive comments can be computed similarly. Now, notice that there are no comments which are positive but include the word "hate". We can't put zero for this, because that would make the formula go completely out of whack.
So we replace those zeros by a low number, like one. This is called smoothing in the Naive Bayes classifier, and it is important because we have to include all the likelihood probabilities. Now we do the same thing for the negative comments. The probability that "hate" occurs amongst negative comments is 800 out of the 2,000 negative comments, because these comments contain "hate". The probability that "war" occurs is one; "like" occurs, again by smoothing, with probability 1/2,000; and so on. Note also that the probability that "enjoy", which might be thought of as a positive word, occurs amongst the negative comments is not that small. I will come to this phenomenon later. The reason is that "enjoy" is occurring along with a negative term, and similarly with other such words. For the moment, we factor in all the likelihood probabilities simply by looking at the words and their occurrences, and we'll worry about things like "not" next week, or rather the week afterwards. We're only going to consider the words marked in bold, and for these words the likelihood probabilities look like this. So we can compute the probability that "like" occurs amongst all the positive comments, and so on, and the same thing is done for the negative comments. You can work this out for yourself; it's a good exercise to do.
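The training-phase arithmetic described above can be sketched in a few lines of Python. Only the counts actually stated in the lecture are used here (8,000 labeled comments, 6,000 positive and 2,000 negative; "like" in 2,000 positive comments; "hate" in 800 negative comments); the smoothing value of one follows the lecture's choice.

```python
# Training-phase sketch: a priori probabilities and smoothed likelihoods,
# using the lecture's made-up counts.

pos_total, neg_total = 6000, 2000
total = pos_total + neg_total

# A priori probabilities.
p_pos = pos_total / total          # 6000/8000 = 0.75
p_neg = neg_total / total          # 2000/8000 = 0.25

def likelihood(count, class_total):
    """Smoothed likelihood: a zero count is replaced by a low number (1),
    so no probability in the product is ever exactly zero."""
    return max(count, 1) / class_total

# 'like' occurs in 2,000 of the 6,000 positive comments.
p_like_given_pos = likelihood(2000, pos_total)   # 1/3
# 'hate' never occurs in a positive comment: smoothed to 1/6000.
p_hate_given_pos = likelihood(0, pos_total)      # 1/6000
# 'hate' occurs in 800 of the 2,000 negative comments.
p_hate_given_neg = likelihood(800, neg_total)    # 0.4
# 'like' never occurs in a negative comment: smoothed to 1/2000.
p_like_given_neg = likelihood(0, neg_total)      # 1/2000
```

Replacing a zero count with one rather than zero is the simplest form of smoothing; without it, a single unseen word would force the whole product of likelihoods to zero.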
Now, faced with a new tweet, say, "I really liked this simple course a lot", something we haven't seen before, we can compute the likelihood ratio. The numerator includes the probability of "like" occurring, because "like" happens to occur in this tweet, given that the tweet is positive, and so on; everything in the numerator is conditioned on the tweet being positive. And for "hate", which does not occur, we compute the probability that "hate" does not occur, given that the sentiment is positive, by taking one minus the likelihood. Clearly "hate" can either be there or not be there, so if it's not there, the probability of not-"hate" given positive is one minus the probability of "hate" given positive. We include every possible word amongst the bold words that we have considered; even for those which don't occur in this tweet, we include their probabilities by taking one minus. Lastly, we multiply by the a priori probability. Similarly for the denominator. We get a likelihood ratio of 0.026 over a very small number, 0.00005, which is very much larger than one. So the system can easily label this tweet as being positive without ever having seen it before. This is an example of a machine having learned to identify which tweets are positive and which are negative, based on historical data, using the Naive Bayes classifier.
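The classification step can be sketched as follows. For brevity the vocabulary here is cut down to just two of the bold words, "like" and "hate", so the resulting ratio is illustrative and differs from the 0.026/0.00005 quoted above; mapping "liked" in the tweet to "like" (simple stemming) is also an assumption, not something the lecture specifies.

```python
# Classification sketch: score a new tweet under each class by multiplying,
# for every vocabulary word, the likelihood of its presence (p) or absence
# (1 - p), and finally the class prior; then take the ratio.

def posterior_score(words_present, vocab, p_word_given_class, prior):
    """Naive Bayes numerator for one class."""
    score = prior
    for w in vocab:
        p = p_word_given_class[w]
        score *= p if w in words_present else (1 - p)
    return score

vocab = ["like", "hate"]                     # the words marked in bold
p_given_pos = {"like": 2000 / 6000, "hate": 1 / 6000}
p_given_neg = {"like": 1 / 2000, "hate": 800 / 2000}

# "I really liked this simple course a lot" -- assume simple stemming
# maps "liked" to "like" before lookup.
tweet_words = {"i", "really", "like", "this", "simple", "course", "a", "lot"}

num = posterior_score(tweet_words, vocab, p_given_pos, prior=6000 / 8000)
den = posterior_score(tweet_words, vocab, p_given_neg, prior=2000 / 8000)

ratio = num / den                            # much larger than one
label = "positive" if ratio > 1 else "negative"
```

Because "like" is common among positive comments and "hate" is absent from the tweet, the numerator dwarfs the denominator and the tweet is labeled positive, mirroring the lecture's conclusion.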