Sentiment analysis has become a very popular big data application in recent years. Consider, for example, the hundreds of millions of tweets posted every day. As a result, organizations can listen to the voice of their consumers like never before. Manufacturers of consumer goods, from electronics to food products, are able to measure the popularity of their brands versus those of their competitors by simply counting the number of positive versus negative comments that they find on forums like Twitter, Facebook, email, wherever. They even convert the voice recordings from their call centers to text, and then measure the amount of positive sentiment versus negative. Well, most often the sentiment is negative, because people don't say many positive things about brands; but when they do, that probably means the brand is doing very well. So you can build that bias in, but this has become an extremely popular application in the big data world, and let's see how it works using Bayesian machine learning. Think about a few comments that all of you might have been posting on the forum. Now, I haven't used real comments; I made these up, obviously. There aren't that many comments on the forum; I wish there were. Be that as it may, think about comments like this one; it's a positive comment.
There may be lots of these, but there may be some which are very negative, and so on and so forth. Suppose we were able to manually label comments as being positive or negative. For this we obviously use human intuition, and not some automated technique. This is called the training phase. After that, could we figure out, using Naive Bayes, whether a new comment is positive or negative? Let's see how to do that. First, we need to compute the a priori probability, that is, the overall chance that a comment is positive, which is simply the number of positive comments divided by the total number of comments, which is 8,000 in this case, 6,000 of them being positive. Similarly, the a priori probability of a comment being negative is the number of negative comments divided by the total. Then we need to compute the likelihoods. For example, the probability that the word "like" occurs within the positive comments: of the 6,000 positive comments, "like" occurs in only 2,000 of them, so we get this likelihood. The probability that "enjoy" occurs amongst the positive comments can be computed similarly. Now, notice that there are no comments which are positive but include the word "hate". We can't put zero for this, because that would make the formula go completely out of whack.
So we replace those zeros by a low number, like one. This is called smoothing in the Naive Bayes classifier, and it is important because we have to include all the likelihood probabilities. Now we do the same thing for the negative comments. The probability that "hate" occurs amongst negative comments is 800 out of the 2,000 negative comments, because these comments contain "hate". The probability that "war" occurs is one; "like" occurs, again by smoothing, with probability 1/2,000; and so on. Note also that the probability that "enjoy", which might be thought of as a positive word, occurs amongst the negative comments is not that small. I will come to this phenomenon later. The reason is that "enjoy" is occurring along with a negative term, and similarly with other such words. For the moment, we factor in all the likelihood probabilities simply by looking at the words and their occurrences, and we'll worry about things like "not" next week, or rather the week afterwards. We're only going to consider the words marked in bold, and for these words the likelihood probabilities look like this. So we can compute the probability that "like" occurs amongst all the positive comments, and so on, and the same thing is done for the negative comments. You can work this out for yourself; it's a good exercise to do.
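The training-phase arithmetic described above can be sketched in a few lines of Python. Only the counts actually stated in the lecture are used here (8,000 labeled comments, 6,000 positive and 2,000 negative; "like" in 2,000 positive comments; "hate" in 800 negative comments); the smoothing value of one follows the lecture's choice.

```python
# Training-phase sketch: a priori probabilities and smoothed likelihoods,
# using the lecture's made-up counts.

pos_total, neg_total = 6000, 2000
total = pos_total + neg_total

# A priori probabilities.
p_pos = pos_total / total          # 6000/8000 = 0.75
p_neg = neg_total / total          # 2000/8000 = 0.25

def likelihood(count, class_total):
    """Smoothed likelihood: a zero count is replaced by a low number (1),
    so no probability in the product is ever exactly zero."""
    return max(count, 1) / class_total

# 'like' occurs in 2,000 of the 6,000 positive comments.
p_like_given_pos = likelihood(2000, pos_total)   # 1/3
# 'hate' never occurs in a positive comment: smoothed to 1/6000.
p_hate_given_pos = likelihood(0, pos_total)      # 1/6000
# 'hate' occurs in 800 of the 2,000 negative comments.
p_hate_given_neg = likelihood(800, neg_total)    # 0.4
# 'like' never occurs in a negative comment: smoothed to 1/2000.
p_like_given_neg = likelihood(0, neg_total)      # 1/2000
```

Replacing a zero count with one rather than zero is the simplest form of smoothing; without it, a single unseen word would force the whole product of likelihoods to zero.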
Now, faced with a new tweet, say, "I really liked this simple course a lot", something we haven't seen before, we can compute the likelihood ratio. The numerator includes the probability of "like" occurring, because "like" happens to occur in this tweet, given that the tweet is positive, and so on; everything in the numerator is conditioned on the tweet being positive. And for "hate", which does not occur, we compute the probability that "hate" does not occur, given that the sentiment is positive, by taking one minus the likelihood. Clearly "hate" can either be there or not be there, so if it's not there, the probability of not-"hate" given positive is one minus the probability of "hate" given positive. We include every possible word amongst the bold words that we have considered; even for those which don't occur in this tweet, we include their probabilities by taking one minus. Lastly, we multiply by the a priori probability. Similarly for the denominator. We get a likelihood ratio of 0.026 over a very small number, 0.00005, which is very much larger than one. So the system can easily label this tweet as being positive without ever having seen it before. This is an example of a machine having learned to identify which tweets are positive and which are negative, based on historical data, using the Naive Bayes classifier.
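The classification step can be sketched as follows. For brevity the vocabulary here is cut down to just two of the bold words, "like" and "hate", so the resulting ratio is illustrative and differs from the 0.026/0.00005 quoted above; mapping "liked" in the tweet to "like" (simple stemming) is also an assumption, not something the lecture specifies.

```python
# Classification sketch: score a new tweet under each class by multiplying,
# for every vocabulary word, the likelihood of its presence (p) or absence
# (1 - p), and finally the class prior; then take the ratio.

def posterior_score(words_present, vocab, p_word_given_class, prior):
    """Naive Bayes numerator for one class."""
    score = prior
    for w in vocab:
        p = p_word_given_class[w]
        score *= p if w in words_present else (1 - p)
    return score

vocab = ["like", "hate"]                     # the words marked in bold
p_given_pos = {"like": 2000 / 6000, "hate": 1 / 6000}
p_given_neg = {"like": 1 / 2000, "hate": 800 / 2000}

# "I really liked this simple course a lot" -- assume simple stemming
# maps "liked" to "like" before lookup.
tweet_words = {"i", "really", "like", "this", "simple", "course", "a", "lot"}

num = posterior_score(tweet_words, vocab, p_given_pos, prior=6000 / 8000)
den = posterior_score(tweet_words, vocab, p_given_neg, prior=2000 / 8000)

ratio = num / den                            # much larger than one
label = "positive" if ratio > 1 else "negative"
```

Because "like" is common among positive comments and "hate" is absent from the tweet, the numerator dwarfs the denominator and the tweet is labeled positive, mirroring the lecture's conclusion.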