So it appears that mutual information tells us which features - words, in our earlier example - are good predictors of the behavior we want to predict. Shouldn't we simply use those with the highest mutual information as our features?

The trouble is that the actual mutual information from the formula is very difficult to compute exhaustively; there are just too many possibilities when the number of features is large. So in practice we use proxies. A good proxy that we've seen earlier is the inverse document frequency. There are other techniques - AdaBoost, in particular, is an important algorithm - but we won't go into those in detail in this course. For the moment, think of using words with high inverse document frequency as a proxy for words which are likely to be good features.

Another question we might ask is: are more features always good? The y-axis here is measuring the error in the classification, that is, how often naive Bayes gets the wrong answer. This error improves as we add features, but after some time it starts to degrade again. Why might this be happening? What do you think? Perhaps we are using the wrong features to start with. It turns out that that's not the whole story either. In this example the features having the lowest mutual information - or information gain, which is another term for the same idea - are the ones used first, and the good features come later; still the classifier goes awry.

So what's going on? Can you guess? Remember that there is a reason why naive Bayes is called naive. It doesn't like redundant features: it assumes that features are independent. It likes features to have very small mutual information amongst themselves. The trouble is that that's not always the case, and this is one reason why the technique can fail. It gets confused because it assumes that features are independent.

In principle, one should be able to compute the best features, either by computing the mutual information directly or by using a proxy, while also somehow figuring out which features are dependent, and then choose those best features which are also mutually independent. Many, many machine learning techniques do exactly this. We don't have time to go into those techniques in detail, but the idea should be clear by now.
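To make the idea of ranking features by mutual information concrete, here is a minimal sketch in Python. The tiny corpus, the labels, and the function names are invented for illustration; they are not from the lecture. It estimates the mutual information between each word's presence and the class label from simple counts, which is the kind of computation that quickly becomes expensive if done exhaustively over combinations of many features.

```python
# A minimal sketch of mutual-information-based feature scoring for words.
# The toy corpus below is made up for illustration; in practice you would
# compute this over a real labelled collection of comments.
import math
from collections import Counter

corpus = [
    ("great product works well", 1),     # 1 = positive
    ("terrible product broke fast", 0),  # 0 = negative
    ("works great love it", 1),
    ("broke after a day terrible", 0),
]

def mutual_information(word, docs):
    """Estimate I(word present; class) from simple counts over labelled docs."""
    n = len(docs)
    joint = Counter()                    # (word present?, class) -> count
    for text, label in docs:
        present = int(word in text.split())
        joint[(present, label)] += 1
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        p_x = sum(v for (xx, _), v in joint.items() if xx == x) / n
        p_y = sum(v for (_, yy), v in joint.items() if yy == y) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

# Rank every word in the vocabulary by its mutual information with the label.
vocabulary = {w for text, _ in corpus for w in text.split()}
scores = sorted(((mutual_information(w, corpus), w) for w in vocabulary), reverse=True)
for score, word in scores[:5]:
    print(f"{word:10s} {score:.3f}")
```

Words that appear only in one class, such as 'great' or 'broke' in this toy data, come out with the highest scores, which is what makes them attractive as features.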
Let's return now to looking at machine learning from the perspective of information theory. We have a machine learning algorithm which takes a sequence of observations, such as comments, and classifies them as positive or negative, in a manner that maximizes the mutual information between the actual classifications of the observations and the ones that the algorithm manages to predict.

Relate this to what Shannon defined as the capacity of a communication channel. In his case this was an actual communication channel, if you remember, like a telephone channel or a radio channel. He was worried about how fast you could transmit information on such a channel. So he defined the capacity of a channel as the maximum information that could be transferred between the sender and the receiver per second. The element of speed comes in when you talk about capacity.

What does this mean in the context of machine learning? Is there an equivalent notion of capacity? How fast can a machine learning algorithm actually learn? And what does it mean to be fast? It turns out that there has been a lot of work on the theory of machine learning. The pioneer here is Leslie Valiant, who won the Turing Award in 2011. Other important papers defined something called the VC dimension, using which it was shown in this paper that a Bayesian classifier will eventually learn any concept - any distinction between plus and minus, yin and yang. The trouble is that it need not do so fast. What does that mean? How many training examples does a classifier require to learn a concept? That is the equivalent of speed in the world of machine learning. And how fast depends on the concept itself: the VC dimension, or Vapnik-Chervonenkis dimension, of a concept can be measured, and using it this paper showed that Bayesian learning can eventually learn any concept, with the speed depending on the VC dimension of the concept.
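To make that notion of speed a bit more concrete, here is one standard form of the sample-complexity bound from PAC learning theory. This is an illustrative statement of the general result, not necessarily the exact form in the paper shown on the slide; constants and logarithm bases vary across presentations.

```latex
% Number of training examples m sufficient to learn a concept class of
% VC dimension d to within error \epsilon, with probability at least 1 - \delta:
m \;=\; O\!\left( \frac{1}{\epsilon} \left( d \,\ln\frac{1}{\epsilon} + \ln\frac{1}{\delta} \right) \right)
```

Read this way, a concept with larger VC dimension d needs more training examples to reach the same accuracy, which is exactly the sense in which how fast a classifier can learn depends on the concept itself.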
Well, that's all we're going to do regarding machine learning theory for the moment. Let's return now to the question of whether sentiment analysis is actually measuring an opinion about a product, a course, or anything else. Remember, there are hundreds of millions of tweets a day; we can listen to the voice of the consumer like never before, and we can figure out the sentiments, just as we've discussed in our example.

But how do we figure out what consumers are saying or complaining about, not just whether or not they are complaining? What is the object of their complaint, or for that matter of their request or demand?

Consider a comment such as, 'Book me an American flight to New York.' What does the word American mean? Does it mean the airline, American Airlines? Or does it mean the nationality of the airline, so that any airline of American origin will do? Obviously this is an ambiguous sentence, and language is full of such vagueness and ambiguities. Suppose the writer also said, 'I hate British food.' Maybe the guess is now that it's probably American Airlines, because British Airways is another airline and perhaps they're talking about the food on British Airways. But suppose the comment was, 'I hate English food.' Suddenly you change your decision, and now you think he means any American carrier, not just American Airlines, because American versus English clearly suggests he is talking about nationality, whereas American versus British suggests he is more likely talking about the carriers themselves.

Consider the sentence 'I only eat Kellogg's cereals' versus 'Only I eat Kellogg's cereals.' Two very different things. What can you say about this household's breakfast stockpile? Clearly, in the first case it's possibly saying that that person really wants to eat only Kellogg's. In the second case he's saying that maybe he wants to eat Kellogg's, but the rest of his family just doesn't like it. Two very different meanings.

'Took the new car on a terribly bumpy road. It did well though.' Is this family happy with their new car? Just looking at sentiment, it has negative words - terribly, bumpy. It does have the positive word well, but would the Bayesian classifier guess correctly whether this is a positive or a negative comment? Probably not.

The point we're trying to get at is this: is Bayesian learning using a bag of words - the features being just the words themselves - enough? And more deeply, we're trying to ask the question raised by Richard Montague and Noam Chomsky: how do we actually discern the meaning of a sentence, versus just classifying it as positive or negative, good or bad, yin or yang?
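To see concretely how a pure bag-of-words view can stumble on the car example, here is a minimal sketch. The polarity word lists and the scoring rule are invented for illustration; a real system would learn word weights from labelled data, as naive Bayes does, but it would face the same problem of ignoring word order and structure.

```python
# A toy bag-of-words sentiment scorer. The word lists below are assumptions
# made for this sketch, not part of the lecture or any real lexicon.
POSITIVE = {"well", "great", "good", "love", "happy"}
NEGATIVE = {"terribly", "bumpy", "bad", "hate", "broke"}

def bag_of_words_sentiment(text):
    """Count polarity words, ignoring order and the concessive 'though'."""
    words = text.lower().replace(".", "").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative"

comment = "Took the new car on a terribly bumpy road. It did well though."
print(bag_of_words_sentiment(comment))  # prints 'negative': the two negative
                                        # words outweigh the single 'well',
                                        # even though the writer is pleased.
```

The representation throws away exactly the structure - the 'It did well though' - that signals the overall opinion is positive, which is the gap between merely classifying a sentence and actually discerning its meaning.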