So, if mutual information tells us which features (words, in our earlier example) are good predictors of the behavior we want to predict, should we not simply use those with the highest mutual information as our features? The trouble is that the actual mutual information from the formula is very difficult to compute exhaustively: with a large number of features there are simply too many possibilities. So in practice we use proxies. A good proxy that we've seen earlier is inverse document frequency. There are other techniques as well; AdaBoost, in particular, is an important algorithm, but we won't go into it in detail in this course. For the moment, think of words with high inverse document frequency as a proxy for words that are likely to be good features. Another question we might ask is: are more features always good? The Y axis here measures the classification error, that is, how often naive Bayes gets the wrong answer. This error improves as we add features, but after some point it starts to degrade again. Why might this be happening? What do you think? Perhaps we are using the wrong features to start with. It turns out that that's not the whole story either. In this example, the features with the lowest mutual information (or information gain, which is another term for the same idea) are used first, and the good features come later; still, the classifier goes awry. So what's going on? Can you guess? Remember that there is a reason why naive Bayes is called naive: it assumes that features are independent. It doesn't like redundant features; it likes features that have very small mutual information amongst themselves. The trouble is that's not always the case, and this is one reason why the technique can fail: it gets confused because it assumes that features are independent.
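To make the idea concrete, here is a minimal sketch (the function name and the toy labeled comments are invented for illustration) that computes the mutual information between the presence or absence of each word and the class label, then ranks the vocabulary by it, the way a feature selector would:

```python
import math

# Toy labeled comments (invented for illustration).
docs = [
    ("great course loved it", "+"),
    ("great lectures great fun", "+"),
    ("loved the examples", "+"),
    ("terrible audio hated it", "-"),
    ("hated the pace terrible", "-"),
    ("boring and terrible", "-"),
]

def mutual_information(word, docs):
    """I(W; C), where W = word present/absent and C = class label."""
    n = len(docs)
    mi = 0.0
    for w in (True, False):
        for c in ("+", "-"):
            joint = sum(1 for text, label in docs
                        if (word in text.split()) == w and label == c) / n
            p_w = sum(1 for text, _ in docs if (word in text.split()) == w) / n
            p_c = sum(1 for _, label in docs if label == c) / n
            if joint > 0:  # 0 * log 0 is taken as 0
                mi += joint * math.log2(joint / (p_w * p_c))
    return mi

vocab = {word for text, _ in docs for word in text.split()}
ranked = sorted(vocab, key=lambda w: mutual_information(w, docs), reverse=True)
print(ranked[0])  # the single most informative word
```

On this toy data, "terrible" (which appears in every negative comment and no positive one) tops the ranking, while words like "the" that appear in both classes score near zero. The exhaustive version of this computation over all subsets of features is what becomes intractable, which is why proxies such as inverse document frequency are used instead.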
Well, in principle, one should be able to compute the best features, either by computing the mutual information directly or by using a proxy, as well as somehow figuring out which features are dependent, and then choose those best features which are also independent of each other. Many, many machine learning techniques do exactly this. We don't have time to go into the techniques in detail, but the idea should be clear by now. Let's return now to looking at machine learning from the perspective of information theory. We have a machine learning algorithm which takes a sequence of observations, such as comments, and classifies them as positive or negative, in a manner so as to maximize the mutual information between the actual classifications of the observations and the ones that the algorithm manages to predict. Let's relate this to what Shannon defined as the capacity of a communication channel. In his case, this was an actual communication channel, if you remember, like a telephone channel or a radio channel. He was worried about how fast one could transmit information on such a channel, so he defined the capacity of a channel as the maximum information that could be transferred between sender and receiver per second. The element of speed comes in when you talk about capacity. What does this mean in the context of machine learning? Is there an equivalent notion of capacity? How fast can a machine learning algorithm actually learn? And what does it mean to be fast? It turns out that there has been a lot of work on the theory of machine learning. The pioneer here is Leslie Valiant, who won the 2010 Turing Award. Other important papers defined something called the VC dimension, using which it was shown that the Bayesian classifier will eventually learn any concept, that is, any distinction between plus and minus, yin and yang. The trouble is it need not learn fast. What does that mean? How many training examples does a classifier require to learn a concept?
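In symbols, Shannon's definition of capacity is the mutual information between the channel input X and the channel output Y, maximized over all distributions of the input (per channel use; multiplying by the number of uses per second gives a rate in bits per second):

```latex
C \;=\; \max_{p(x)} \, I(X; Y)
```

The parallel with the machine learning setting above is that learning also tries to maximize a mutual information, between the true classifications and the predicted ones.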
That's the equivalent of speed in the world of machine learning. And how fast depends on the concept itself: the VC dimension, or Vapnik-Chervonenkis dimension, of a concept can be measured, and using it, this paper showed that Bayesian learning can eventually learn any concept, with a speed that depends on the VC dimension of the concept. Well, that's all we're going to do regarding machine learning theory for the moment. Let's return now to the question of whether sentiment analysis is actually measuring an opinion about a product, a course, or anything else. Remember, there are hundreds of millions of tweets a day; we can listen to the voice of the consumer like never before and figure out their sentiments, just as we've discussed in our example. But how do we figure out what consumers are saying or complaining about, not just whether or not they are complaining? What is the object of their complaint, or for that matter their request or demand? Consider a comment such as, 'Book me an American flight to New York.' What does the word American mean? Does it mean the airline, American Airlines? Or does it mean the nationality of the airline, so that any airline of American origin will do? Obviously, this is an ambiguous sentence, and language is full of such vagueness and ambiguity. Suppose the writer also said, 'I hate British food.' Now the guess is probably American Airlines, because British Airways is another airline, and maybe they're talking about the food on British Airways. But suppose the comment was, 'I hate English food.' Suddenly you change your decision: now the writer is thinking of any American carrier, not just American Airlines, because 'American versus English' clearly signals that he is talking about nationality, whereas 'American versus British' makes it more likely that he is talking about the carriers themselves.
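To give a flavor of how the VC dimension governs speed, here is a standard sample-complexity bound from PAC learning theory (stated from memory as the typical textbook form, not quoted from the paper discussed above): to learn a concept from a class of VC dimension d to within error ε, with probability at least 1 - δ, it suffices to see on the order of

```latex
m \;=\; O\!\left(\frac{1}{\varepsilon}\left(d \,\log\frac{1}{\varepsilon} \;+\; \log\frac{1}{\delta}\right)\right)
```

training examples. The larger the VC dimension d, the more examples are needed, which is exactly the sense in which a concept with high VC dimension is slower to learn.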
Consider this sentence, 'I only eat Kellogg's cereals,' versus 'Only I eat Kellogg's cereals.' Two very different things. What can you say about this household's breakfast stockpile? Clearly, in the first case the person is saying that he really wants to eat only Kellogg's. In the second case, he's saying that maybe he wants to eat Kellogg's, but the rest of his family just doesn't like it. Two very different meanings. Or take, 'Took the new car on a terribly bumpy road. It did well though.' Is this family happy with their new car? Just looking at sentiment, the comment has two negative words, terribly and bumpy, and only one positive word, well. Would the Bayesian classifier correctly guess that this is a positive comment? Probably not. The point we're trying to get at is this: is Bayesian learning using a bag of words, with the features just being the words themselves, enough? And more deeply, we're asking the question of Richard Montague and Noam Chomsky: how do we actually discern the meaning of a sentence, versus merely classifying it as positive or negative, good or bad, yin or
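To see why the car comment above defeats a bag-of-words approach, here is a minimal sketch (the tiny sentiment lexicon and the scoring rule are invented for illustration; a trained naive Bayes would weight words by learned probabilities rather than counts, but the failure mode is the same) that scores a comment purely by counting positive and negative words, ignoring all sentence structure:

```python
# Toy sentiment lexicon (invented for illustration).
POSITIVE = {"well", "good", "great", "happy"}
NEGATIVE = {"terribly", "bumpy", "bad", "hate"}

def bag_of_words_sentiment(text):
    """Score = (# positive words) - (# negative words); word order is ignored."""
    words = text.lower().replace(".", "").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative"

comment = "Took the new car on a terribly bumpy road. It did well though."
print(bag_of_words_sentiment(comment))  # two negative words outvote one positive
```

The comment is really praise for the car, but because 'terribly' and 'bumpy' outnumber 'well', any classifier that sees only a bag of words will call it negative. Recovering the intended meaning requires structure, which is exactly the gap this lecture is pointing at.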