Well, as you might have guessed, we are going to return to information theory and look at machine learning from that perspective. We're transmitting signals consisting of features (words or other features, for example) and we are using those to predict the values of some behavior (browsers versus buyers, etc.), and our goal is to improve the mutual information between these two signals via a machine learning algorithm.

Now, what you are all probably waiting for is the actual definition of mutual information. The mutual information between a feature F (a word, for example) and a behavior B (browser or buyer, for example) is defined formally as a double sum over all values f of the feature and all values b of the behavior:

I(F; B) = sum over f and b of P(f, b) * log[ P(f, b) / (P(f) * P(b)) ].

So, in our case, there would be four terms in the sum: feature present or absent, combined with behavior browser or buyer. In each case you compute the joint probability of that feature value and behavior value and multiply it by this log ratio.

To understand the ratio better, imagine that the feature and the behavior are independent. If that's the case, then the probability of the feature and the behavior occurring together is nothing but the product of the probability of the feature times the probability of the behavior. So the ratio becomes one, the logarithm becomes zero, and so does the mutual information. And if the feature and the behavior are independent, then obviously it's hopeless to try to predict the values of the behavior from the values of the feature.

We'll do an example in a minute to actually compute the mutual information, but before that, a little history about mutual information. What Shannon was really trying to do was measure the information content of various signals. There is a formal definition of information content, called the entropy, which we haven't defined and are not going to define here; but believe me, there is one for the feature signal, and similarly there is an information content for the behavior signal, and an information content for the signal consisting of both observations, the feature and the behavior combined. The mutual information is nothing but the difference between the total information in F and B taken separately and the information in F and B when observed together. We won't go into the intuition behind this too much, except to note that the information content of two variables observed together can never exceed the total information in the two variables taken separately. As a result, the mutual information is always non-negative. Just remember this: if during any of your calculations you get a negative value of mutual information, you've made a mistake.
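Before the worked example, here is a minimal sketch of the definition in code. The function name, the dictionary layout for the 2x2 table, and the use of base-2 logarithms are my own choices (the lecture does not specify a base); it also checks the independence case, where the mutual information comes out to exactly zero.

```python
import math

def mutual_information(joint):
    """Mutual information (in bits) between two discrete variables.

    `joint` maps (feature_value, behavior_value) pairs to joint
    probabilities that sum to 1.
    """
    # Marginal probabilities of the feature and of the behavior.
    p_feature = {}
    p_behavior = {}
    for (f, b), p in joint.items():
        p_feature[f] = p_feature.get(f, 0.0) + p
        p_behavior[b] = p_behavior.get(b, 0.0) + p

    # Double sum over all feature/behavior value combinations.
    mi = 0.0
    for (f, b), p in joint.items():
        if p > 0:
            mi += p * math.log2(p / (p_feature[f] * p_behavior[b]))
    return mi

# Independent feature and behavior: every joint probability equals the
# product of the marginals, so every log term is zero and MI is 0.
independent = {("w", "+"): 0.375, ("w", "-"): 0.125,
               ("no-w", "+"): 0.375, ("no-w", "-"): 0.125}
print(mutual_information(independent))  # 0.0
```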
Now let's do an example. We'll take the same set of comments that we used earlier and compute the mutual information between a word such as "hate" and the sentiment, positive or negative. The probability that a comment is positive is just the fraction of positive comments, which is 6000/8000, and similarly 2000/8000 for negative comments. The probability that "hate" occurs in a comment is just the number of comments containing "hate" divided by the total of 8000 comments, and similarly for the probability that "hate" does not occur.

The joint probabilities are a little more tricky; they are different from the conditional probabilities that we had earlier. For example, the joint probability that "hate" does not occur and the comment is positive is all of the positive comments, that is, six thousand of them, divided by the total number of comments, because this is a joint probability rather than a conditional one. The probability that "hate" occurs in a positive comment, on the other hand, is zero in our data, so we smooth it by just making the value 1/8000. Similarly we can compute the joint probability of not-"hate" with a negative comment and of "hate" with a negative comment. The mutual information between the word "hate" and the sentiment, + or -, is obtained by plugging into the formula, and you have four terms: hate with +, not-hate with +, hate with -, and not-hate with -. The result is that the mutual information between "hate" and the sentiment is 0.22.

Well, is this good or bad? Let's check another word, the word "course". It occurs in all the comments, so the probability of "course" is just one, and the probability of not-"course" is actually zero, but we smooth it by making it 1/8000. The joint probability of "course" with positive is just the fraction of positive comments, because "course" occurs in all positive comments; in fact, it occurs in all comments. Similarly, the joint probability of "course" with negative is just the probability of a negative comment, because it occurs in all comments. And not-"course" doesn't occur, so again we smooth those terms. The resulting mutual information is 0.003. So what this is saying is that a word like "course", which occurs everywhere, is not able to tell me anything about whether a comment is positive or negative. Quite obvious.

But let's change the problem slightly. Let's now look at the case where these two comments don't actually have the word "course"; we have just reworded them a bit. For "course" we now have different values, because "course" occurs only in some of the comments, and not-"course" occurs in these 1,400 comments. The joint probability of "course" with positive also changes: it's no longer all the positive comments, it's just the 5,000 out of 6,000 positive comments that have "course", because you're removing these 1,000 comments that are positive and don't have "course". And the joint probability that not-"course" occurs in a positive comment comes from exactly those 1,000. So you get these values. Now the mutual information between "course" and the sentiment is a bit bigger than before, but still much, much smaller than 0.22. What this tells us is that "course" is still a poor determiner of whether a comment is positive or negative, something which is intuitively obvious to us. What's interesting is that, using mutual information, a computer can determine such facts from examining vast volumes of data.
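As a rough sketch of that last point, here is how a computer might estimate these scores directly from labeled comments. The function name, the tiny dataset, and the base-2 logarithm are my own choices rather than the lecture's, and any zero count is smoothed to 1 in the spirit of the 1/8000 adjustment above.

```python
import math

def word_sentiment_mi(comments, word):
    """Estimate the mutual information (in bits) between the presence of
    `word` and a comment's label, from a list of (text, label) pairs."""
    total = len(comments)
    labels = ("+", "-")
    # Joint counts over (word present?, label).
    joint = {(present, lbl): 0 for present in (True, False) for lbl in labels}
    for text, label in comments:
        joint[(word in text.lower().split(), label)] += 1

    def prob(count):
        return max(count, 1) / total  # smoothing: a zero count becomes 1

    mi = 0.0
    for (present, label), count in joint.items():
        p_joint = prob(count)
        p_word = prob(sum(joint[(present, lbl)] for lbl in labels))
        p_label = prob(sum(joint[(pres, label)] for pres in (True, False)))
        mi += p_joint * math.log2(p_joint / (p_word * p_label))
    return mi

# A tiny illustrative dataset (not the lecture's 8,000 comments).
comments = [
    ("i hate this course", "-"),
    ("hate this course", "-"),
    ("love this course", "+"),
    ("great course", "+"),
    ("this course is fine", "+"),
    ("enjoyed the course a lot", "+"),
]
print(round(word_sentiment_mi(comments, "hate"), 2))    # higher: only in negative comments
print(round(word_sentiment_mi(comments, "course"), 2))  # lower: occurs in every comment
```

On a dataset this small the smoothing is a large correction, so "course" does not come out as close to zero as it does over 8,000 comments, but the ordering of the two words already tells the same story.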