Well, as you might have guessed, we are going to return to information theory and look at machine learning from that perspective. We're transmitting signals consisting of features - words or other features, for example - and we are using those to predict the values of some behavior - browsers versus buyers, etc. And our goal is to improve the mutual information between these two signals via a machine learning algorithm.

Now, what you are all probably waiting for is the actual definition of mutual information. The mutual information between a feature F - a word, for example - and a behavior B - browser or buyer, for example - is defined formally as

    I(F; B) = sum over f, b of  P(f, b) * log[ P(f, b) / (P(f) * P(b)) ]

where f and b range over the particular values of F and B. The sum is a double summation over all values of the feature and all possible behaviors. So, in our case, there would be four terms in the sum: feature present or absent, crossed with behavior browser or buyer. And in each case you compute the joint probability of the feature and the behavior, multiplied by this log ratio.

Now, to understand this ratio better, imagine that the feature and the behavior are independent. If that's the case, then the probability of the feature and the behavior together is nothing but the product of the probability of the feature times the probability of the behavior. So this ratio becomes one, the logarithm becomes zero, and so does the mutual information. But if the feature and the behavior are independent, then obviously it's hopeless to try to predict the values of the behavior from the values of the feature.

We'll do an example in a minute to actually compute the mutual information. But before that, a little history about mutual information. What Shannon was really trying to do was measure the information content of various signals. There's a formal definition of information content called the entropy, which we again haven't defined, and we're not going to do that here. But still, believe me, there is one - call it H(F) for the feature signal - and similarly there will be an information content H(B) for the behavior signal. There will also be an information content H(F, B) for the signal consisting of both observations - the feature and the behavior combined. The mutual information is nothing but the difference between the total information in F and B separately and the information of F and B when observed together:

    I(F; B) = H(F) + H(B) - H(F, B)
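To make the definition concrete, here is a minimal Python sketch of that double summation, written directly from the formula above. The function name and the toy numbers are mine, and the log is taken base 2 here, so the result is in bits; the lecture doesn't pin down a base.

    import math

    def mutual_information(joint):
        """Compute I(F; B) from a table of joint probabilities.

        joint[f][b] holds P(f, b); the marginals P(f) and P(b) are
        recovered by summing the rows and columns of the table.
        """
        # Marginal probability of each feature value (row sums).
        p_f = {f: sum(row.values()) for f, row in joint.items()}
        # Marginal probability of each behavior value (column sums).
        p_b = {}
        for row in joint.values():
            for b, p in row.items():
                p_b[b] = p_b.get(b, 0.0) + p
        # The double summation over all feature and behavior values.
        mi = 0.0
        for f, row in joint.items():
            for b, p_fb in row.items():
                if p_fb > 0:
                    mi += p_fb * math.log2(p_fb / (p_f[f] * p_b[b]))
        return mi

    # Sanity check: an independent feature and behavior. Every ratio
    # P(f, b) / (P(f) * P(b)) is exactly one, so I(F; B) comes out zero.
    independent = {
        "present": {"browser": 0.375, "buyer": 0.125},
        "absent":  {"browser": 0.375, "buyer": 0.125},
    }
    print(mutual_information(independent))  # 0.0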
We won't go into the intuition behind this identity too much, except to note that the information content of two variables observed together can never exceed the total information in both variables taken separately; that is, H(F, B) <= H(F) + H(B). As a result, the mutual information can never be negative. Just remember this: if during any of your calculations you get a negative value of mutual information, you've made a mistake.

Now let's do an example. We'll take the same set of comments that we used earlier and compute the mutual information between a word such as "hate" and the sentiment, positive or negative. The probability that a comment is positive is just the fraction of positive comments, which is 6,000/8,000, and similarly 2,000/8,000 for negative comments. The probability that "hate" occurs in a comment is just the number of comments containing "hate" out of the total of 8,000 comments, and similarly for the probability that "hate" does not occur.

The joint probabilities are a little more tricky. They're a little different from the conditional probabilities that we had earlier. For example, the joint probability that "hate" does not occur and the comment is positive is all the positive comments - that is, six thousand of them - divided by the total number of comments, because this is the joint probability rather than the conditional one. The probability that "hate" occurs in a positive comment, however, would be zero, since "hate" never occurs in a positive comment, so we have to smooth it by just making the value 1/8,000. And similarly we can compute the probability of not-"hate" in a negative comment and the probability of "hate" in a negative comment.

The mutual information between the word "hate" and the sentiment, + or -, is obtained by plugging into the formula. You have four terms in the formula: one for (hate, +), one for (not hate, +), one for (hate, -), and one for (not hate, -). And the result is that the mutual information between "hate" and sentiment is .22. Well, is this good or bad?
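Here is how those smoothed joint probabilities can be built from raw counts, reusing the mutual_information function from the sketch above. Two caveats: the transcript doesn't show how the 2,000 negative comments divide between containing "hate" and not, so the 1,000/1,000 split below is an assumption for illustration; and I renormalize after smoothing (a small departure from the lecture's quick 1/8,000 hack), which guarantees the total can never come out negative.

    def smoothed_joint(counts):
        """Turn raw co-occurrence counts into a joint probability table.
        Zero counts are bumped up to 1 (the lecture's smoothing trick),
        then everything is renormalized so the table sums to one."""
        bumped = {f: {b: max(c, 1) for b, c in row.items()}
                  for f, row in counts.items()}
        z = sum(sum(row.values()) for row in bumped.values())
        return {f: {b: c / z for b, c in row.items()}
                for f, row in bumped.items()}

    # 8,000 comments in all: "hate" never occurs in the 6,000 positive
    # ones, so that cell is a zero count that gets smoothed. The split
    # of the 2,000 negative comments is assumed, not from the slide.
    hate_joint = smoothed_joint({
        "hate":     {"positive": 0,    "negative": 1000},
        "not hate": {"positive": 6000, "negative": 1000},
    })
    print(mutual_information(hate_joint))  # ~0.29 bits with assumed counts

With these assumed counts the result is roughly 0.29 bits; the lecture's exact .22 depends on the slide's actual counts and its log base. Notice also that some individual terms of the sum come out negative (the (not hate, -) term does here); it is only the total that can never be negative.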
Let's check another word: the word "course". It occurs in all the comments, so the probability of "course" is just one, and the probability of not-"course" is actually zero, but we smooth it by making it 1/8,000. The joint probability of "course" and positive is just the fraction of positive comments, because "course" occurs in all positive comments - in fact, it occurs in all comments. And similarly, the joint probability of "course" and negative is just the probability of negative comments, because it occurs in all comments. And not-"course" doesn't occur, so again we smooth this. The resulting mutual information is .003. So what this is saying is that a word like "course", which occurs everywhere, is not able to tell me anything about whether a sentiment is positive or negative. Quite obvious.

But let's change the problem slightly. Let's now look at the case where these two comments don't actually have the word "course"; we have just reworded them a bit. For "course", we now have different values, because "course" occurs in only some of the comments, and not-"course" occurs in these 1,400 comments. The joint probability of "course" and positive again changes: it's no longer all the positive comments, it's just the 5,000 out of 6,000 positive comments that have "course", because you're removing these 1,000 comments that are positive and don't have "course". And the probability that not-"course" occurs in a positive comment comes from exactly these 1,000 comments. So you get these values.

Now the value of the mutual information between "course" and sentiment is a bit bigger than before, but still much, much smaller than .22. What this tells us is that "course" is still a poor determiner of whether a comment is positive or negative - something which is intuitively obvious to us. What's interesting is that, using mutual information, a computer can determine such facts from examining vast volumes of data.
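Plugging both "course" scenarios into the same helpers makes the comparison concrete. The 5,000/1,000 positive-side split comes from the lecture; the 1,600/400 negative-side split is my reconstruction from the 1,400 total (1,400 minus the 1,000 positive ones), so the printed values are indicative rather than the slide's exact figures.

    # Original scenario: "course" occurs in all 8,000 comments, so both
    # "not course" cells are zero counts that get smoothed.
    course_everywhere = smoothed_joint({
        "course":     {"positive": 6000, "negative": 2000},
        "not course": {"positive": 0,    "negative": 0},
    })

    # Reworded scenario: 1,400 comments lack "course" - 1,000 positive
    # (from the lecture) plus, by subtraction, 400 negative (assumed).
    course_reworded = smoothed_joint({
        "course":     {"positive": 5000, "negative": 1600},
        "not course": {"positive": 1000, "negative": 400},
    })

    print(mutual_information(course_everywhere))  # essentially zero
    print(mutual_information(course_reworded))    # ~0.001 bits, still tiny

Either way the mutual information stays two orders of magnitude below the 0.22 we got for "hate", which is the lecture's point: a word that occurs almost everywhere carries almost no information about sentiment.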