Well, as you might have guessed, we are going to return to information theory and look at machine learning from that perspective. We're transmitting signals consisting of features - words or other features, for example - and we are using those to predict the values of some behavior - browsers versus buyers, etc. And our goal is to improve the mutual information between these two signals via a machine learning algorithm.

Now, what you are all probably waiting for is the actual definition of mutual information. The mutual information between a feature F - a word, for example - and a behavior B - browser or buyer, for example - is defined formally as

    I(F; B) = sum over f, b of  P(f, b) * log[ P(f, b) / (P(f) * P(b)) ]

where f and b range over the particular values of F and B. The sum is a double summation over all values of the feature and all possible behaviors. So, in our case, there would be four terms in the sum: feature present or absent, crossed with behavior browser or buyer. And in each case you compute the joint probability of the feature and the behavior, multiplied by this log ratio.

Now, to understand this ratio better, imagine that the feature and the behavior are independent. If that's the case, then the probability of the feature and the behavior together is nothing but the product of the probability of the feature times the probability of the behavior. So this ratio becomes one, the logarithm becomes zero, and so does the mutual information. But if the feature and the behavior are independent, then obviously it's hopeless to try to predict the values of the behavior from the values of the feature.

We'll do an example in a minute to actually compute the mutual information. But before that, a little history about mutual information. What Shannon was really trying to do was measure the information content of various signals. There's a formal definition of information content called the entropy, which we again haven't defined, and we're not going to do that here. But still, believe me, there is one - call it H(F) for the feature signal - and similarly there will be an information content H(B) for the behavior signal. There will also be an information content H(F, B) for the signal consisting of both observations - the feature and the behavior combined. The mutual information is nothing but the difference between the total information in F and B separately and the information of F and B when observed together:

    I(F; B) = H(F) + H(B) - H(F, B)
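To make the definition concrete, here is a minimal Python sketch of that double summation, written directly from the formula above. The function name and the toy numbers are mine, and the log is taken base 2 here, so the result is in bits; the lecture doesn't pin down a base.

    import math

    def mutual_information(joint):
        """Compute I(F; B) from a table of joint probabilities.

        joint[f][b] holds P(f, b); the marginals P(f) and P(b) are
        recovered by summing the rows and columns of the table.
        """
        # Marginal probability of each feature value (row sums).
        p_f = {f: sum(row.values()) for f, row in joint.items()}
        # Marginal probability of each behavior value (column sums).
        p_b = {}
        for row in joint.values():
            for b, p in row.items():
                p_b[b] = p_b.get(b, 0.0) + p
        # The double summation over all feature and behavior values.
        mi = 0.0
        for f, row in joint.items():
            for b, p_fb in row.items():
                if p_fb > 0:
                    mi += p_fb * math.log2(p_fb / (p_f[f] * p_b[b]))
        return mi

    # Sanity check: an independent feature and behavior. Every ratio
    # P(f, b) / (P(f) * P(b)) is exactly one, so I(F; B) comes out zero.
    independent = {
        "present": {"browser": 0.375, "buyer": 0.125},
        "absent":  {"browser": 0.375, "buyer": 0.125},
    }
    print(mutual_information(independent))  # 0.0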
We won't go into the intuition behind this identity too much, except to note that the information content of two variables observed together can never exceed the total information in both variables taken separately; that is, H(F, B) <= H(F) + H(B). As a result, the mutual information can never be negative. Just remember this: if during any of your calculations you get a negative value of mutual information, you've made a mistake.

Now let's do an example. We'll take the same set of comments that we used earlier and compute the mutual information between a word such as "hate" and the sentiment, positive or negative. The probability that a comment is positive is just the fraction of positive comments, which is 6,000/8,000, and similarly 2,000/8,000 for negative comments. The probability that "hate" occurs in a comment is just the number of comments containing "hate" out of the total of 8,000 comments, and similarly for the probability that "hate" does not occur.

The joint probabilities are a little more tricky. They're a little different from the conditional probabilities that we had earlier. For example, the joint probability that "hate" does not occur and the comment is positive is all the positive comments - that is, six thousand of them - divided by the total number of comments, because this is the joint probability rather than the conditional one. The probability that "hate" occurs in a positive comment, however, would be zero, since "hate" never occurs in a positive comment, so we have to smooth it by just making the value 1/8,000. And similarly we can compute the probability of not-"hate" in a negative comment and the probability of "hate" in a negative comment.

The mutual information between the word "hate" and the sentiment, + or -, is obtained by plugging into the formula. You have four terms in the formula: one for (hate, +), one for (not hate, +), one for (hate, -), and one for (not hate, -). And the result is that the mutual information between "hate" and sentiment is .22. Well, is this good or bad?
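Here is how those smoothed joint probabilities can be built from raw counts, reusing the mutual_information function from the sketch above. Two caveats: the transcript doesn't show how the 2,000 negative comments divide between containing "hate" and not, so the 1,000/1,000 split below is an assumption for illustration; and I renormalize after smoothing (a small departure from the lecture's quick 1/8,000 hack), which guarantees the total can never come out negative.

    def smoothed_joint(counts):
        """Turn raw co-occurrence counts into a joint probability table.
        Zero counts are bumped up to 1 (the lecture's smoothing trick),
        then everything is renormalized so the table sums to one."""
        bumped = {f: {b: max(c, 1) for b, c in row.items()}
                  for f, row in counts.items()}
        z = sum(sum(row.values()) for row in bumped.values())
        return {f: {b: c / z for b, c in row.items()}
                for f, row in bumped.items()}

    # 8,000 comments in all: "hate" never occurs in the 6,000 positive
    # ones, so that cell is a zero count that gets smoothed. The split
    # of the 2,000 negative comments is assumed, not from the slide.
    hate_joint = smoothed_joint({
        "hate":     {"positive": 0,    "negative": 1000},
        "not hate": {"positive": 6000, "negative": 1000},
    })
    print(mutual_information(hate_joint))  # ~0.29 bits with assumed counts

With these assumed counts the result is roughly 0.29 bits; the lecture's exact .22 depends on the slide's actual counts and its log base. Notice also that some individual terms of the sum come out negative (the (not hate, -) term does here); it is only the total that can never be negative.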
Let's check another word: the word "course". It occurs in all the comments, so the probability of "course" is just one, and the probability of not-"course" is actually zero, but we smooth it by making it 1/8,000. The joint probability of "course" and positive is just the fraction of positive comments, because "course" occurs in all positive comments - in fact, it occurs in all comments. And similarly, the joint probability of "course" and negative is just the probability of negative comments, because it occurs in all comments. And not-"course" doesn't occur, so again we smooth this. The resulting mutual information is .003. So what this is saying is that a word like "course", which occurs everywhere, is not able to tell me anything about whether a sentiment is positive or negative. Quite obvious.

But let's change the problem slightly. Let's now look at the case where these two comments don't actually have the word "course"; we have just reworded them a bit. For "course", we now have different values, because "course" occurs in only some of the comments, and not-"course" occurs in these 1,400 comments. The joint probability of "course" and positive again changes: it's no longer all the positive comments, it's just the 5,000 out of 6,000 positive comments that have "course", because you're removing these 1,000 comments that are positive and don't have "course". And the probability that not-"course" occurs in a positive comment comes from exactly these 1,000 comments. So you get these values.

Now the value of the mutual information between "course" and sentiment is a bit bigger than before, but still much, much smaller than .22. What this tells us is that "course" is still a poor determiner of whether a comment is positive or negative - something which is intuitively obvious to us. What's interesting is that, using mutual information, a computer can determine such facts from examining vast volumes of data.
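Plugging both "course" scenarios into the same helpers makes the comparison concrete. The 5,000/1,000 positive-side split comes from the lecture; the 1,600/400 negative-side split is my reconstruction from the 1,400 total (1,400 minus the 1,000 positive ones), so the printed values are indicative rather than the slide's exact figures.

    # Original scenario: "course" occurs in all 8,000 comments, so both
    # "not course" cells are zero counts that get smoothed.
    course_everywhere = smoothed_joint({
        "course":     {"positive": 6000, "negative": 2000},
        "not course": {"positive": 0,    "negative": 0},
    })

    # Reworded scenario: 1,400 comments lack "course" - 1,000 positive
    # (from the lecture) plus, by subtraction, 400 negative (assumed).
    course_reworded = smoothed_joint({
        "course":     {"positive": 5000, "negative": 1600},
        "not course": {"positive": 1000, "negative": 400},
    })

    print(mutual_information(course_everywhere))  # essentially zero
    print(mutual_information(course_reworded))    # ~0.001 bits, still tiny

Either way the mutual information stays two orders of magnitude below the 0.22 we got for "hate", which is the lecture's point: a word that occurs almost everywhere carries almost no information about sentiment.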