Well, as you might have guessed, we are going to return to information theory and look at machine learning from that perspective. We're transmitting signals consisting of features (words or other features, for example) and we are using those to predict the values of some behavior (browsers versus buyers, etc.), and our goal is to improve the mutual information between these two signals via a machine learning algorithm.

Now, what you are all probably waiting for is the actual definition of mutual information. The mutual information between a feature F (a word, for example) and a behavior B (browser or buyer, for example) is defined formally as a double sum over all values f of the feature and all values b of the behavior:

I(F; B) = sum over f and b of P(f, b) * log[ P(f, b) / (P(f) * P(b)) ].

So, in our case, there would be four terms in the sum: feature present or absent, combined with behavior browser or buyer. In each case you compute the joint probability of that feature value and behavior value and multiply it by this log ratio.

To understand the ratio better, imagine that the feature and the behavior are independent. If that's the case, then the probability of the feature and the behavior occurring together is nothing but the product of the probability of the feature times the probability of the behavior. So the ratio becomes one, the logarithm becomes zero, and so does the mutual information. And if the feature and the behavior are independent, then obviously it's hopeless to try to predict the values of the behavior from the values of the feature.

We'll do an example in a minute to actually compute the mutual information, but before that, a little history about mutual information. What Shannon was really trying to do was measure the information content of various signals. There is a formal definition of information content, called the entropy, which we haven't defined and are not going to define here; but believe me, there is one for the feature signal, and similarly there is an information content for the behavior signal, and an information content for the signal consisting of both observations, the feature and the behavior combined. The mutual information is nothing but the difference between the total information in F and B taken separately and the information in F and B when observed together. We won't go into the intuition behind this too much, except to note that the information content of two variables observed together can never exceed the total information in the two variables taken separately. As a result, the mutual information is always non-negative. Just remember this: if during any of your calculations you get a negative value of mutual information, you've made a mistake.
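Before the worked example, here is a minimal sketch of the definition in code. The function name, the dictionary layout for the 2x2 table, and the use of base-2 logarithms are my own choices (the lecture does not specify a base); it also checks the independence case, where the mutual information comes out to exactly zero.

```python
import math

def mutual_information(joint):
    """Mutual information (in bits) between two discrete variables.

    `joint` maps (feature_value, behavior_value) pairs to joint
    probabilities that sum to 1.
    """
    # Marginal probabilities of the feature and of the behavior.
    p_feature = {}
    p_behavior = {}
    for (f, b), p in joint.items():
        p_feature[f] = p_feature.get(f, 0.0) + p
        p_behavior[b] = p_behavior.get(b, 0.0) + p

    # Double sum over all feature/behavior value combinations.
    mi = 0.0
    for (f, b), p in joint.items():
        if p > 0:
            mi += p * math.log2(p / (p_feature[f] * p_behavior[b]))
    return mi

# Independent feature and behavior: every joint probability equals the
# product of the marginals, so every log term is zero and MI is 0.
independent = {("w", "+"): 0.375, ("w", "-"): 0.125,
               ("no-w", "+"): 0.375, ("no-w", "-"): 0.125}
print(mutual_information(independent))  # 0.0
```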
Now let's do an example. We'll take the same set of comments that we used earlier and compute the mutual information between a word such as "hate" and the sentiment, positive or negative. The probability that a comment is positive is just the fraction of positive comments, which is 6000/8000, and similarly 2000/8000 for negative comments. The probability that "hate" occurs in a comment is just the number of comments containing "hate" divided by the total of 8000 comments, and similarly for the probability that "hate" does not occur.

The joint probabilities are a little more tricky; they are different from the conditional probabilities that we had earlier. For example, the joint probability that "hate" does not occur and the comment is positive is all of the positive comments, that is, six thousand of them, divided by the total number of comments, because this is a joint probability rather than a conditional one. The probability that "hate" occurs in a positive comment, on the other hand, is zero in our data, so we smooth it by just making the value 1/8000. Similarly we can compute the joint probability of not-"hate" with a negative comment and of "hate" with a negative comment. The mutual information between the word "hate" and the sentiment, + or -, is obtained by plugging into the formula, and you have four terms: hate with +, not-hate with +, hate with -, and not-hate with -. The result is that the mutual information between "hate" and the sentiment is 0.22.

Well, is this good or bad? Let's check another word, the word "course". It occurs in all the comments, so the probability of "course" is just one, and the probability of not-"course" is actually zero, but we smooth it by making it 1/8000. The joint probability of "course" with positive is just the fraction of positive comments, because "course" occurs in all positive comments; in fact, it occurs in all comments. Similarly, the joint probability of "course" with negative is just the probability of a negative comment, because it occurs in all comments. And not-"course" doesn't occur, so again we smooth those terms. The resulting mutual information is 0.003. So what this is saying is that a word like "course", which occurs everywhere, is not able to tell me anything about whether a comment is positive or negative. Quite obvious.

But let's change the problem slightly. Let's now look at the case where these two comments don't actually have the word "course"; we have just reworded them a bit. For "course" we now have different values, because "course" occurs only in some of the comments, and not-"course" occurs in these 1,400 comments. The joint probability of "course" with positive also changes: it's no longer all the positive comments, it's just the 5,000 out of 6,000 positive comments that have "course", because you're removing these 1,000 comments that are positive and don't have "course". And the joint probability that not-"course" occurs in a positive comment comes from exactly those 1,000. So you get these values. Now the mutual information between "course" and the sentiment is a bit bigger than before, but still much, much smaller than 0.22. What this tells us is that "course" is still a poor determiner of whether a comment is positive or negative, something which is intuitively obvious to us. What's interesting is that, using mutual information, a computer can determine such facts from examining vast volumes of data.
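As a rough sketch of that last point, here is how a computer might estimate these scores directly from labeled comments. The function name, the tiny dataset, and the base-2 logarithm are my own choices rather than the lecture's, and any zero count is smoothed to 1 in the spirit of the 1/8000 adjustment above.

```python
import math

def word_sentiment_mi(comments, word):
    """Estimate the mutual information (in bits) between the presence of
    `word` and a comment's label, from a list of (text, label) pairs."""
    total = len(comments)
    labels = ("+", "-")
    # Joint counts over (word present?, label).
    joint = {(present, lbl): 0 for present in (True, False) for lbl in labels}
    for text, label in comments:
        joint[(word in text.lower().split(), label)] += 1

    def prob(count):
        return max(count, 1) / total  # smoothing: a zero count becomes 1

    mi = 0.0
    for (present, label), count in joint.items():
        p_joint = prob(count)
        p_word = prob(sum(joint[(present, lbl)] for lbl in labels))
        p_label = prob(sum(joint[(pres, label)] for pres in (True, False)))
        mi += p_joint * math.log2(p_joint / (p_word * p_label))
    return mi

# A tiny illustrative dataset (not the lecture's 8,000 comments).
comments = [
    ("i hate this course", "-"),
    ("hate this course", "-"),
    ("love this course", "+"),
    ("great course", "+"),
    ("this course is fine", "+"),
    ("enjoyed the course a lot", "+"),
]
print(round(word_sentiment_mi(comments, "hate"), 2))    # higher: only in negative comments
print(round(word_sentiment_mi(comments, "course"), 2))  # lower: occurs in every comment
```

On a dataset this small the smoothing is a large correction, so "course" does not come out as close to zero as it does over 8,000 comments, but the ordering of the two words already tells the same story.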