To see mutual information in action, let's turn to AdSense, the mechanism Google uses to place ads on webpages other than search results. In such cases there are no search terms against which to match keyword bids, so what AdSense does is figure out which keywords best represent a webpage's actual content, and use these to decide which keyword bidders should get ad space on the page. For example, suppose you are reading a review of a camera. Then the ad on the right has probably been posted by a company in multi-brand retail that happened to bid high for the camera keyword. On the other hand, if you turn to a story about smartphones, you end up seeing an ad about mobiles. What is happening is that the AdSense code that Google asks you to put on your site is figuring out the best keywords that represent the content of your page, and matching them to the keywords being bid on in the keyword auction.

In a sense, this is an inverse of search. Think about it this way. When you are searching, you give some query keywords, and you want to come back with the pages that best match them. In this case, you are shown a page, and the system needs to guess the best possible keywords you would have searched with if this were the page you really wanted as a result. Viewing this problem in the language of information theory, the transmitted signal is the content of the webpage, and what you are receiving is the keywords. The channel that enables you to do this is AdSense, which is Google's technique to guess keywords from content, and it tries to maximize the mutual information between these two signals. Search, on the other hand, is the reverse: you are given some keywords and you want to find those pages which best match the keywords that you chose. So the question now reduces to how to maximize the mutual information between the words on one side and the words that you want to receive on the other; that is what would define the way you design an AdSense channel. Of course, you may have noticed that we haven't yet defined exactly how to compute mutual information. Be patient; we will come to that, because mutual information is so deep a concept that it applies in many contexts. For the moment, bear with me and believe that there is a formula with which one can exactly compute the mutual information between two signals.

Disregarding mutual information for the time being, let's think about how one might find the best possible keywords for a given webpage. The converse problem, on the search side, is closely related: which terms in a query should one focus on while searching? Obviously, you don't need to worry about which documents match 'the', 'and', or 'a'. You really should focus on those words in the query which are likely to be keywords in the documents that you want to find. Let's figure this out intuitively. Clearly, a word like 'the' conveys much less about the content of some page describing computer science concepts than, say, the word 'Turing', which is really likely to appear on pages about computer science. Clearly, rarer words, that is, words that are not that common across all documents, unlike 'the', 'a', and 'an', make better keywords. Even other keywords, like 'computer', might be present in many, many documents, but they are certainly rarer than 'the', 'in', 'or', etcetera. So, based on the principle that rarer words make better keywords, the concept of the inverse document frequency of a word becomes useful, as the short sketch below illustrates.
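Before defining inverse document frequency formally, here is a minimal sketch of the rarity principle, using a toy three-document corpus of my own invention (nothing AdSense actually runs): we count, for each word, how many documents it appears in, and the words appearing in the fewest documents are the better keyword candidates.

```python
from collections import Counter

# A toy corpus: each "document" is just a list of words.
docs = [
    "the turing machine is a model of computation".split(),
    "the camera has a large sensor and the lens is sharp".split(),
    "the smartphone camera rivals a compact camera".split(),
]

# Count, for each word, how many documents it appears in.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc))  # set(): count each word at most once per document

# Rarer words (lower document frequency) make better keyword candidates.
for word, n_w in sorted(doc_freq.items(), key=lambda kv: kv[1]):
    print(f"{word:12s} appears in {n_w} of {len(docs)} documents")
```

On this corpus, 'turing' and 'smartphone' appear in a single document each, while 'the' appears in all three, which is exactly the ordering we want.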
Now, what is inverse document frequency? Let N be the total number of documents, say, all the documents on the web, and suppose that out of these, N_w contain the word w. Then the ratio N / N_w (obviously, N_w will be at most N) tells us the inverse of the fraction of documents that contain w. We take the logarithm of this term, and we use the inverted fraction precisely so that the logarithm does not come out negative. The result is called the inverse document frequency of the word:

    IDF(w) = log(N / N_w)

Well, that's obviously not enough, because a document needs to contain the word itself if the word is to be a keyword for it. And if it contains many instances of the word 'Turing', say fifteen times, it is much more likely that 'Turing' is a keyword for that document compared to, say, a document where the word appears only twice. So the second principle that we apply in our intuition is that more frequent words make better keywords... more frequent in the document that we are considering, not more frequent in general. So: rarer words overall, but more frequent in the document that we are considering. We simply multiply the inverse document frequency by another term, the frequency of the word in the given document: if the word w occurs five times in document d, then the term frequency TF(w, d) is five, and so on. TF-IDF is nothing but the Term Frequency multiplied by the Inverse Document Frequency:

    TF-IDF(w, d) = TF(w, d) x IDF(w)

Words having a high TF-IDF are considered to be good keywords. Apart from guessing keywords, think about it from the search perspective. If you are searching with a query that has certain words whose IDF is high, you would like to emphasize those in your query. At the same time, when you index a word, you want to weight it by its TF-IDF value. If a word occurs a hundred times in a document, but the word is 'the', weighting that element in the index by 100 doesn't make sense. But if a word like 'queueing' occurs a hundred times, then weighting it with a high value makes sense. TF-IDF accurately captures this intuition.
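Putting the two principles together, here is a minimal TF-IDF scorer, a sketch under the definitions above (raw counts for TF, log(N / N_w) for IDF; production systems typically use smoothed variants of both). It reuses the same toy corpus as before.

```python
import math
from collections import Counter

def idf(word, docs):
    """Inverse document frequency: log(N / N_w)."""
    n_w = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n_w) if n_w else 0.0

def tf_idf(doc, docs):
    """Score every word in `doc` by TF(w, d) * IDF(w)."""
    tf = Counter(doc)  # raw term frequency within this document
    return {w: tf[w] * idf(w, docs) for w in tf}

docs = [
    "the turing machine is a model of computation".split(),
    "the camera has a large sensor and the lens is sharp".split(),
    "the smartphone camera rivals a compact camera".split(),
]

scores = tf_idf(docs[0], docs)
# Words with a high TF-IDF are the keyword candidates for this page.
# 'the' scores 0 because it occurs in every document: log(3/3) = 0.
for word, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{word:12s} {score:.3f}")
```

The same scores serve both sides of the channel discussed above: they pick out keyword candidates for a page, and they give the weights you would attach to each word when indexing that page for search.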