Now let's see what TF-IDF has to do with
mutual information.
Remember that we have a transmitted signal
which is the content of the web page.
And we want to somehow compute what are
the best keywords that should represent
this webpage, so that the mutual
information between the content and the
webpage keywords is high.
Our channel in this case is TF-IDF.
We're trying to figure out what this
channel does in terms of maximizing the
mutual information.
TF-IDF was actually invented, as we just
argued, as an intuitive, heuristic
technique.
But it has been shown recently that the
mutual information between all pages in a
collection and all the words in the
collection is actually proportional to
this sum, which is essentially the
individual TF-IDF of each word summed up
over all the documents.
So if you take every word in the
collection, compute its TF-IDF, and add
that up across all documents and all
words, you'll get a quantity proportional
to the mutual information between all
words and all pages.
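Written out, the sum just described might look like the following. This is only a sketch of the proportionality as stated, not a derivation. The symbols are assumptions for illustration: $d$ ranges over documents, $w$ over words, $\mathrm{tf}(w, d)$ is the count of $w$ in $d$, $N$ is the number of documents, and $N_w$ is the number of documents containing $w$. The constant of proportionality is left unspecified.

```latex
I(D; W) \;\propto\; \sum_{d} \sum_{w} \mathrm{tf}(w, d) \cdot \log \frac{N}{N_w}
```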
This is certainly very interesting because
it puts this fairly intuitive, heuristic
technique on a firm mathematical footing.
I am conscious of the fact that I haven't
defined for you exactly what is mutual
information.
But like I said earlier, bear with me
because there are many instances where
mutual information is important so when
you finally see the formula, it'll become
extremely interesting.
Let's now try to compute, using TF-IDF,
the best keywords that represent this
paragraph taken from the landing page for
this course.
The term frequencies for each word are
easily calculated: they're merely the
number of times each word occurs in this
paragraph. But what about the document
frequencies?
We only have one paragraph so where do we
look?
What do you think?
Well what is the largest document
collection available to all of us?
The web obviously.
So to find out if a word is rare or
common, we just search for it on the web
and look at the number of results that
turn up.
We also need an estimate of the number of
documents on the web.
And we estimated that last week, using
searches for common words, which told us
that around 50 billion pages are indexed
by a search engine like Google.
I would like to mention here that search
engines don't actually index every
possible URL, so the total number of URLs
is much, much larger than 50 billion.
There has been an animated discussion in
the forum regarding this point and I would
like to thank everyone who contributed to
that.
However, for the purpose of this
discussion, we merely need an estimate of
how rare or frequent the word is, and
taking just the indexed web as our
estimate is good enough.
So let's see what we get by searching for
the different words in this paragraph.
Searching for 'the', we get around
25,000,000,000 results.
Searching for 'map reduce', on the other
hand, we get close to 200,000,000 results.
We can similarly calculate the number of
hits we get for the other words in this
paragraph.
We compute the ratio of 50 billion, which
is our estimate for the total number of
documents on the web, to the number of
hits.
That gives us the IDF before taking logs.
We then take the log, multiply it by the
frequency of the term in the paragraph
itself, and we get the TF-IDF value.
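The calculation just described can be sketched in a few lines of Python. This is a minimal illustration, not the exact procedure used on the slide. It assumes base-2 logs, a term count of two for 'the', and a count of one for 'map reduce'. The hit counts and the 50 billion total are the rough figures quoted above, and the names `tf_idf` and `observations` are made up for this example.

```python
import math

# Assumed size of the indexed web (the lecture's rough estimate).
TOTAL_PAGES = 50e9

def tf_idf(term_count, hit_count, total_pages=TOTAL_PAGES):
    """Term count in the paragraph times log2(total pages / pages with the term)."""
    return term_count * math.log2(total_pages / hit_count)

# term -> (assumed count in the paragraph, approximate number of web hits)
observations = {
    "the": (2, 25e9),
    "map reduce": (1, 200e6),
}

scores = {term: tf_idf(tf, hits) for term, (tf, hits) in observations.items()}

# Rank terms from most to least characteristic of the paragraph.
ranked = sorted(scores, key=scores.get, reverse=True)
for term in ranked:
    print(f"{term}: {scores[term]:.2f}")
```

Under these assumptions 'map reduce' scores close to 8 while 'the' scores exactly 2, so the rarer term ranks first even though it occurs only once.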
For 'the', this ratio is two, and the log
of two (base 2) is one, obviously, so you
multiply it by its term frequency of two
and you get two.
For the others, interestingly, you get
slightly surprising results, but also
intuitive ones.
'Course' is a much more common word than
'map reduce', but it also occurs twice.
So it comes up high in TF-IDF.
So do 'map reduce' and 'web intelligence',
even though they occur only once.
What taking the log does is make sure
that you give a higher weight to the
term frequency as opposed to this ratio,
but this ratio is also taken into account.
So the top keywords for a paragraph can be
automatically computed, and they are just
what we might have guessed looking at the
paragraph: this is about a course on
'web intelligence' and 'map reduce', which
makes a lot of sense.
It's certainly not about media and
certainly not about 'the'.
So the machine has already done what we do
fairly intuitively.
Now let's ask the question: once you've
got the keywords, could you possibly
choose a good title for this document?
Well, this is an open problem today.
And I'll leave you to think about it and
discuss this in the forum.