Now let's see what TF-IDF has to do with mutual information. Remember that we have a transmitted signal, which is the content of the web page, and we want to somehow compute the best keywords to represent this web page, so that the mutual information between the content and the web page's keywords is high. Our channel in this case is TF-IDF, and we're trying to figure out what this channel does in terms of maximizing the mutual information. TF-IDF was actually invented, as we just argued, as an intuitive, heuristic technique. But it has been shown recently that the mutual information between all pages in a collection and all the words in the collection is actually proportional to this sum, which is essentially the individual TF-IDF of each word summed up over all the documents. So if you take every word in the collection, compute its TF-IDF, and add that up across all documents and all words, you'll get, up to a constant factor, the mutual information between all words and all pages. This is certainly very interesting because it puts this fairly intuitive, heuristic technique on a firm mathematical footing. I am conscious of the fact that I haven't defined for you exactly what mutual information is. But like I said earlier, bear with me, because there are many instances where mutual information is important, so when you finally see the formula, it'll become extremely interesting. Now let's try to compute the best keywords for this paragraph, taken from the landing page for this course, using TF-IDF. The term frequencies for each word are easily calculated: they're merely the number of times each word occurs in this paragraph. But what about the document frequencies? We only have one paragraph, so where do we look? What do you think? Well, what is the largest document collection available to all of us? The web, obviously.
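For those who like to see it in symbols, the relationship between mutual information and TF-IDF described above can be sketched roughly as follows. The notation here is an assumption and not from the lecture: W stands for the words, D for the documents, tf(w, d) for the number of times word w occurs in document d, df(w) for the number of documents containing w, and N for the total number of documents. The constant of proportionality and the approximations behind the result are not reproduced here.

```latex
I(W; D) \;\propto\; \sum_{w} \sum_{d} \mathrm{tf}(w, d)\,\log\frac{N}{\mathrm{df}(w)}
```

Each term of the double sum is just the TF-IDF of one word in one document, so the sum is the TF-IDF added up across all words and all documents.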
So to find out if a word is rare or common, we just search for it on the web and look at the number of results that turn up. We also need an estimate of the number of documents on the web. We estimated that last week using searches for common words, which told us that around 50 billion pages are indexed by a search engine like Google. I would like to mention here that search engines don't actually index every possible URL, so the total number of URLs is much, much larger than 50 billion. There has been an animated discussion in the forum regarding this point, and I would like to thank everyone who contributed to that. However, for the purpose of this discussion, we merely need an estimate of how rare or frequent a word is, and taking just the indexed web as our estimate is good enough. So let's see what we get by searching for the different words in this paragraph. Searching for 'the', we get around 25,000,000,000 results. Searching for 'map reduce', on the other hand, we get close to 200,000,000 results. We can similarly find the number of hits we get for the other words in this paragraph. We then compute the ratio of 50 billion, which is our estimate for the total number of documents on the web, to the number of hits; that gives the IDF before taking logs. Then we take the log, multiply it by the frequency of the term in the paragraph itself, and we get the TF-IDF value. Well, for 'the', the ratio is two, and the log of two (to base 2) is one, obviously, so you multiply it by two and you get two. But interestingly, for the others you get slightly surprising results, but also intuitive ones. 'Course' is a very much more common word than 'map reduce', but it also occurs twice, so it comes up high in TF-IDF. So do 'map reduce' and 'web intelligence', even though they occur only once. What taking the log does is make sure that you give a higher weight to the term frequency as opposed to this ratio, but this ratio is also taken into account.
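The arithmetic above can be tried out with a small Python sketch. It is only an illustration under stated assumptions: the function name `tf_idf` and the variable names are made up for this example, the log is taken to base 2 (which is what "log of two is one" implies), and the term frequency of two for 'the' is inferred from the multiplication in the lecture. Only the hit counts for 'the' and 'map reduce' and the 50 billion estimate come from the lecture.

```python
import math

# Estimate from the lecture: roughly 50 billion pages in the indexed web.
TOTAL_PAGES = 50e9


def tf_idf(term_count, hit_count, total_pages=TOTAL_PAGES):
    """Hypothetical helper: term frequency times log2(total pages / web hits)."""
    return term_count * math.log2(total_pages / hit_count)


# (term, occurrences in the paragraph, approximate web hits)
# The count of 2 for 'the' is an assumption inferred from the lecture's arithmetic.
examples = [
    ("the", 2, 25e9),
    ("map reduce", 1, 200e6),
]

scores = {term: tf_idf(count, hits) for term, count, hits in examples}

# Rank the terms from highest to lowest TF-IDF.
ranked = sorted(scores, key=scores.get, reverse=True)

for term in ranked:
    print(f"{term}: {scores[term]:.2f}")
```

With these numbers 'the' scores 2, while 'map reduce' scores log2(250), a little under 8, so the rarer term ranks higher even though it occurs only once.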
So the top keywords for a paragraph can be computed automatically. Just as we might have guessed from looking at the paragraph, it is about a course on 'web intelligence' and 'map reduce', which makes a lot of sense. It's certainly not about 'media', and certainly not about 'the'. So the machine has already done what we do fairly intuitively. Now let's ask the question: once you've got the keywords, could you possibly choose a good title for this document? Well, this is an open problem today, and I'll leave you to think about it and discuss it in the forum.