To see mutual information in action, let's turn to AdSense, the mechanism Google uses to place ads on webpages other than search results. In such cases there are no search terms against which to match keyword bids, so what AdSense does is figure out which keywords best represent a webpage's actual content, and use these to decide which keyword bidders should get ad space on the page. For example, suppose you are reading a review of a camera. Then the ad on the right has probably been posted by a company in multi-brand retail that happened to bid high for the camera keyword. On the other hand, if you turn to a story about smartphones, you end up seeing an ad about mobiles. What is happening is that the AdSense code that Google asks you to put on your site is figuring out the best keywords that represent the content of your page, and matching them to the keywords being bid on in the keyword auction.

In a sense, this is an inverse of search. Think about it this way. When you are searching, you give some query keywords, and you want to come back with the pages that best match them. In this case, you are shown a page, and the system needs to guess the best possible keywords you would have searched with if this were the page you really wanted as a result. Viewing this problem in the language of information theory, the transmitted signal is the content of the webpage, and what you are receiving is the keywords. The channel that enables you to do this is AdSense, which is Google's technique to guess keywords from content, and it tries to maximize the mutual information between these two signals. Search, on the other hand, is the reverse: you are given some keywords and you want to find those pages which best match the keywords that you chose. So the question now reduces to how to maximize the mutual information between the words on one side and the words that you want to receive on the other; that is what would define the way you design an AdSense channel. Of course, you may have noticed that we haven't yet defined exactly how to compute mutual information. Be patient; we will come to that, because mutual information is so deep a concept that it applies in many contexts. For the moment, bear with me and believe that there is a formula with which one can exactly compute the mutual information between two signals.

Disregarding mutual information for the time being, let's think about how one might find the best possible keywords for a given webpage. The converse problem, on the search side, is closely related: which terms in a query should one focus on while searching? Obviously, you don't need to worry about which documents match 'the', 'and', or 'a'. You really should focus on those words in the query which are likely to be keywords in the documents that you want to find. Let's figure this out intuitively. Clearly, a word like 'the' conveys much less about the content of some page describing computer science concepts than, say, the word 'Turing', which is really likely to appear on pages about computer science. Clearly, rarer words, that is, words that are not that common across all documents, unlike 'the', 'a', and 'an', make better keywords. Even other keywords, like 'computer', might be present in many, many documents, but they are certainly rarer than 'the', 'in', 'or', etcetera. So, based on the principle that rarer words make better keywords, the concept of the inverse document frequency of a word becomes useful, as the short sketch below illustrates.
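Before defining inverse document frequency formally, here is a minimal sketch of the rarity principle, using a toy three-document corpus of my own invention (nothing AdSense actually runs): we count, for each word, how many documents it appears in, and the words appearing in the fewest documents are the better keyword candidates.

```python
from collections import Counter

# A toy corpus: each "document" is just a list of words.
docs = [
    "the turing machine is a model of computation".split(),
    "the camera has a large sensor and the lens is sharp".split(),
    "the smartphone camera rivals a compact camera".split(),
]

# Count, for each word, how many documents it appears in.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc))  # set(): count each word at most once per document

# Rarer words (lower document frequency) make better keyword candidates.
for word, n_w in sorted(doc_freq.items(), key=lambda kv: kv[1]):
    print(f"{word:12s} appears in {n_w} of {len(docs)} documents")
```

On this corpus, 'turing' and 'smartphone' appear in a single document each, while 'the' appears in all three, which is exactly the ordering we want.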
Now, what is inverse document frequency? Let N be the total number of documents, say, all the documents on the web, and suppose that out of these, N_w contain the word w. Then the ratio N / N_w (obviously, N_w will be at most N) tells us the inverse of the fraction of documents that contain w. We take the logarithm of this term, and we use the inverted fraction precisely so that the logarithm does not come out negative. The result is called the inverse document frequency of the word:

    IDF(w) = log(N / N_w)

Well, that's obviously not enough, because a document needs to contain the word itself if the word is to be a keyword for it. And if it contains many instances of the word 'Turing', say fifteen times, it is much more likely that 'Turing' is a keyword for that document compared to, say, a document where the word appears only twice. So the second principle that we apply in our intuition is that more frequent words make better keywords... more frequent in the document that we are considering, not more frequent in general. So: rarer words overall, but more frequent in the document that we are considering. We simply multiply the inverse document frequency by another term, the frequency of the word in the given document: if the word w occurs five times in document d, then the term frequency TF(w, d) is five, and so on. TF-IDF is nothing but the Term Frequency multiplied by the Inverse Document Frequency:

    TF-IDF(w, d) = TF(w, d) x IDF(w)

Words having a high TF-IDF are considered to be good keywords. Apart from guessing keywords, think about it from the search perspective. If you are searching with a query that has certain words whose IDF is high, you would like to emphasize those in your query. At the same time, when you index a word, you want to weight it by its TF-IDF value. If a word occurs a hundred times in a document, but the word is 'the', weighting that element in the index by 100 doesn't make sense. But if a word like 'queueing' occurs a hundred times, then weighting it with a high value makes sense. TF-IDF accurately captures this intuition.
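Putting the two principles together, here is a minimal TF-IDF scorer, a sketch under the definitions above (raw counts for TF, log(N / N_w) for IDF; production systems typically use smoothed variants of both). It reuses the same toy corpus as before.

```python
import math
from collections import Counter

def idf(word, docs):
    """Inverse document frequency: log(N / N_w)."""
    n_w = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n_w) if n_w else 0.0

def tf_idf(doc, docs):
    """Score every word in `doc` by TF(w, d) * IDF(w)."""
    tf = Counter(doc)  # raw term frequency within this document
    return {w: tf[w] * idf(w, docs) for w in tf}

docs = [
    "the turing machine is a model of computation".split(),
    "the camera has a large sensor and the lens is sharp".split(),
    "the smartphone camera rivals a compact camera".split(),
]

scores = tf_idf(docs[0], docs)
# Words with a high TF-IDF are the keyword candidates for this page.
# 'the' scores 0 because it occurs in every document: log(3/3) = 0.
for word, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{word:12s} {score:.3f}")
```

The same scores serve both sides of the channel discussed above: they pick out keyword candidates for a page, and they give the weights you would attach to each word when indexing that page for search.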