1 00:00:00,000 --> 00:00:08,041 To see mutual information in action, let's turn to AdSense, which is the mechanism 2 00:00:08,041 --> 00:00:15,058 that Google uses to place ads in webpages other than search results. 3 00:00:15,058 --> 00:00:23,503 In such cases there are no search terms against which to match the keyword bids, so 4 00:00:23,503 --> 00:00:32,088 what AdSense does is figure out which keywords best represent 5 00:00:32,088 --> 00:00:40,648 the webpage's actual content and use these to decide which keyword bids should get ad 6 00:00:40,648 --> 00:00:46,072 space on the page. For example, suppose you are reading this 7 00:00:46,072 --> 00:00:52,909 review of a camera. Then the ad on the right has probably been 8 00:00:52,909 --> 00:00:59,422 posted by a company in multi-brand retail that happened to bid high for the camera 9 00:00:59,422 --> 00:01:03,345 keyword. On the other hand, if you turned to a 10 00:01:03,345 --> 00:01:08,635 story about smartphones, you would end up seeing an ad about mobiles. 11 00:01:08,635 --> 00:01:14,917 What is happening is that the AdSense code that Google asks you to put on your 12 00:01:14,917 --> 00:01:21,277 site is figuring out which keywords best represent the content on 13 00:01:21,277 --> 00:01:25,998 your page, and matching those to the keywords that are 14 00:01:25,998 --> 00:01:35,454 being bid on in the keyword auction. In a sense this is an inverse of search. 15 00:01:35,454 --> 00:01:42,214 Think about it this way. When you're searching, you give some query 16 00:01:42,214 --> 00:01:46,991 keywords, and you want to come back with the pages 17 00:01:46,991 --> 00:01:53,054 that best match them. In this case, you're shown a page and the 18 00:01:53,054 --> 00:02:00,074 system needs to guess the best possible keywords that you would have 19 00:02:00,074 --> 00:02:06,315 searched with if this was the page that you really wanted as a result.
20 00:02:06,315 --> 00:02:11,625 Viewing this problem in the language of information theory, 21 00:02:11,625 --> 00:02:19,224 the transmitted signal is the content of the web page, and what you're receiving are 22 00:02:19,224 --> 00:02:27,645 the keywords. The channel that enables you to do this is AdSense, which is Google's 23 00:02:27,645 --> 00:02:35,059 technique to guess keywords from content, and it tries to maximize the mutual 24 00:02:35,059 --> 00:02:42,043 information between these two signals. Search, on the other hand, is the reverse. 25 00:02:43,000 --> 00:02:50,058 You're given some keywords and you want to find those pages which best match the 26 00:02:50,058 --> 00:02:56,061 keywords that you chose. So, the question now reduces to how to 27 00:02:56,061 --> 00:03:03,093 maximize the mutual information between the words on one side and the words that 28 00:03:03,093 --> 00:03:10,088 you want to receive on the other; that would define the way you design an 29 00:03:10,088 --> 00:03:15,058 AdSense channel. Of course, you might notice that we 30 00:03:15,058 --> 00:03:22,009 haven't yet defined exactly what the technique to compute the mutual 31 00:03:22,009 --> 00:03:26,046 information is. Be patient, we're going to come to that, 32 00:03:26,046 --> 00:03:33,029 because mutual information is so deep a concept that it applies in many contexts. 33 00:03:33,029 --> 00:03:38,022 For the moment, bear with me and believe that there is a 34 00:03:38,022 --> 00:03:44,075 formula using which one can exactly compute the mutual information between two 35 00:03:44,075 --> 00:03:50,000 signals. Disregarding mutual information for the 36 00:03:50,000 --> 00:03:57,074 time being, let's think about how one might construct the best possible keywords 37 00:03:57,074 --> 00:04:03,044 given a web page. The converse problem of search is 38 00:04:03,044 --> 00:04:08,034 closely related. 
Which terms in a query should one consider 39 00:04:08,034 --> 00:04:13,032 while searching? Obviously, you don't need to worry about 40 00:04:13,032 --> 00:04:19,081 which documents match 'the', 'and', 'a'. We really should focus on those words in 41 00:04:19,081 --> 00:04:27,019 the query which are likely to be keywords in the documents that you want to search 42 00:04:27,019 --> 00:04:30,056 for. Let's figure this out intuitively. 43 00:04:30,056 --> 00:04:36,043 A word like 'the' conveys much less about the content of some page 44 00:04:36,043 --> 00:04:42,030 describing computer science concepts than, say, the word 'Turing', which is really 45 00:04:42,030 --> 00:04:45,068 likely to be on pages about computer science. 46 00:04:47,069 --> 00:04:53,067 Clearly, rarer words, that is, words that are not as common across all documents 47 00:04:53,067 --> 00:04:59,088 as 'the', 'a', and 'an', make better keywords. Even other keywords like 'computer' might 48 00:04:59,088 --> 00:05:06,084 be present in many, many documents, but they are certainly rarer than 'the', 'in', 'or', 49 00:05:06,084 --> 00:05:11,011 etcetera. So, based on the principle that rarer 50 00:05:11,011 --> 00:05:17,073 words make better keywords, the concept of the inverse document frequency 51 00:05:17,073 --> 00:05:19,067 of a word becomes useful. 52 00:05:19,067 --> 00:05:23,031 Now, what is inverse document frequency? 53 00:05:23,057 --> 00:05:28,082 Let N be the total number of documents, 54 00:05:28,082 --> 00:05:35,087 say all the documents on the web, and out of these, N sub W contain the word 55 00:05:35,087 --> 00:05:39,034 W. Then the ratio N over N sub W 56 00:05:39,034 --> 00:05:45,062 (obviously, N sub W will be at most N) tells us the 57 00:05:45,062 --> 00:05:52,039 inverse of the fraction of documents which contain W, as compared to all the documents. 
58 00:05:52,039 --> 00:05:58,042 And then we take the logarithm of this term and, obviously, we reverse the 59 00:05:58,042 --> 00:06:02,030 fraction because, otherwise, the logarithm would 60 00:06:02,030 --> 00:06:06,076 become negative. And we get what is called the inverse 61 00:06:06,076 --> 00:06:12,000 document frequency of a word. Well, that's obviously not enough, 62 00:06:12,000 --> 00:06:17,067 because a document needs to contain the word itself if the word is to be a 63 00:06:17,067 --> 00:06:21,064 keyword. And if it contains many instances of the 64 00:06:21,064 --> 00:06:27,005 word 'Turing', maybe fifteen times for example, it's much more likely that 65 00:06:27,005 --> 00:06:33,053 'Turing' is a keyword for that document compared to, say, a document where the word 66 00:06:33,053 --> 00:06:39,052 appears only twice. So, the second principle that we apply in 67 00:06:39,052 --> 00:06:45,000 our intuition is that more frequent words make better keywords: 68 00:06:45,000 --> 00:06:51,090 more frequent in the document that we are considering, not more frequent in general. 69 00:06:51,090 --> 00:06:58,063 So, rarer words overall, but more frequent in the document that we are considering. 70 00:06:58,063 --> 00:07:04,008 So we simply multiply the inverse document frequency by 71 00:07:04,008 --> 00:07:09,062 another term, which is the frequency of the word in a 72 00:07:09,062 --> 00:07:14,071 given document. If the word occurs five times, 73 00:07:14,071 --> 00:07:22,087 N sub W sub T is five, and so on. So TF-IDF is nothing but the Term Frequency 74 00:07:22,087 --> 00:07:27,076 multiplied by the Inverse Document Frequency. 75 00:07:27,076 --> 00:07:34,039 Words having a high TF-IDF are considered to be good keywords. 76 00:07:35,019 --> 00:07:40,048 Apart from guessing keywords, think about it from the search perspective. 
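In symbols, the keyword weighting described up to this point can be sketched as follows. N and N_w are the quantities named in the lecture; the symbol n_{w,d} for the number of times word w occurs in document d is notation assumed here, and the base of the logarithm is left open because the lecture does not fix one.

```latex
\mathrm{IDF}(w) = \log\frac{N}{N_w},
\qquad
\text{TF-IDF}(w, d) = n_{w,d} \cdot \log\frac{N}{N_w}
```

Since N_w can never exceed N, the ratio is at least 1 and the logarithm is never negative. A word that appears in every document has N_w = N, so its IDF is zero no matter how often it occurs.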
77 00:07:41,018 --> 00:07:49,041 If you're searching with a query that has certain words whose IDF is high, you 78 00:07:49,041 --> 00:07:56,095 would like to use those in your query. At the same time, when you index a word, 79 00:07:56,095 --> 00:08:04,000 you want to weight it by its TF-IDF value. If a word occurs a hundred times in a 80 00:08:04,000 --> 00:08:10,076 document, but the word is 'the', weighting that element in the index by 100 doesn't 81 00:08:10,076 --> 00:08:14,083 make sense. But if a word like 'Turing' [inaudible] 82 00:08:14,083 --> 00:08:20,029 occurs 100 times, then weighting it with a high value makes sense. 83 00:08:20,029 --> 00:08:24,012 TF-IDF accurately captures this intuition.
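To make the weighting concrete, here is a minimal Python sketch of term frequency multiplied by inverse document frequency. It is an illustration under stated assumptions, not how AdSense or any search index actually computes weights: the three-document corpus, the whitespace tokenizer, the natural logarithm, and the function names `tokenize`, `inverse_document_frequency`, and `tf_idf` are all made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def inverse_document_frequency(docs):
    """Map each word w to log(N / N_w), where N is the number of documents
    and N_w is the number of documents that contain w at least once."""
    n_docs = len(docs)
    docs_containing = Counter()
    for doc in docs:
        # set() so that each document counts a word only once
        docs_containing.update(set(tokenize(doc)))
    return {w: math.log(n_docs / n_w) for w, n_w in docs_containing.items()}


def tf_idf(doc, idf):
    """Map each word in doc to (raw count in doc) * idf[word].
    Words absent from the idf table get weight 0.0 (an assumption of this sketch)."""
    counts = Counter(tokenize(doc))
    return {w: count * idf.get(w, 0.0) for w, count in counts.items()}


# Hypothetical toy corpus: 'the' occurs in every document, 'turing' in only one.
docs = [
    "the turing machine and the turing test",
    "the camera review",
    "the smartphone story",
]

idf = inverse_document_frequency(docs)
weights = tf_idf(docs[0], idf)
best = max(weights, key=weights.get)
print(best)  # prints: turing
```

In this toy corpus 'the' and 'turing' both occur twice in the first document, yet 'the' gets weight zero because it appears in all three documents, while 'turing' gets the highest weight because it is frequent in that document and rare overall.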