1 00:00:00,000 --> 00:00:04,092 Now let's see what TF-IDF has to do with mutual information. 2 00:00:04,092 --> 00:00:11,060 Remember that we have a transmitted signal, which is the content of the web page. 3 00:00:11,060 --> 00:00:18,028 And we want to somehow compute what are the best keywords that should represent 4 00:00:18,028 --> 00:00:24,029 this webpage, so that the mutual information between the content and the 5 00:00:24,029 --> 00:00:29,098 webpage keywords is high. Our channel in this case is TF-IDF. 6 00:00:29,098 --> 00:00:36,092 We're trying to figure out what this channel does in terms of maximizing the 7 00:00:36,092 --> 00:00:43,069 mutual information. TF-IDF was actually invented, as we just 8 00:00:43,069 --> 00:00:47,079 argued, as an intuitive or heuristic technique. 9 00:00:47,079 --> 00:00:54,023 But it has been shown recently that the mutual information between all pages in a 10 00:00:54,023 --> 00:01:00,050 collection and all the words in the collection is actually proportional to 11 00:01:00,050 --> 00:01:06,926 this sum, which is essentially the individual TF-IDF of each word summed up 12 00:01:06,926 --> 00:01:14,557 over all the documents. So if you take every word in the collection, compute its TF-IDF, and 13 00:01:14,557 --> 00:01:22,932 add that up across all documents and all words, you'll get the mutual information 14 00:01:22,932 --> 00:01:29,015 between all words and all pages. This is certainly very interesting because 15 00:01:29,015 --> 00:01:35,040 it puts this fairly intuitive, heuristic technique on a firm mathematical footing. 16 00:01:35,073 --> 00:01:42,003 I am conscious of the fact that I haven't defined for you exactly what mutual 17 00:01:42,003 --> 00:01:45,083 information is. But like I said earlier, bear with me 18 00:01:45,083 --> 00:01:51,758 because there are many instances where mutual information is important, so when 19 00:01:51,758 --> 00:01:57,114 you finally see the formula, it'll become extremely interesting. 
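As a rough sketch of the sum just described (not the exact formula from the slide), the relationship can be written as below. The notation is assumed for illustration: $I(D;W)$ is the mutual information between documents and words, $\mathrm{tf}(w,d)$ is the number of times word $w$ occurs in document $d$, $N$ is the number of documents in the collection, and $n_w$ is the number of documents containing $w$. The base-2 logarithm is also an assumption.

```latex
I(D;W) \;\propto\; \sum_{d} \sum_{w} \mathrm{tf}(w,d)\,\log_2\frac{N}{n_w}
```

Each term in the double sum is the TF-IDF of one word in one document, so adding up TF-IDF over all words and all documents gives a quantity proportional to the mutual information.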
20 00:01:57,114 --> 00:02:04,393 Let's try to compute now the best keywords that represent this paragraph taken from 21 00:02:04,393 --> 00:02:10,207 the landing page for this course. Well, let's try to compute the best 22 00:02:10,207 --> 00:02:18,081 keywords for this paragraph taken from the course landing page using TF-IDF. 23 00:02:19,052 --> 00:02:25,056 The term frequencies for each word are easily calculated; they're merely the number 24 00:02:25,056 --> 00:02:30,094 of times each word occurs in this paragraph, but what about the document 25 00:02:30,094 --> 00:02:34,098 frequencies? We only have one paragraph, so where do we 26 00:02:34,098 --> 00:02:36,069 look? What do you think? 27 00:02:37,002 --> 00:02:43,004 Well, what is the largest document collection available to all of us? 28 00:02:43,004 --> 00:02:47,087 The web, obviously. So to find out if a word is rare or 29 00:02:47,087 --> 00:02:53,090 common, we just search for it on the web and look at the number of results that 30 00:02:53,090 --> 00:02:59,067 turn up. We also need an estimate of the total number of 31 00:02:59,067 --> 00:03:06,025 documents on the web. And we estimated that last week using 32 00:03:06,025 --> 00:03:13,634 searches for common words, which told us that around 50 billion pages are indexed by a 33 00:03:13,634 --> 00:03:18,868 search engine like Google. I would like to mention here that 34 00:03:18,868 --> 00:03:24,575 search engines don't actually index every possible URL, so the total 35 00:03:24,575 --> 00:03:27,925 number of URLs is much, much larger than 50 billion. 36 00:03:27,925 --> 00:03:33,936 There has been an animated discussion in the forum regarding this point, and I would 37 00:03:33,936 --> 00:03:37,438 like to thank everyone who contributed to that. 
38 00:03:37,438 --> 00:03:42,662 However, for the purpose of this discussion, we merely need an estimate of 39 00:03:42,662 --> 00:03:47,449 how rare or frequent the word is, and taking just the indexed web as our 40 00:03:47,449 --> 00:03:52,726 estimate is good enough. So let's see what we get by searching for 41 00:03:52,726 --> 00:03:58,404 the different words in this paragraph. Searching for 'the', we get around 42 00:03:58,404 --> 00:04:03,003 25,000,000,000 results. Searching for 'map reduce', on the other 43 00:04:03,003 --> 00:04:09,606 hand, we get close to 200,000,000 results. We can similarly calculate the number of 44 00:04:09,606 --> 00:04:13,980 hits we get for the other words in this paragraph. 45 00:04:13,980 --> 00:04:21,618 We compute the ratio of 50 billion, which is our estimate for the 46 00:04:21,618 --> 00:04:29,333 total number of documents on the web, to the number of hits, to get the IDF before taking logs. Let 47 00:04:29,333 --> 00:04:35,692 me take the log, multiply it by the frequency of the term in the paragraph 48 00:04:35,692 --> 00:04:41,078 itself, and we get the TF-IDF value. Well, here log of two is one, obviously, and 49 00:04:41,078 --> 00:04:47,222 so you multiply it by two and you get two, but interestingly, for the others you get 50 00:04:47,222 --> 00:04:51,143 slightly surprising results but also intuitive ones. 51 00:04:51,143 --> 00:04:58,012 'Course' is a much more common word than 'map reduce', but it also occurs twice. 52 00:04:58,012 --> 00:05:04,002 So it comes up high in TF-IDF. So do 'map reduce' and 'web intelligence', 53 00:05:04,002 --> 00:05:10,580 even though they occur only once. What taking the log does is it makes sure 54 00:05:10,580 --> 00:05:18,661 that you give a higher weight to the term frequency as opposed to this ratio, 55 00:05:18,661 --> 00:05:27,631 but this ratio is also taken into account. 
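The calculation walked through above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tf_idf` and the `words` table are hypothetical, the logarithm is taken base 2 (so that log of two is one), and the only figures used are those quoted in the lecture (about 50 billion indexed pages, about 25 billion hits for 'the', close to 200 million hits for 'map reduce'). The term count of 2 for 'the' is inferred from the "multiply it by two" step; hit counts for the other words were not stated, so they are left out.

```python
import math

# Lecture's estimate of the number of pages in the indexed web.
TOTAL_DOCS = 50_000_000_000


def tf_idf(term_count, hit_count, total_docs=TOTAL_DOCS):
    """Return term_count * log2(total_docs / hit_count).

    term_count: occurrences of the word in the paragraph (TF).
    hit_count:  approximate number of web search results for the word.
    """
    if term_count < 0 or hit_count <= 0:
        raise ValueError("term_count must be >= 0 and hit_count must be > 0")
    return term_count * math.log2(total_docs / hit_count)


# word -> (occurrences in the paragraph, approximate web hits).
# Only the two words whose hit counts are quoted in the lecture are listed.
words = {
    "the": (2, 25_000_000_000),
    "map reduce": (1, 200_000_000),
}

scores = {word: tf_idf(tf, hits) for word, (tf, hits) in words.items()}

# Print the words from highest to lowest TF-IDF.
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {score:.2f}")
```

With these numbers 'the' scores 2 × log2(2) = 2, while 'map reduce' scores 1 × log2(250), which is just under 8. The rare term ranks well above the common one even though it occurs only once.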
So the top keywords for a paragraph can be 56 00:05:27,631 --> 00:05:34,288 automatically computed. Just as we might have guessed looking at the paragraph, 57 00:05:34,288 --> 00:05:40,081 this is about a course on web intelligence and 'map reduce', which makes a lot of sense. 58 00:05:40,081 --> 00:05:46,001 It's certainly not about media and certainly not about 'the'. 59 00:05:46,001 --> 00:05:50,095 So the machine has already done what we do fairly intuitively. 60 00:05:52,016 --> 00:05:57,088 Now let's ask the question: once you've got the keywords, could you possibly 61 00:05:57,088 --> 00:06:04,012 choose a good title for this document? Well, this is an open problem today. 62 00:06:04,012 --> 00:06:09,026 And I'll leave you to think about it and discuss this in the forum.