1
00:00:00,000 --> 00:00:04,092
Now let's see what TF-IDF has to do with
mutual information.

2
00:00:04,092 --> 00:00:11,060
Remember that we have a transmitted signal
which is the content of the web page.

3
00:00:11,060 --> 00:00:18,028
And we want to somehow compute what are
the best keywords that should represent

4
00:00:18,028 --> 00:00:24,029
this webpage, so that the mutual
information between the content and the

5
00:00:24,029 --> 00:00:29,098
webpage keywords is high.
Our channel in this case is TF-IDF.

6
00:00:29,098 --> 00:00:36,092
We're trying to figure out what this
channel does in terms of maximizing the

7
00:00:36,092 --> 00:00:43,069
mutual information.
TF-IDF was actually invented, as we just

8
00:00:43,069 --> 00:00:47,079
argued, as an intuitive or heuristic
technique.

9
00:00:47,079 --> 00:00:54,023
But it has been shown recently that the mutual
information between all pages in a

10
00:00:54,023 --> 00:01:00,050
collection and all the words in the
collection is actually proportional to

11
00:01:00,050 --> 00:01:06,926
this sum which is essentially the
individual TF-IDF of each word summed up

12
00:01:06,926 --> 00:01:14,557
over all the documents. So if you take every
word in the collection, compute its TF-IDF, and

13
00:01:14,557 --> 00:01:22,932
add that up across all documents and all
words, you'll get the mutual information

14
00:01:22,932 --> 00:01:29,015
between all words and all pages.
This is certainly very interesting because

15
00:01:29,015 --> 00:01:35,040
it puts this fairly intuitive, heuristic
technique on a firm mathematical footing.

16
00:01:35,073 --> 00:01:42,003
I am conscious of the fact that I haven't
defined for you exactly what is mutual

17
00:01:42,003 --> 00:01:45,083
information.
But like I said earlier, bear with me

18
00:01:45,083 --> 00:01:51,758
because there are many instances where
mutual information is important so when

19
00:01:51,758 --> 00:01:57,114
you finally see the formula, it'll become
extremely interesting.

20
00:01:57,114 --> 00:02:04,393
Let's try to compute now the best keywords
that represent this paragraph taken from

21
00:02:04,393 --> 00:02:10,207
the landing page for this course.
Well, let's try to compute the best

22
00:02:10,207 --> 00:02:18,081
keywords for this paragraph taken from the
course landing page using TF-IDF.

23
00:02:19,052 --> 00:02:25,056
The term frequencies for each word are
easily calculated; they're merely the number

24
00:02:25,056 --> 00:02:30,094
of times each word occurs in this
paragraph, but what about the document

25
00:02:30,094 --> 00:02:34,098
frequencies?
We only have one paragraph so where do we

26
00:02:34,098 --> 00:02:36,069
look?
What do you think?

27
00:02:37,002 --> 00:02:43,004
Well, what is the largest document
collection available to all of us?

28
00:02:43,004 --> 00:02:47,087
The web, obviously.
So to find out if a word is rare or

29
00:02:47,087 --> 00:02:53,090
common, we just search for it on the web.
And look at the number of results that

30
00:02:53,090 --> 00:02:59,067
turn up.
We also need an estimate of all the

31
00:02:59,067 --> 00:03:06,025
documents on the web.
And we estimated that last week using

32
00:03:06,025 --> 00:03:13,634
searches for common words, which told us that
around 50 billion pages are indexed by a

33
00:03:13,634 --> 00:03:18,868
search engine like Google.
I would like to mention here that the

34
00:03:18,868 --> 00:03:24,575
search engines don't actually index
every possible URL, so the total

35
00:03:24,575 --> 00:03:27,925
number of URLs is much, much larger than
50 billion.

36
00:03:27,925 --> 00:03:33,936
There has been an animated discussion in
the forum regarding this point and I would

37
00:03:33,936 --> 00:03:37,438
like to thank everyone who contributed to
that.

38
00:03:37,438 --> 00:03:42,662
However, for the purpose of this
discussion, we merely need an estimate of

39
00:03:42,662 --> 00:03:47,449
how rare or frequent the word is and
taking just the indexed web as our

40
00:03:47,449 --> 00:03:52,726
estimate is good enough.
So let's see what we get by searching for

41
00:03:52,726 --> 00:03:58,404
the different words in this paragraph.
Searching for 'the', we get around

42
00:03:58,404 --> 00:04:03,003
25,000,000,000 results.
Searching for 'map reduce', on the other

43
00:04:03,003 --> 00:04:09,606
hand, we get close to 200,000,000 results.
We can similarly calculate the number of

44
00:04:09,606 --> 00:04:13,980
hits we get for the other words in this
paragraph.

45
00:04:13,980 --> 00:04:21,618
We compute the ratio of 50 billion to the number of hits,
where 50 billion is our estimate for the

46
00:04:21,618 --> 00:04:29,333
total number of documents on the web.
That gives the IDF before taking logs. Let

47
00:04:29,333 --> 00:04:35,692
me take the log, multiply it by the
frequency of the term in the paragraph

48
00:04:35,692 --> 00:04:41,078
itself and we get the TF-IDF value.
Well, here the log (base 2) of two is one, obviously, and

49
00:04:41,078 --> 00:04:47,222
so you multiply it by two and you get two,
but interestingly for the others you get

50
00:04:47,222 --> 00:04:51,143
slightly surprising results but also
intuitive ones.

51
00:04:51,143 --> 00:04:58,012
'Course' is a much more common word
than 'map reduce', but it also occurs twice.

52
00:04:58,012 --> 00:05:04,002
So it comes up high in TF-IDF.
So do 'map reduce' and 'web intelligence',

53
00:05:04,002 --> 00:05:10,580
even though they occur only once.
What taking the log does is it makes sure

54
00:05:10,580 --> 00:05:18,661
that you give a higher weight to the
term frequency as opposed to this ratio,

55
00:05:18,661 --> 00:05:27,631
but this ratio is also taken into account.
So the top keywords for a paragraph can be

56
00:05:27,631 --> 00:05:34,288
automatically computed. Just as we might
have guessed looking at the paragraph,

57
00:05:34,288 --> 00:05:40,081
this is about a course on web intelligence
and 'map reduce'; that makes a lot of sense.

58
00:05:40,081 --> 00:05:46,001
It's certainly not about media and
certainly not about 'the'.

59
00:05:46,001 --> 00:05:50,095
So the machine has already done what we do
fairly intuitively.

60
00:05:52,016 --> 00:05:57,088
Now let's ask the question: once you've
got the keywords, could you possibly

61
00:05:57,088 --> 00:06:04,012
choose a good title for this document?
Well, this is an open problem today.

62
00:06:04,012 --> 00:06:09,026
And I'll leave you to think about it and
discuss this in the forum.
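
The TF-IDF calculation walked through in this segment can be sketched roughly as follows. This is only an illustration under stated assumptions: the log is taken base 2 (so that the log of two is one, as in the lecture), the 50 billion total and the hit counts for 'the' and 'map reduce' are the approximate figures quoted above, and the count of 2 for 'the' is inferred from the "multiply it by two" remark. The names `tf_idf`, `TOTAL_DOCS`, and `terms` are made up for this sketch.

```python
import math

# Lecture's estimate of the number of pages in the indexed web (assumption).
TOTAL_DOCS = 50_000_000_000


def tf_idf(term_count: int, hits: int, total_docs: int = TOTAL_DOCS) -> float:
    """Return term_count * log2(total_docs / hits).

    term_count: how often the term occurs in the paragraph.
    hits: approximate number of web search results for the term.
    """
    if term_count < 0:
        raise ValueError("term_count must be non-negative")
    if hits <= 0 or hits > total_docs:
        raise ValueError("hits must be in the range 1..total_docs")
    return term_count * math.log2(total_docs / hits)


# term -> (count in the paragraph, approximate web hits), as quoted in the lecture.
terms = {
    "the": (2, 25_000_000_000),
    "map reduce": (1, 200_000_000),
}

scores = {term: tf_idf(count, hits) for term, (count, hits) in terms.items()}

# Print the terms from highest to lowest TF-IDF.
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}: {score:.2f}")
```

With these figures, 'map reduce' ranks above 'the' even though it occurs only once, which matches the observation in the lecture that rare terms come up high.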