1
00:00:00,000 --> 00:00:08,041
To see mutual information in action, let's
turn to AdSense, which is the mechanism

2
00:00:08,041 --> 00:00:15,058
which Google uses to place ads in webpages
other than search results.

3
00:00:15,058 --> 00:00:23,503
In such cases there are no search terms
against which to match the keyword bids, so

4
00:00:23,503 --> 00:00:32,088
what AdSense does is figure out which
are the right keywords that best represent

5
00:00:32,088 --> 00:00:40,648
the webpage's actual content and use these to
decide which keyword bids should get ad

6
00:00:40,648 --> 00:00:46,072
space on the page.
For example, suppose you are reading this

7
00:00:46,072 --> 00:00:52,909
review of a camera.
Then the ad on the right has probably been

8
00:00:52,909 --> 00:00:59,422
posted by a company in multi-brand retail
that happened to bid high for the 'camera'

9
00:00:59,422 --> 00:01:03,345
keyword.
On the other hand, if you turn to a

10
00:01:03,345 --> 00:01:08,635
story about smartphones, you end up seeing
an ad about mobiles.

11
00:01:08,635 --> 00:01:14,917
What is happening is that the AdSense
code that Google asks you to put on your

12
00:01:14,917 --> 00:01:21,277
site is figuring out which are the best
keywords that represent the content on

13
00:01:21,277 --> 00:01:25,998
your page.
It then matches those to the keywords that are

14
00:01:25,998 --> 00:01:35,454
being bid on in the keyword auction.
In a sense this is an inverse of search.

15
00:01:35,454 --> 00:01:42,214
Think about it this way.
When you're searching, you give some query

16
00:01:42,214 --> 00:01:46,991
keywords.
And you want to get back the pages

17
00:01:46,991 --> 00:01:53,054
that best match them.
In this case, you're shown a page and the

18
00:01:53,054 --> 00:02:00,074
system needs to guess what are the best
possible keywords that you would have

19
00:02:00,074 --> 00:02:06,315
searched with if this was the page that
you really wanted as a result.

20
00:02:06,315 --> 00:02:11,625
Viewing this problem in the language of
information theory,

21
00:02:11,625 --> 00:02:19,224
the transmitted signal is the content of
the web page, and what you're receiving are

22
00:02:19,224 --> 00:02:27,645
the keywords. The channel that enables you
to do this is AdSense, which is Google's

23
00:02:27,645 --> 00:02:35,059
technique to guess keywords from content.
And it tries to maximize the mutual

24
00:02:35,059 --> 00:02:42,043
information between these two signals.
Search, on the other hand, is the reverse.

25
00:02:43,000 --> 00:02:50,058
You're given some keywords and you want
to find those pages which best match the

26
00:02:50,058 --> 00:02:56,061
keywords that you chose.
So, the question now reduces to how to

27
00:02:56,061 --> 00:03:03,093
maximize the mutual information between
the words on one side and the words that

28
00:03:03,093 --> 00:03:10,088
you want to receive on the other; that
would define the way you design an

29
00:03:10,088 --> 00:03:15,058
AdSense channel.
Of course, you might notice that we

30
00:03:15,058 --> 00:03:22,009
haven't yet defined what exactly is the
technique to compute the mutual

31
00:03:22,009 --> 00:03:26,046
information.
Be patient; we're going to come to that

32
00:03:26,046 --> 00:03:33,029
because mutual information is so deep a
concept that it applies in many contexts.

33
00:03:33,029 --> 00:03:38,022
For the moment,
bear with me and believe that there is a

34
00:03:38,022 --> 00:03:44,075
formula with which one can exactly
compute the mutual information between two

35
00:03:44,075 --> 00:03:50,000
signals.
Disregarding mutual information for the

36
00:03:50,000 --> 00:03:57,074
time being, let's think about how one
might construct the best possible keywords

37
00:03:57,074 --> 00:04:03,044
given a web page.
The converse problem of search is equally

38
00:04:03,044 --> 00:04:08,034
related.
Which terms in a query should one consider

39
00:04:08,034 --> 00:04:13,032
while searching?
Obviously, you don't need to worry about

40
00:04:13,032 --> 00:04:19,081
which documents match 'the', 'and', or 'a'.
We really should focus on those words in

41
00:04:19,081 --> 00:04:27,019
the query which are likely to be keywords
in the documents that you want to search

42
00:04:27,019 --> 00:04:30,056
for.
Let's figure this out intuitively.

43
00:04:30,056 --> 00:04:36,043
A word like 'the' conveys much
less about the content of some page

44
00:04:36,043 --> 00:04:42,030
describing computer science concepts
than, say, the word 'Turing', which is really

45
00:04:42,030 --> 00:04:45,068
likely to be on pages about computer
science.

46
00:04:47,069 --> 00:04:53,067
Clearly, rarer words, meaning words
that are not common across all documents,

47
00:04:53,067 --> 00:04:59,088
unlike 'the', 'a', and 'an', make better keywords.
Even other keywords like 'computer' might

48
00:04:59,088 --> 00:05:06,084
be present in many, many documents, but
they are certainly rarer than 'the', 'in', 'or',

49
00:05:06,084 --> 00:05:11,011
etcetera.
So, based on the principle that rarer

50
00:05:11,011 --> 00:05:17,073
words make better keywords,
the concept of inverse document frequency

51
00:05:17,073 --> 00:05:19,067
of a word
becomes useful.

52
00:05:19,067 --> 00:05:23,031
Now, what is inverse document
frequency?

53
00:05:23,057 --> 00:05:28,082
Let N be the total
number of documents,

54
00:05:28,082 --> 00:05:35,087
say all the documents on the web,
and out of these, N sub W contain the word

55
00:05:35,087 --> 00:05:39,034
W.
Then the ratio N over N sub W

56
00:05:39,034 --> 00:05:45,062
(obviously, N sub W will be at most N)
tells us the fraction, or rather, the

57
00:05:45,062 --> 00:05:52,039
inverse of the fraction, of documents which
contain W, as compared to all the documents.

58
00:05:52,039 --> 00:05:58,042
And then we take the logarithm of this
term and, obviously, we reverse the

59
00:05:58,042 --> 00:06:02,030
fraction
because otherwise the logarithm would

60
00:06:02,030 --> 00:06:06,076
become negative.
And we get what is called the inverse

61
00:06:06,076 --> 00:06:12,000
document frequency of a word.
Well, that's obviously not enough.

62
00:06:12,000 --> 00:06:17,067
Because a document needs to contain the
word itself if the word is to be a

63
00:06:17,067 --> 00:06:21,064
keyword.
And if it contains many instances of the

64
00:06:21,064 --> 00:06:27,005
word 'Turing', maybe fifteen times, for
example, it's much more likely that

65
00:06:27,005 --> 00:06:33,053
'Turing' is a keyword for that document
compared to, say, a document where the word

66
00:06:33,053 --> 00:06:39,052
appears only twice.
So, the second principle that we apply in

67
00:06:39,052 --> 00:06:45,000
our intuition is that more frequent words
make better keywords:

68
00:06:45,000 --> 00:06:51,090
more frequent in the document that we are
considering, not more frequent in general.

69
00:06:51,090 --> 00:06:58,063
So rarer words overall, but more frequent
in the document that we are considering.

70
00:06:58,063 --> 00:07:04,008
So we simply multiply
the inverse document frequency with

71
00:07:04,008 --> 00:07:09,062
another term,
which is the frequency of the word in a

72
00:07:09,062 --> 00:07:14,071
given document.
Say the word occurs five times; then

73
00:07:14,071 --> 00:07:22,087
N sub W sub T is five, and so on.
So TF-IDF is nothing but the Term Frequency

74
00:07:22,087 --> 00:07:27,076
multiplied by the Inverse Document
Frequency.

75
00:07:27,076 --> 00:07:34,039
Words having a high TF-IDF are considered
to be good keywords.
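
The two definitions above can be written compactly. This is only a notational sketch: N and N_w follow the lecture, while the symbol n_{w,d} for the count of word w in document d, and the choice to leave the logarithm base unspecified, are assumptions of this note.

```latex
\mathrm{idf}(w) = \log\frac{N}{N_w},
\qquad
\mathrm{tfidf}(w, d) = n_{w,d} \cdot \log\frac{N}{N_w}
```

A word that appears in every document has N_w = N, so its IDF, and hence its TF-IDF, is zero no matter how often it occurs.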

76
00:07:35,019 --> 00:07:40,048
Apart from guessing keywords, think about
it from the search perspective.

77
00:07:41,018 --> 00:07:49,041
If you're searching with a query which has
certain words whose IDF is high, you

78
00:07:49,041 --> 00:07:56,095
would like to use those in your query.
At the same time, when you index a word,

79
00:07:56,095 --> 00:08:04,000
you want to weight it by its TF-IDF value.
If a word occurs a hundred times in a

80
00:08:04,000 --> 00:08:10,076
document, but the word is 'the', weighting
that element in the index by 100 doesn't

81
00:08:10,076 --> 00:08:14,083
make sense.
But if a word like 'Turing'

82
00:08:14,083 --> 00:08:20,029
occurs a hundred times, then weighting it with
a high value makes sense.

83
00:08:20,029 --> 00:08:24,012
TF-IDF accurately captures this
intuition.
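
The weighting intuition above can be tried out on a toy corpus. The following is a minimal sketch, not the method any real search engine or AdSense uses: the function name `tf_idf_weights`, the naive lowercase whitespace tokenization, the natural logarithm, and the three sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return one {word: tf-idf weight} dict per document.

    Tokenization is a naive lowercase split on whitespace, which is an
    assumption of this sketch and not something taken from the lecture.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)  # N: total number of documents

    # N_w: number of documents that contain word w at least once.
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    weights = []
    for words in tokenized:
        term_freq = Counter(words)  # occurrences of each word in this document
        weights.append({
            w: count * math.log(n_docs / doc_freq[w])
            for w, count in term_freq.items()
        })
    return weights


# Hypothetical toy corpus: 'the' appears in every document, 'turing' in one.
docs = [
    "the turing machine and the turing test",
    "the camera review",
    "the smartphone story",
]
weights = tf_idf_weights(docs)
```

In this corpus 'the' occurs in all three documents, so its weight is zero however often it repeats, while 'turing' occurs twice in a single document and gets the largest weight there.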