Now let's look at language itself in terms of information theory. Clearly, language is a channel through which we try to convey our meaning via spoken or written words, just as I'm doing right now. We try to ensure that the mutual information between what you receive, whether you hear it or see it as text, and the meaning I intend to convey is high, so that what you get is close to what I meant.

Many logicians, philosophers, linguists and computer scientists have studied this idea in great detail, and it is far from being a resolved issue. For example, Richard Montague viewed this from the perspective of truth versus falsehood: assuming that I'm conveying something true, are you able to discern that truth from the spoken or written words that you receive? So Montague's view was a logical one, where the purpose of language is to convey a truth, and the issue is whether or not that truth can be discerned.

Chomsky, on the other hand, viewed the problem from the perspective of grammar: whether or not a sentence is grammatically correct, and, at a deeper level, the different roles played by the actors and the verbs in the sentence. These constituted meaning for Chomsky, regardless of whether some real truth is actually being conveyed.

Some of you might think this is too philosophical for a discussion on web intelligence, but consider this: sentences which are grammatical need not convey any meaning; they could be completely nonsensical. At the same time, tweets or SMS messages are hardly ever grammatical, yet they do convey real meaning. So the distinctions are not purely philosophical; they actually have practical value. We'll return to this much later, when we talk about extracting information from spoken or written words.

For the moment, let's return to information theory and language, with the point that language is actually highly redundant. In particular, Shannon figured out that English is about 75 percent redundant. He came to this conclusion by conducting experiments, such as asking somebody to guess the next letter in a sentence. For example: "The lamp was on the d...". Most of you would guess "desk". Many such examples show that context, history and experience allow us to essentially predict the next word, or even the next letter.
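A minimal sketch of how such redundancy can be estimated, assuming a tiny made-up sample text and a 27-symbol alphabet (26 letters plus space); single-letter frequencies alone capture only a small part of the roughly 75 percent redundancy Shannon measured, since his figure relies on the long-range context that humans use when guessing the next letter.

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Single-letter (unigram) entropy of the text, in bits per character."""
    chars = [c for c in text.lower() if c.isalpha() or c == " "]
    counts = Counter(chars)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical sample; a real estimate would need a large corpus.
sample = ("the lamp was on the desk and the book was on the shelf "
          "language is a channel where we convey meaning with words")

h = char_entropy(sample)      # estimated bits per character
h_max = math.log2(27)         # 26 letters + space, all equally likely
print(f"entropy    : {h:.2f} bits/char")
print(f"maximum    : {h_max:.2f} bits/char")
print(f"redundancy : {1 - h / h_max:.0%}")  # fraction of capacity not 'used'
```

Higher-order models (bigrams, trigrams, whole words) push the estimated entropy down and the measured redundancy up toward Shannon's figure.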
His conclusion was that language is highly redundant, and for exactly the same reason we saw earlier: efficiency. Communicating information is more efficient if we use more bits, or more words, to transmit concepts which are rarer and therefore have more information content, as opposed to cases where we are transmitting something fairly obvious. It turns out that actual experiments with human subjects have confirmed that language tries to maintain a uniform information density: we use more words, and therefore more bits, when trying to convey something which is deeper, carries more information, or is a rarer event that the listener might not be expecting.

So is language all about statistics: redundancy, TF-IDF, counts? Well, imagine yourself at a party. You hear snippets of conversation; which ones catch your interest? Similarly, imagine a web intelligence program which is tapping Twitter, Facebook or even mails. It needs to figure out what people are talking about and who has similar interests. How might it do so? As we have seen in our discussion of keyword extraction, similar documents probably have similar TF-IDF keywords. So maybe we just need to compare documents by looking at the keywords we get using TF-IDF. Is this enough?

Think about words like river, bank, account, boat, sand and deposit. A river bank and a bank account are two different contexts for the word "bank". Similarly, "sand" occurring together with "bank" and "river" versus "sand" occurring with "deposit", as in sand deposits, gives two different concepts for the word "sand". So the semantics of a word depend on the context in which it is being used. Is this context itself computable? It requires a little more work than merely TF-IDF.
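To make the keyword-comparison idea concrete, here is a minimal sketch, with made-up toy documents, of comparing texts by the cosine similarity of their TF-IDF vectors; note that the token "bank" contributes to similarity regardless of which sense is meant, which is exactly the limitation just described.

```python
import math
from collections import Counter

# Hypothetical toy documents: one about a river bank, one about a bank
# account, one unrelated.
docs = {
    "d1": "the boat drifted along the river bank past the sand",
    "d2": "she opened a bank account to make a cash deposit",
    "d3": "heavy rain and strong winds are forecast for tomorrow",
}

tokenized = {name: text.split() for name, text in docs.items()}
df = Counter(w for words in tokenized.values() for w in set(words))
N = len(docs)

def tfidf(words):
    """TF-IDF weight for each word in one document."""
    tf = Counter(words)
    return {w: (tf[w] / len(words)) * math.log(N / df[w]) for w in tf}

def cosine(a, b):
    shared = set(a) & set(b)
    num = sum(a[w] * b[w] for w in shared)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

vecs = {name: tfidf(words) for name, words in tokenized.items()}
print(cosine(vecs["d1"], vecs["d2"]))  # nonzero, purely because both contain "bank"
print(cosine(vecs["d1"], vecs["d3"]))  # zero: no shared keywords at all
```

The river-bank document and the bank-account document look somewhat similar to TF-IDF even though they are about entirely different things; telling the two senses of "bank" apart needs the notion of context developed next.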
To figure out the semantics of a word, that is, the context in which it is used, we need to investigate which keywords and documents co-occur with each other frequently. The idea behind many techniques that try to compute such semantics is to view documents and words as a bipartite graph: documents on one side, keywords on the other. You figure out which words are contained in a document, which other documents contain those words, and then iterate further to figure out which documents are closer because they contain the same words, as well as which keywords are closer because they occur in the same documents. Techniques that exploit such iterations, probabilistically in fact, try to uncover the latent semantics, and so they are called latent models. They try to discover the topics that a collection of documents is talking about. They are also used in diverse areas such as computer vision, to figure out which objects are similar, or which sequences of moving objects represent the same kind of activity, and a variety of other kinds of meaning that we almost intuitively and unconsciously extract from words, from spoken language, and from the video we continuously see around us when we look at the world. All of these techniques, whether they are simple counts of words and their document frequencies, or more complicated co-occurrences across large collections of documents, are nevertheless statistical models. So the question we also need to ask is: is meaning, or semantics, just statistics, or is there more?
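To make the bipartite, latent-semantics idea concrete, here is a minimal sketch using a truncated singular value decomposition of a term-document matrix (one simple latent technique, not the specific probabilistic models referred to above); the toy documents are made up.

```python
import numpy as np

# Hypothetical toy documents: two about a river, two about finance.
docs = [
    "river bank sand boat",
    "boat sand river",
    "bank account deposit",
    "account deposit money",
]
vocab = sorted({w for d in docs for w in d.split()})

# Term-document matrix A: A[i, j] = count of word i in document j.
# Rows (words) and columns (documents) are the two sides of the bipartite graph.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[vocab.index(w), j] += 1

# Keep the two strongest latent directions ("topics").
U, S, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = (np.diag(S[:2]) @ Vt[:2]).T   # each document as a 2-D latent vector

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The two river documents end up close together, and far from the finance
# ones, even though document 0 also contains the ambiguous word "bank".
print(cos(doc_vecs[0], doc_vecs[1]))  # high: both river documents
print(cos(doc_vecs[0], doc_vecs[2]))  # noticeably lower: river vs. finance
```

Probabilistic latent models such as topic models follow the same intuition, grouping documents and words that keep co-occurring, but describe each group as a probability distribution over words.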