Now let's look at language itself, in terms of information theory. Clearly language is a channel through which we try to convey meaning via spoken or written words, just as I'm doing right now, and we try to ensure that what you receive, in terms of what you hear or the text you see, carries high mutual information with the meaning I intend to convey. Many logicians, philosophers, linguists and computer scientists have studied this idea in great detail, and it is far from a resolved issue. For example, Richard Montague viewed it from the perspective of truth versus falsehood: assuming that I am conveying something true, are you able to discern that truth from the spoken or written words you receive? Montague's view was a logical one, in which the purpose of language is to convey a truth, and the issue is whether or not that truth can be discerned. Chomsky, on the other hand, viewed the problem from the perspective of grammar: whether or not a sentence is grammatically correct, and, at a deeper level, the different roles played by the actors and verbs in the sentence. For Chomsky these constituted the meaning, regardless of whether any real truth is actually being conveyed.

Some of you might think this is too philosophical for a discussion on web intelligence, but consider this: sentences that are grammatical need not convey any meaning at all; they can be completely nonsensical. At the same time, tweets or SMS messages are hardly ever grammatical, yet they convey real meaning. So the distinctions are not purely philosophical; they have practical value. We'll return to this much later, when we talk about extracting information from spoken or written words.

For the moment, let's return to information theory and language, with the observation that language is highly redundant. In particular, Shannon estimated that English is roughly 75 percent redundant. He came to this conclusion through experiments such as asking somebody to guess the next letter in a sentence. For example: "the lamp was on the d...". Most of you would guess "desk". Many such examples show that context, history and experience allow us to predict the next word, or the next letter, much of the time. His conclusion was that language is highly redundant, and for much the same reason we saw earlier: efficiency. Communication is more efficient if we use more bits, or more words, to transmit concepts that are rarer and therefore carry more information, and fewer for things that are fairly obvious. It turns out that experiments with human subjects have confirmed that language tries to maintain a roughly uniform information density: we use more words, and therefore more bits, when conveying something deeper, more informative, or rarer, something the listener might not be expecting.

So is language all about statistics: redundancy, TF-IDF, counts? Well, imagine yourself at a party. You hear snippets of conversation; which ones catch your interest? Similarly, imagine a web intelligence program tapping Twitter, Facebook or even email. It needs to figure out what people are talking about and which of them have similar interests. How might it do so? As we saw in our discussion of keyword extraction, similar documents probably have similar TF-IDF keywords, so maybe we just need to compare documents by the keywords we extract using TF-IDF.
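To make that concrete, here is a minimal sketch of comparing documents by their TF-IDF vectors. The toy corpus, the use of scikit-learn, and the parameter choices are my own illustrative assumptions, not part of the lecture.

```python
# Compare documents by their TF-IDF vectors: documents that share
# high-weight keywords get a high cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (illustrative).
docs = [
    "the boat drifted along the river bank past the sand",
    "she opened a bank account and made a small deposit",
    "sand deposits build up along the river bank every year",
]

# Build TF-IDF vectors: rarer, more distinctive words get higher weight.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Pairwise cosine similarity between the document vectors.
print(cosine_similarity(tfidf).round(2))
```

With this toy corpus the first and third documents share the words river, bank and sand, so they come out as the most similar pair, even though all three documents contain the word bank.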
Is this enough? Think about words like river, bank, account, boat, sand, deposit. A river bank and a bank account are two different contexts for the word bank. Similarly, sand occurring together with bank and river, versus sand occurring with deposit, as in sand deposits, gives two different contexts for the word sand. So the semantics of a word depend on the context in which it is being used. Is this context itself computable? It requires a little more work than TF-IDF alone. To figure out the semantics of a word, that is, the context in which it is used, we need to investigate which documents and keywords co-occur most often. The idea behind many techniques that try to compute such semantics is to view documents and words as a bipartite graph: documents on one side, keywords on the other. You figure out which words are contained in a document, which other documents contain those words, and then iterate further to figure out which documents are closer because they contain the same words, and which keywords are closer because they occur in the same documents. Techniques that exploit such iterations, often probabilistically, try to uncover this latent semantics, so they are called latent models; a small sketch of one such model follows below. They try to discover the topics that a collection of documents is talking about. They are also used in diverse areas such as computer vision, to figure out which objects are similar, which sequences of moving objects represent the same kind of activity, and a variety of other kinds of meaning that we almost intuitively and unconsciously extract from words, from spoken language, and from the video we continuously see around us as we look at the world.

All these techniques, whether they are simple counts of words and document frequencies, or more complicated co-occurrences across large collections of documents, are nevertheless statistical models. So the question we also need to ask is: is meaning, or semantics, just statistics, or is there more?
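As a rough illustration of such a latent model, here is a sketch using latent semantic analysis, that is, a truncated SVD of the TF-IDF document-term matrix. This is one of the simplest latent techniques, not necessarily the specific method the lecture has in mind, and the corpus and the choice of two latent dimensions are illustrative assumptions.

```python
# Latent semantic analysis: factor the TF-IDF document-term matrix so that
# documents (and words) are mapped into a low-dimensional "topic" space,
# where co-occurrence patterns, not exact word matches, determine closeness.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the boat drifted along the river bank past the sand",
    "she opened a bank account and made a small deposit",
    "sand deposits build up along the river bank every year",
    "the account balance grew after every monthly deposit",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Keep two latent dimensions (an illustrative choice for a toy corpus).
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(tfidf)

# The intent: river/sand documents should land near each other in topic space,
# and away from the account/deposit documents, even though the ambiguous word
# "bank" appears in both groups.
print(cosine_similarity(doc_topics).round(2))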