Now let's look at language itself, in terms of information theory. Clearly language is a channel through which we try to convey meaning via spoken or written words, just as I'm doing right now, and we try to ensure that what you receive, in terms of what you hear or the text you see, carries high mutual information with the meaning I intend to convey. Many logicians, philosophers, linguists and computer scientists have studied this idea in great detail, and it is far from a resolved issue. For example, Richard Montague viewed it from the perspective of truth versus falsehood: assuming that I am conveying something true, are you able to discern that truth from the spoken or written words you receive? Montague's view was a logical one, in which the purpose of language is to convey a truth, and the issue is whether or not that truth can be discerned. Chomsky, on the other hand, viewed the problem from the perspective of grammar: whether or not a sentence is grammatically correct, and, at a deeper level, the different roles played by the actors and verbs in the sentence. For Chomsky these constituted the meaning, regardless of whether any real truth is actually being conveyed.

Some of you might think this is too philosophical for a discussion on web intelligence, but consider this: sentences that are grammatical need not convey any meaning at all; they can be completely nonsensical. At the same time, tweets or SMS messages are hardly ever grammatical, yet they convey real meaning. So the distinctions are not purely philosophical; they have practical value. We'll return to this much later, when we talk about extracting information from spoken or written words.

For the moment, let's return to information theory and language, with the observation that language is highly redundant. In particular, Shannon estimated that English is roughly 75 percent redundant. He came to this conclusion through experiments such as asking somebody to guess the next letter in a sentence. For example: "the lamp was on the d...". Most of you would guess "desk". Many such examples show that context, history and experience allow us to predict the next word, or the next letter, much of the time. His conclusion was that language is highly redundant, and for much the same reason we saw earlier: efficiency. Communication is more efficient if we use more bits, or more words, to transmit concepts that are rarer and therefore carry more information, and fewer for things that are fairly obvious. It turns out that experiments with human subjects have confirmed that language tries to maintain a roughly uniform information density: we use more words, and therefore more bits, when conveying something deeper, more informative, or rarer, something the listener might not be expecting.

So is language all about statistics: redundancy, TF-IDF, counts? Well, imagine yourself at a party. You hear snippets of conversation; which ones catch your interest? Similarly, imagine a web intelligence program tapping Twitter, Facebook or even email. It needs to figure out what people are talking about and which of them have similar interests. How might it do so? As we saw in our discussion of keyword extraction, similar documents probably have similar TF-IDF keywords, so maybe we just need to compare documents by the keywords we extract using TF-IDF.
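To make that concrete, here is a minimal sketch of comparing documents by their TF-IDF vectors. The toy corpus, the use of scikit-learn, and the parameter choices are my own illustrative assumptions, not part of the lecture.

```python
# Compare documents by their TF-IDF vectors: documents that share
# high-weight keywords get a high cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (illustrative).
docs = [
    "the boat drifted along the river bank past the sand",
    "she opened a bank account and made a small deposit",
    "sand deposits build up along the river bank every year",
]

# Build TF-IDF vectors: rarer, more distinctive words get higher weight.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Pairwise cosine similarity between the document vectors.
print(cosine_similarity(tfidf).round(2))
```

With this toy corpus the first and third documents share the words river, bank and sand, so they come out as the most similar pair, even though all three documents contain the word bank.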
Is this enough? Think about words like river, bank, account, boat, sand, deposit. A river bank and a bank account are two different contexts for the word bank. Similarly, sand occurring together with bank and river, versus sand occurring with deposit, as in sand deposits, gives two different contexts for the word sand. So the semantics of a word depend on the context in which it is being used. Is this context itself computable? It requires a little more work than TF-IDF alone. To figure out the semantics of a word, that is, the context in which it is used, we need to investigate which documents and keywords co-occur most often. The idea behind many techniques that try to compute such semantics is to view documents and words as a bipartite graph: documents on one side, keywords on the other. You figure out which words are contained in a document, which other documents contain those words, and then iterate further to figure out which documents are closer because they contain the same words, and which keywords are closer because they occur in the same documents. Techniques that exploit such iterations, often probabilistically, try to uncover this latent semantics, so they are called latent models; a small sketch of one such model follows below. They try to discover the topics that a collection of documents is talking about. They are also used in diverse areas such as computer vision, to figure out which objects are similar, which sequences of moving objects represent the same kind of activity, and a variety of other kinds of meaning that we almost intuitively and unconsciously extract from words, from spoken language, and from the video we continuously see around us as we look at the world.

All these techniques, whether they are simple counts of words and document frequencies, or more complicated co-occurrences across large collections of documents, are nevertheless statistical models. So the question we also need to ask is: is meaning, or semantics, just statistics, or is there more?
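As a rough illustration of such a latent model, here is a sketch using latent semantic analysis, that is, a truncated SVD of the TF-IDF document-term matrix. This is one of the simplest latent techniques, not necessarily the specific method the lecture has in mind, and the corpus and the choice of two latent dimensions are illustrative assumptions.

```python
# Latent semantic analysis: factor the TF-IDF document-term matrix so that
# documents (and words) are mapped into a low-dimensional "topic" space,
# where co-occurrence patterns, not exact word matches, determine closeness.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the boat drifted along the river bank past the sand",
    "she opened a bank account and made a small deposit",
    "sand deposits build up along the river bank every year",
    "the account balance grew after every monthly deposit",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Keep two latent dimensions (an illustrative choice for a toy corpus).
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(tfidf)

# The intent: river/sand documents should land near each other in topic space,
# and away from the account/deposit documents, even though the ambiguous word
# "bank" appears in both groups.
print(cosine_similarity(doc_topics).round(2))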