Now let's look at language itself in terms of information theory. Clearly, language is a channel through which we try to convey our meaning via spoken or written words, just as I'm doing right now. We try to ensure that the mutual information between what you receive, whether you hear it or see it as text, and the meaning I intend to convey is high, so that what you get is close to what I meant.

Many logicians, philosophers, linguists and computer scientists have studied this idea in great detail, and it is far from being a resolved issue. For example, Richard Montague viewed this from the perspective of truth versus falsehood: assuming that I'm conveying something true, are you able to discern that truth from the spoken or written words that you receive? So Montague's view was a logical one, where the purpose of language is to convey a truth, and the issue is whether or not that truth can be discerned.

Chomsky, on the other hand, viewed the problem from the perspective of grammar: whether or not a sentence is grammatically correct, and, at a deeper level, the different roles played by the actors and the verbs in the sentence. These constituted meaning for Chomsky, regardless of whether some real truth is actually being conveyed.

Some of you might think this is too philosophical for a discussion on web intelligence, but consider this: sentences which are grammatical need not convey any meaning; they could be completely nonsensical. At the same time, tweets or SMS messages are hardly ever grammatical, yet they do convey real meaning. So the distinctions are not purely philosophical; they actually have practical value. We'll return to this much later, when we talk about extracting information from spoken or written words.

For the moment, let's return to information theory and language, with the point that language is actually highly redundant. In particular, Shannon figured out that English is about 75 percent redundant. He came to this conclusion by conducting experiments, such as asking somebody to guess the next letter in a sentence. For example: "The lamp was on the d...". Most of you would guess "desk". Many such examples show that context, history and experience allow us to essentially predict the next word, or even the next letter.
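A minimal sketch of how such redundancy can be estimated, assuming a tiny made-up sample text and a 27-symbol alphabet (26 letters plus space); single-letter frequencies alone capture only a small part of the roughly 75 percent redundancy Shannon measured, since his figure relies on the long-range context that humans use when guessing the next letter.

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Single-letter (unigram) entropy of the text, in bits per character."""
    chars = [c for c in text.lower() if c.isalpha() or c == " "]
    counts = Counter(chars)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical sample; a real estimate would need a large corpus.
sample = ("the lamp was on the desk and the book was on the shelf "
          "language is a channel where we convey meaning with words")

h = char_entropy(sample)      # estimated bits per character
h_max = math.log2(27)         # 26 letters + space, all equally likely
print(f"entropy    : {h:.2f} bits/char")
print(f"maximum    : {h_max:.2f} bits/char")
print(f"redundancy : {1 - h / h_max:.0%}")  # fraction of capacity not 'used'
```

Higher-order models (bigrams, trigrams, whole words) push the estimated entropy down and the measured redundancy up toward Shannon's figure.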
His conclusion was that language is highly redundant, and for exactly the same reason we saw earlier: efficiency. Communicating information is more efficient if we use more bits, or more words, to transmit concepts which are rarer and therefore have more information content, as opposed to cases where we are transmitting something fairly obvious. It turns out that actual experiments with human subjects have confirmed that language tries to maintain a uniform information density: we use more words, and therefore more bits, when trying to convey something which is deeper, carries more information, or is a rarer event that the listener might not be expecting.

So is language all about statistics: redundancy, TF-IDF, counts? Well, imagine yourself at a party. You hear snippets of conversation; which ones catch your interest? Similarly, imagine a web intelligence program which is tapping Twitter, Facebook or even mails. It needs to figure out what people are talking about and who has similar interests. How might it do so? As we have seen in our discussion of keyword extraction, similar documents probably have similar TF-IDF keywords. So maybe we just need to compare documents by looking at the keywords we get using TF-IDF. Is this enough?

Think about words like river, bank, account, boat, sand and deposit. A river bank and a bank account are two different contexts for the word "bank". Similarly, "sand" occurring together with "bank" and "river" versus "sand" occurring with "deposit", as in sand deposits, gives two different concepts for the word "sand". So the semantics of a word depend on the context in which it is being used. Is this context itself computable? It requires a little more work than merely TF-IDF.
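To make the keyword-comparison idea concrete, here is a minimal sketch, with made-up toy documents, of comparing texts by the cosine similarity of their TF-IDF vectors; note that the token "bank" contributes to similarity regardless of which sense is meant, which is exactly the limitation just described.

```python
import math
from collections import Counter

# Hypothetical toy documents: one about a river bank, one about a bank
# account, one unrelated.
docs = {
    "d1": "the boat drifted along the river bank past the sand",
    "d2": "she opened a bank account to make a cash deposit",
    "d3": "heavy rain and strong winds are forecast for tomorrow",
}

tokenized = {name: text.split() for name, text in docs.items()}
df = Counter(w for words in tokenized.values() for w in set(words))
N = len(docs)

def tfidf(words):
    """TF-IDF weight for each word in one document."""
    tf = Counter(words)
    return {w: (tf[w] / len(words)) * math.log(N / df[w]) for w in tf}

def cosine(a, b):
    shared = set(a) & set(b)
    num = sum(a[w] * b[w] for w in shared)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

vecs = {name: tfidf(words) for name, words in tokenized.items()}
print(cosine(vecs["d1"], vecs["d2"]))  # nonzero, purely because both contain "bank"
print(cosine(vecs["d1"], vecs["d3"]))  # zero: no shared keywords at all
```

The river-bank document and the bank-account document look somewhat similar to TF-IDF even though they are about entirely different things; telling the two senses of "bank" apart needs the notion of context developed next.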
To figure out the semantics of a word, that is, the context in which it is used, we need to investigate which keywords and documents co-occur with each other frequently. The idea behind many techniques that try to compute such semantics is to view documents and words as a bipartite graph: documents on one side, keywords on the other. You figure out which words are contained in a document, which other documents contain those words, and then iterate further to figure out which documents are closer because they contain the same words, as well as which keywords are closer because they occur in the same documents. Techniques that exploit such iterations, probabilistically in fact, try to uncover the latent semantics, and so they are called latent models. They try to discover the topics that a collection of documents is talking about. They are also used in diverse areas such as computer vision, to figure out which objects are similar, or which sequences of moving objects represent the same kind of activity, and a variety of other kinds of meaning that we almost intuitively and unconsciously extract from words, from spoken language, and from the video we continuously see around us when we look at the world. All of these techniques, whether they are simple counts of words and their document frequencies, or more complicated co-occurrences across large collections of documents, are nevertheless statistical models. So the question we also need to ask is: is meaning, or semantics, just statistics, or is there more?
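To make the bipartite, latent-semantics idea concrete, here is a minimal sketch using a truncated singular value decomposition of a term-document matrix (one simple latent technique, not the specific probabilistic models referred to above); the toy documents are made up.

```python
import numpy as np

# Hypothetical toy documents: two about a river, two about finance.
docs = [
    "river bank sand boat",
    "boat sand river",
    "bank account deposit",
    "account deposit money",
]
vocab = sorted({w for d in docs for w in d.split()})

# Term-document matrix A: A[i, j] = count of word i in document j.
# Rows (words) and columns (documents) are the two sides of the bipartite graph.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[vocab.index(w), j] += 1

# Keep the two strongest latent directions ("topics").
U, S, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = (np.diag(S[:2]) @ Vt[:2]).T   # each document as a 2-D latent vector

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The two river documents end up close together, and far from the finance
# ones, even though document 0 also contains the ambiguous word "bank".
print(cos(doc_vecs[0], doc_vecs[1]))  # high: both river documents
print(cos(doc_vecs[0], doc_vecs[2]))  # noticeably lower: river vs. finance
```

Probabilistic latent models such as topic models follow the same intuition, grouping documents and words that keep co-occurring, but describe each group as a probability distribution over words.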