Let's recall our Bayesian classifiers for figuring out whether a query or a comment had positive or negative sentiment. Our Bayesian classifier looks something like this: we wanted to figure out whether somebody was a buyer or a browser depending on the words that they used in their query.

The trouble is that the words "cheap" and "gift" may not be independent; that is, the probability that somebody uses the word "gift" depends on whether they use the word "cheap". That means the probability of using the word "gift", given that they use the word "cheap" and that they are a buyer, is not the same as the probability of "gift" on its own. As a result, we have a dependence between "cheap" and "gift", which is shown in this network by an arc between the two nodes. In the language of Bayesian networks, this means that to compute the posterior probability of a buyer given any of these occurrences, our expansion needs to include the probability of "gift" given C and B, or alternatively the probability of "cheap" given G and B, depending on the order in which we expand the joint probability of G, C, and B. We saw this in the last segment.

Another example is judging sentiment from comments. The comments "I don't like the course" and "I like the course, don't complain" both contain the words "don't" and "like", but they clearly express different sentiments. At first we might include "don't" in our list of features, along with other negatives like "not", but that doesn't quite do the job, because we also need to deal with the positional order in which these words occur. Just including negatives doesn't allow us to disambiguate between "I don't like the course" and "I like the course, don't complain".

So the graphical model that might help us here has a class variable S_i, the sentiment having observed the first i words of the comment, and a class variable S_{i+1}, the sentiment after observing i+1 words. For every position i we have the likelihood of X_{i+1} given X_i and the sentiment variable; in one place the appropriate sentiment is S_{i+1}, in another it is S_i, and the appropriate S is used for each likelihood. If we use these likelihoods and try to compute the most likely estimate for the probability of S being yes or no at the n-th position, we get what is called a hidden Markov model, which is another type of Bayesian network.
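To make this concrete, here is a minimal sketch in Python of the forward computation for such a sentiment model. It is not the lecture's actual model: the states, the transition table, and the bigram-style emission probabilities P(X_{i+1} | X_i, S) are all invented toy numbers, chosen only so that word order distinguishes the two example comments.

```python
# Toy forward pass for the sentiment HMM sketched above.
# Every probability below is an invented illustrative number,
# not an estimate from any real corpus.

STATES = ["pos", "neg"]              # hidden sentiment S_i at each position
PRIOR = {"pos": 0.5, "neg": 0.5}     # P(S_0)

# Transition probabilities P(S_{i+1} | S_i): sentiment tends to persist.
TRANS = {
    ("pos", "pos"): 0.9, ("pos", "neg"): 0.1,
    ("neg", "neg"): 0.9, ("neg", "pos"): 0.1,
}

# Emission probabilities P(X_{i+1} | X_i, S_{i+1}) for a few word bigrams;
# unseen bigrams fall back to a small constant.
EMIT = {
    ("don't", "like", "neg"): 0.30, ("don't", "like", "pos"): 0.02,
    ("like", "the", "neg"): 0.10,   ("like", "the", "pos"): 0.20,
    ("course", "don't", "pos"): 0.15, ("course", "don't", "neg"): 0.05,
}
FALLBACK = 0.01

def emit(prev_word, word, state):
    return EMIT.get((prev_word, word, state), FALLBACK)

def sentiment_posterior(words):
    """Forward recursion: returns P(S_n | words), normalised over the states."""
    alpha = {s: PRIOR[s] for s in STATES}
    for prev, cur in zip(words, words[1:]):
        alpha = {
            s2: sum(alpha[s1] * TRANS[(s1, s2)] * emit(prev, cur, s2)
                    for s1 in STATES)
            for s2 in STATES
        }
    total = sum(alpha.values())
    return {s: a / total for s, a in alpha.items()}

print(sentiment_posterior("i don't like the course".split()))
print(sentiment_posterior("i like the course don't complain".split()))
```

Under these toy numbers the first comment comes out mostly negative and the second mostly positive, precisely because the bigram emissions capture whether "don't" appears before or after "like".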
A hidden Markov model is especially useful when dealing with sequences, like sentences or spoken words: figuring out what text one is actually trying to say, extracting phonemes from speech, and other sequence problems. In such situations we may also need to accommodate holes, for example the probability of X_{i+k} given X_i. Depending on whatever lies in between, a "don't" before a "like" might indicate either a positive or a negative sentiment, whereas if the "don't" comes after the "like", a positive sentiment may be more likely. So this is one example of how Bayesian networks allow us to go beyond independent features while building classifiers.

There is another application of probabilistic networks, that is, of Bayesian networks and graphical models in general. We ask the question: how do facts like "Obama is the president of the USA" or "Manmohan Singh is the leader of India" arise from learning over large volumes of text? How do we get those individual facts and rules that are so critical to the semantic web vision?

Suppose we want to learn facts of the form subject-verb-object from text. We might use a Bayesian network of the following form, with subject, verb, and object variables, which processes triples in the text and might come up with something like "antibiotics kill bacteria". A single class variable is clearly not enough, since we need subject, verb, and object: in the language of the unified formulation of learning that we did last time, we have many y's in the (Y, X) data, when we looked at learning from the perspective of f(x), if you remember that. In addition, one needs to deal with positional order, so we can use a different graphical model, like hierarchical Markov models or other types of models that look like this.

We need to know the probability of X_i, that is, of any particular word occurring, given the previous word, the class of the previous word, and the class of the present word. For example, the probability that "kill", following "antibiotics", is a verb will depend on whether "antibiotics" is the subject. The situation is probably more apparent for the example "person gains weight", where the word "gains" can be a verb or a noun; whether or not "gains" is a verb depends on whether or not "person" is classified as a subject. Now remember, this is all supervised learning.
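As a rough illustration of how such a chain model assigns subject/verb/object tags, here is a small sketch that scores every possible tagging of a short sentence under P(c_i | c_{i-1}) and P(x_i | x_{i-1}, c_{i-1}, c_i). All the probability tables are invented for illustration; in practice they would be estimated from the labeled text described next.

```python
import itertools

# Toy scoring of subject/verb/object taggings for a short sentence under
# the chain model P(c_i | c_{i-1}) * P(x_i | x_{i-1}, c_{i-1}, c_i).
# All numbers are invented purely for illustration.

TAGS = ["S", "V", "O"]

# Tag-transition probabilities P(c_i | c_{i-1}); "^" marks the start.
TAG_TRANS = {
    ("^", "S"): 0.7, ("^", "V"): 0.2, ("^", "O"): 0.1,
    ("S", "V"): 0.8, ("S", "O"): 0.1, ("S", "S"): 0.1,
    ("V", "O"): 0.8, ("V", "V"): 0.1, ("V", "S"): 0.1,
    ("O", "O"): 0.4, ("O", "S"): 0.3, ("O", "V"): 0.3,
}

# Word likelihoods P(x_i | x_{i-1}, c_{i-1}, c_i) for a few tuples; anything
# unseen gets a small fallback.  Note that "gains" is far more likely to be
# a verb when the previous word has been tagged as the subject.
WORD_PROB = {
    ("^", "person", "^", "S"): 0.10,
    ("person", "gains", "S", "V"): 0.20,
    ("person", "gains", "S", "O"): 0.02,
    ("gains", "weight", "V", "O"): 0.15,
}
FALLBACK = 0.001

def score(words, tags):
    p, prev_word, prev_tag = 1.0, "^", "^"
    for word, tag in zip(words, tags):
        p *= TAG_TRANS[(prev_tag, tag)]
        p *= WORD_PROB.get((prev_word, word, prev_tag, tag), FALLBACK)
        prev_word, prev_tag = word, tag
    return p

words = "person gains weight".split()
best = max(itertools.product(TAGS, repeat=len(words)),
           key=lambda tags: score(words, tags))
print(best)   # ('S', 'V', 'O') under these toy numbers
```

In practice one would not enumerate all taggings but use dynamic programming (Viterbi-style decoding), with probabilities estimated from a labeled corpus, which is exactly the supervised step described next.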
So what we have is a whole bunch of text labeled as subject, verb, and object, from which we compute these likelihoods or probabilities. Using those, we figure out the a posteriori probabilities of subject, verb, and object for every word in the text, and wherever we have a high-scoring combination of S, V, and O, we can assert that the fact subject-verb-object has been learned from that piece of text. We also have to allow for holes, so that we deal with things like a subject at position i-k, a verb at i, and an object at i+p. This gets very complicated, especially since we can have a large number of words and a large number of possible ways of including holes, so Bayesian network models are not necessarily the most efficient; other models, such as conditional random fields or Markov networks, turn out to be more efficient in this kind of situation.

Once we have found many facts, like "Obama is president of the USA", and many instances of such facts, we cull from all these facts using support and confidence. That is, we disregard facts which we learn only once or twice and keep those facts which we have learned many times from different, independent pieces of text (see the small sketch below). This is how large volumes of facts are, in fact, learned from the many millions and billions of documents on the web.

This whole exercise of learning from the web is called information extraction, or open information extraction, and there are many examples of such efforts. One of the oldest is called Cyc; it is a semi-automated technique and has so far accumulated about two billion such facts. Yago is more recent and is the largest to date; it is run out of the Max Planck Institute in Germany and has uncovered more than six billion facts, all linked together as a graph. So now "Obama is president of the USA", "the USA lies in North America", and so on are all linked together. "Albert Einstein was born in Ulm", for example, is a fact that Watson could have learned from a database like Yago. Watson actually uses facts culled from the web internally; it doesn't use Yago or Cyc, but it uses many, many web pages, textual documents, and rules, and this is another example of open information extraction.
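Here is the small culling sketch referred to above: it simply counts how many independent extractions support each candidate triple and keeps only the well-supported ones. The triples and the support threshold are, of course, just illustrative.

```python
from collections import Counter

# Minimal sketch of the support-based culling step described above.
# `extractions` stands in for the (subject, verb, object) triples pulled
# from many independent documents; the triples and threshold are invented.
extractions = [
    ("Obama", "is president of", "USA"),
    ("Obama", "is president of", "USA"),
    ("Obama", "is president of", "USA"),
    ("antibiotics", "kill", "bacteria"),
    ("antibiotics", "kill", "bacteria"),
    ("Mozart", "moved to", "Vienna"),
]

MIN_SUPPORT = 3   # disregard facts seen only once or twice

counts = Counter(extractions)
kept = {fact: n for fact, n in counts.items() if n >= MIN_SUPPORT}
print(kept)       # only ('Obama', 'is president of', 'USA') survives here
```

A confidence-style filter works the same way, for example keeping a (subject, verb) pair's most frequent object only if it accounts for a large enough fraction of that pair's extractions.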
Reverb, by the same group as Yago, is more recent and lightweight; it has only got fifteen million S-V-O triples so far. For example, it has things like "potatoes are also rich in Vitamin C", so the verbs are really verb phrases. That is in some sense less useful than Yago, but it is much more diverse in terms of the kinds of phrases it actually includes.

The way REVERB works, just to give you a flavor of how such systems work, is that it first tags each piece of text using natural language processing classifiers to say which is a noun phrase, which is a verb phrase, which is a preposition, and so on. It then focuses only on the verb phrases and figures out which noun phrases are nearby, using classifiers just as we have discussed. It prefers proper nouns, especially ones that occur often in other facts, so that words like "Einstein" are preferred over "person" or "scientist". And wherever possible, it manages to extract more than one fact from a piece of text: a sentence like "Mozart was born in Salzburg, but moved to Vienna in 1781" yields two facts, "Mozart moved to Vienna" in addition to "Mozart was born in Salzburg". (A toy sketch of this kind of pipeline appears at the end of this section.)

Now, I admit we have gone through this section fairly quickly. The point I wanted to make is that the ability to extract facts is significantly enhanced by the fact that we have so many documents available on the web. We use a combination of supervised learning and unsupervised extraction like REVERB, which is unsupervised in the sense that one is not actually labeling any fraction of the text as S, V, O; we are just using lower-level classifiers for part-of-speech tagging. By using a combination of these supervised and unsupervised learning techniques, one can actually extract large volumes of facts and rules from text, and then use the reasoning techniques that we started with in this lecture this week to move towards the semantic web vision. Along the way, of course, we have to deal with the limits of logic, which are fundamental, as well as those limits which come from uncertainty.
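Finally, here is the toy sketch of a ReVerb-style extraction step promised earlier. It is not ReVerb's actual algorithm or code: a tiny hand-written part-of-speech lexicon replaces the real tagging classifiers, and the rules for picking nearby noun phrases are crude stand-ins for ReVerb's heuristics, kept only to show the flow from tags, to verb phrases, to nearby arguments, to triples.

```python
import re

# Very rough sketch of a ReVerb-style extraction step.  The POS lexicon and
# the argument-picking rules below are invented simplifications.

POS = {
    "mozart": "NNP", "salzburg": "NNP", "vienna": "NNP",
    "was": "VBD", "born": "VBN", "moved": "VBD",
    "in": "IN", "to": "TO", "but": "CC", "1781": "CD",
}

def tag(sentence):
    tokens = re.findall(r"\w+", sentence.lower())
    return [(t, POS.get(t, "NN")) for t in tokens]

def extract_triples(sentence):
    """Anchor on verb phrases and pair each with nearby noun phrases."""
    tagged = tag(sentence)
    triples, i = [], 0
    while i < len(tagged):
        if not tagged[i][1].startswith("VB"):
            i += 1
            continue
        # grow the verb phrase rightwards over verbs and prepositions
        j = i
        while j + 1 < len(tagged) and tagged[j + 1][1] in ("VBD", "VBN", "IN", "TO"):
            j += 1
        verb_phrase = " ".join(t for t, _ in tagged[i:j + 1])
        # subject: nearest proper noun to the left that is not itself the
        # object of a preposition (a crude preference for proper nouns)
        subj = next((t for k, (t, p) in reversed(list(enumerate(tagged[:i])))
                     if p == "NNP" and (k == 0 or tagged[k - 1][1] not in ("IN", "TO"))),
                    None)
        # object: nearest noun-like token to the right of the verb phrase
        obj = next((t for t, p in tagged[j + 1:] if p in ("NNP", "NN", "CD")), None)
        if subj and obj:
            triples.append((subj, verb_phrase, obj))
        i = j + 1
    return triples

print(extract_triples("Mozart was born in Salzburg, but moved to Vienna in 1781"))
# [('mozart', 'was born in', 'salzburg'), ('mozart', 'moved to', 'vienna')]
```

A real open-information-extraction system would of course use trained taggers and chunkers, plus the support and confidence filtering discussed earlier, but the overall shape of the pipeline is the same.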