Let's recall our Bayesian classifiers for figuring out whether a query or a comment had positive or negative sentiment. Our Bayesian classifier looks something like this: we wanted to figure out whether somebody was a buyer or a browser depending on the words that they used in their query. The trouble is that the words "cheap" and "gift" may not be independent; that is, the probability that somebody uses the word "gift" depends on whether they use the word "cheap". That means the probability of using the word "gift", given that they use the word "cheap" and that they are a buyer, is not the same as the probability of "gift" on its own. As a result, we actually have a dependence between "cheap" and "gift", which is exhibited in this network by an arc between these two nodes. In the language of Bayesian networks, this means that to compute the posterior probability of a buy given any of these occurrences, our expansion needs to include the probability of "gift" given C and B, or alternatively the probability of "cheap" given G and B, depending on the order in which we expand the joint probability of G, C, and B. We saw this in the last segment.

Another example is sentiment from comments. Comments like "I don't like the course" and "I like the course, don't complain" both contain the words "don't" and "like", but they clearly express different sentiments. So at first we might include "don't" in our list of features, along with other negatives like "not", etc. But that doesn't quite do the job, because we also need to deal with the positional order in which these words occur. Just including negatives doesn't allow us to disambiguate between "I don't like the course" and "I like the course, don't complain".

The graphical model which might help us here has a class variable for the sentiment having observed i elements of the comment, and a class variable for the sentiment after observing i + 1 words of the comment. And we have, for every position i, the probability of the word at position i + 1 given the word at position i and S; in one case it is S at i + 1, in the other it is S at i, and the appropriate S is used for this likelihood. If we use these likelihoods and try to compute the most likely estimate for the probability of S being yes or no at the n-th position, we get what is called a hidden Markov model, which is another type of Bayesian network. It is especially useful when dealing with sequences, such as sentences or spoken words: trying to figure out what text one is actually trying to speak, extracting phonemes from speech, and other sequences. In such situations we may also need to accommodate holes, for example the probability of the word at position i + k given the word at position i, so that whatever is in between is skipped. You might have a "don't" before a "like", which might give you a positive or a negative sentiment, whereas if the "don't" comes after the "like", we might get a positive sentiment as more likely. So this is one example of how Bayesian networks allow us to go beyond independent features while building classifiers.

There is another application of probabilistic networks, that is, Bayesian networks and graphical models in general. We ask the question: how do facts like "Obama is the president of the USA" or "Manmohan Singh is the leader of India" arise from learning over large volumes of text? How do we get those individual facts and rules so critical to the semantic web vision? Suppose we want to learn facts of the form subject, verb, object from text.
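Before moving on to facts from text, here, purely as a sketch, is one way to write down in symbols the factorizations just described: B is the buyer/browser variable, C and G stand for "cheap" and "gift", and S_i, X_i are the sentiment state and the word at position i. The exact conditioning simply follows the arcs described above; treat it as an illustration, not the only possible form.

```latex
% Joint with the cheap--gift dependence, expanded in two alternative orders:
P(G, C, B) = P(B)\,P(C \mid B)\,P(G \mid C, B)
           = P(B)\,P(G \mid B)\,P(C \mid G, B)

% Posterior for a buy, given observed values c and g:
P(B \mid c, g) \;\propto\; P(B)\,P(c \mid B)\,P(g \mid c, B)

% Sequence (HMM-style) model for sentiment, with emissions that may also
% depend on the previous word, as described above:
P(S_{1:n}, X_{1:n}) = P(S_1)\,P(X_1 \mid S_1)
    \prod_{i=1}^{n-1} P(S_{i+1} \mid S_i)\,P(X_{i+1} \mid X_i, S_{i+1})
```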
We might use a Bayesian network of the following form, where we have subject, verb, and object, which processes triples in the text and might come up with a situation like "antibiotics kill bacteria". A single class variable is clearly not enough, since we need to have subject, verb, and object. In the language of the unified formulation of learning that we did last time, when we looked at learning from the perspective of f(x), we have many y's in the (Y, X) data, if you remember that. In addition, one needs to deal with positional order, so we can use a different graphical model, such as hierarchical Markov models or other types of models that look like this. We need to know the probability of any particular word X at position i occurring, given the previous word, the class of the previous word, and the class of the present word. For example, the probability that "kill" following "antibiotics" is a verb will depend on whether "antibiotics" is the subject. The situation is probably more apparent for the example "person gains weight", where the word "gains" can be a verb or a noun. So whether or not "gains" is a verb would depend on whether or not "person" is classified as a subject.

Now remember, this is all supervised learning. So what we have is a whole bunch of text labeled as subject, verb, object, from which we compute these likelihoods or probabilities. Using those, we figure out the posterior probabilities of subject, verb, and object for every word in the text. And wherever we have a high-scoring combination of S, V, and O, we can assert that the fact subject, verb, object has been learned from this piece of text. We also have to allow for holes, so that we have to deal with things like a subject at i minus k, a verb at i, and an object at i plus p. Now, this gets very complicated, especially since we can have a large number of words and a large number of possible ways of including holes, so Bayesian network models are not necessarily the most efficient. Other models, such as conditional random fields or Markov networks, turn out to be more efficient in this kind of situation.

Once we have found many facts, like "Obama is President of the USA", etc., and many instances of such facts, we cull from all these facts using support and confidence. So we'll disregard facts which we learn only once or twice and keep those facts which we have learned many times from different, independent pieces of text. And this is how large volumes of facts are, in fact, learned from the many millions and billions of documents on the web.

This whole exercise of learning from the web is called information extraction, or open information extraction, and there are many examples of such efforts. One of the oldest efforts is called Cyc. It's a semi-automated technique and has so far accumulated about two billion such facts. Yago is more recent and is the largest to date. It's run out of the Max Planck Institute in Germany and has uncovered more than six billion facts, and they're all linked together as a graph. So now "Obama is president of the USA", "the USA lies in North America", etc., are all linked together. "Albert Einstein was born in Ulm", for example, is a fact that Watson could have learned from a database like Yago. Watson actually uses facts culled from the web internally; it doesn't use Yago or Cyc, but it uses many, many webpages and textual documents and rules, and this is another example of open information extraction.
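To make the two steps just described concrete (scoring subject/verb/object labels with learned likelihoods, then culling facts by support), here is a toy sketch. It is not the actual system described in the lecture: the probability values and the minimum-support threshold below are illustrative placeholders, not numbers learned from any real corpus.

```python
# Toy sketch: score S/V/O labelings of a short sentence using likelihoods of
# the form P(word_i | prev_word, prev_class, curr_class), then keep only
# facts seen often enough (support). All numbers are illustrative.
from itertools import product
from collections import Counter

# Illustrative likelihood table: (prev_word, prev_class, curr_class, curr_word) -> P
likelihood = {
    (None,          None, "S", "antibiotics"): 0.20,
    ("antibiotics", "S",  "V", "kill"):        0.30,  # "kill" is likely a verb if "antibiotics" is the subject
    ("antibiotics", "O",  "V", "kill"):        0.05,  # ...and less so otherwise
    ("kill",        "V",  "O", "bacteria"):    0.25,
    ("kill",        "V",  "S", "bacteria"):    0.02,
}

def score(words, classes):
    """Product of the likelihoods for one candidate labeling."""
    p = 1.0
    prev_w, prev_c = None, None
    for w, c in zip(words, classes):
        p *= likelihood.get((prev_w, prev_c, c, w), 1e-6)  # small default for unseen combinations
        prev_w, prev_c = w, c
    return p

words = ["antibiotics", "kill", "bacteria"]
best = max(product(["S", "V", "O"], repeat=len(words)), key=lambda cs: score(words, cs))
print(best)  # ('S', 'V', 'O') scores highest with these toy numbers

# Step 2: keep only facts extracted from enough independent pieces of text.
extracted = [("antibiotics", "kill", "bacteria")] * 5 + [("person", "gains", "weight")]
MIN_SUPPORT = 3
facts = [f for f, n in Counter(extracted).items() if n >= MIN_SUPPORT]
print(facts)  # only the frequently observed fact survives
```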
ReVerb is by the same group as Yago. It's more recent, it's lightweight, and it has only got fifteen million S, V, O triples so far. For example, it has things like "potatoes are also rich in Vitamin C", so the verbs are also verb phrases. That's somewhat less useful than Yago in this sense, but it's much more diverse in terms of the kinds of phrases that it actually includes. The way ReVerb works, just to give you a flavor of how such systems work, is that it first tags each piece of text using natural language processing classifiers to say which is a noun phrase, which is a verb phrase, which is a preposition, etc. Then it focuses only on the verb phrases and figures out what the nearby noun phrases are, using classifiers just as we have discussed. It prefers proper nouns, especially if they occur often in other facts, so that words like "Einstein" are preferred over "person" or "scientist". And wherever possible, it manages to extract more than one fact from a piece of text. So a text like "Mozart was born in Salzburg, but moved to Vienna in 1781" yields two facts: "Mozart moved to Vienna" in addition to "Mozart was born in Salzburg" (a much-simplified sketch of this verb-centered extraction appears at the end of this section).

Now, I admit we have gone through this section fairly quickly. The point I wanted to make is that the ability to extract facts is significantly enhanced by the fact that we have so many, many documents available on the web, using a combination of supervised learning and unsupervised extraction like ReVerb, which is unsupervised in the sense that one is not actually labeling any small fraction of text as S, V, O; we're just using lower-level classifiers for part-of-speech tagging. By using a combination of these supervised and unsupervised learning techniques, one can actually extract large volumes of facts and rules from text, and then use the reasoning techniques that we started with in this lecture this week to actually move towards the semantic web vision. Along the way, of course, we have to deal with the limits of logic, which are fundamental, as well as those limits which come from uncertainty.
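As promised above, here is a much-simplified sketch of the verb-centered extraction idea, assuming the part-of-speech tags already come from a lower-level classifier. The hand-written tags, the tiny copula filter, and the nearest-noun helper are all illustrative simplifications, not ReVerb's actual pipeline or code.

```python
# Simplified sketch: anchor on verbs, then look for the nearest nouns on
# either side. (word, tag) pairs are written by hand for illustration.
# Tags: NNP = proper noun, VBD/VBN = verb forms, IN = preposition,
# CC = conjunction, CD = number.
tagged = [("Mozart", "NNP"), ("was", "VBD"), ("born", "VBN"), ("in", "IN"),
          ("Salzburg", "NNP"), (",", ","), ("but", "CC"), ("moved", "VBD"),
          ("to", "IN"), ("Vienna", "NNP"), ("in", "IN"), ("1781", "CD")]

def nearest_noun(tagged, start, step):
    """Walk left (step=-1) or right (step=+1) from a verb to the nearest noun.
    A real system would prefer proper nouns and whole noun phrases."""
    i = start + step
    while 0 <= i < len(tagged):
        word, tag = tagged[i]
        if tag.startswith("NN"):
            return word
        i += step
    return None

triples = []
for i, (word, tag) in enumerate(tagged):
    # Skip bare copulas; a crude stand-in for ReVerb's verb-phrase rules.
    if tag.startswith("VB") and word not in ("was", "is", "are"):
        subj = nearest_noun(tagged, i, -1)
        obj = nearest_noun(tagged, i, +1)
        if subj and obj:
            triples.append((subj, word, obj))

print(triples)
# [('Mozart', 'born', 'Salzburg'), ('Salzburg', 'moved', 'Vienna')]
# Note: a real extractor would resolve the subject of "moved" back to
# "Mozart" across the conjunction; this naive nearest-noun sketch does not.
```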