Let's recall our Bayesian classifiers for figuring out whether a query or a comment had positive or negative sentiment. Our Bayesian classifier looks something like this: we wanted to figure out whether somebody was a buyer or a browser depending on the words that they used in their query. The trouble is that the words "cheap" and "gift" may not be independent; that is, the probability that somebody uses the word "gift" depends on whether they use the word "cheap". That means the probability of using the word "gift", given that they use the word "cheap" and that they are a buyer, is not the same as the probability of "gift" on its own. As a result, we actually have a dependence between "cheap" and "gift", which is exhibited in this network by an arc between these two nodes. In the language of Bayesian networks, this means that to compute the posterior probability of a buy given any of these occurrences, our expansion needs to include the probability of "gift" given C and B, or alternatively the probability of "cheap" given G and B, depending on the order in which we expand the joint probability of G, C, and B. We saw this in the last segment.

Another example is sentiment from comments. Comments like "I don't like the course" and "I like the course, don't complain" both contain the words "don't" and "like", but they clearly express different sentiments. So at first we might include "don't" in our list of features, along with other negatives like "not", etc. But that doesn't quite do the job, because we also need to deal with the positional order in which these words occur. Just including negatives doesn't allow us to disambiguate between "I don't like the course" and "I like the course, don't complain".

The graphical model which might help us here has a class variable for the sentiment having observed i elements of the comment, and a class variable for the sentiment after observing i + 1 words of the comment. And we have, for every position i, the probability of the word at position i + 1 given the word at position i and S; in one case it is S at i + 1, in the other it is S at i, and the appropriate S is used for this likelihood. If we use these likelihoods and try to compute the most likely estimate for the probability of S being yes or no at the n-th position, we get what is called a hidden Markov model, which is another type of Bayesian network. It is especially useful when dealing with sequences, such as sentences or spoken words: trying to figure out what text one is actually trying to speak, extracting phonemes from speech, and other sequences. In such situations we may also need to accommodate holes, for example the probability of the word at position i + k given the word at position i, so that whatever is in between is skipped. You might have a "don't" before a "like", which might give you a positive or a negative sentiment, whereas if the "don't" comes after the "like", we might get a positive sentiment as more likely. So this is one example of how Bayesian networks allow us to go beyond independent features while building classifiers.

There is another application of probabilistic networks, that is, Bayesian networks and graphical models in general. We ask the question: how do facts like "Obama is the president of the USA" or "Manmohan Singh is the leader of India" arise from learning over large volumes of text? How do we get those individual facts and rules so critical to the semantic web vision? Suppose we want to learn facts of the form subject, verb, object from text.
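Before moving on to facts from text, here, purely as a sketch, is one way to write down in symbols the factorizations just described: B is the buyer/browser variable, C and G stand for "cheap" and "gift", and S_i, X_i are the sentiment state and the word at position i. The exact conditioning simply follows the arcs described above; treat it as an illustration, not the only possible form.

```latex
% Joint with the cheap--gift dependence, expanded in two alternative orders:
P(G, C, B) = P(B)\,P(C \mid B)\,P(G \mid C, B)
           = P(B)\,P(G \mid B)\,P(C \mid G, B)

% Posterior for a buy, given observed values c and g:
P(B \mid c, g) \;\propto\; P(B)\,P(c \mid B)\,P(g \mid c, B)

% Sequence (HMM-style) model for sentiment, with emissions that may also
% depend on the previous word, as described above:
P(S_{1:n}, X_{1:n}) = P(S_1)\,P(X_1 \mid S_1)
    \prod_{i=1}^{n-1} P(S_{i+1} \mid S_i)\,P(X_{i+1} \mid X_i, S_{i+1})
```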
We might use a Bayesian network of the following form, where we have subject, verb, and object, which processes triples in the text and might come up with a situation like "antibiotics kill bacteria". A single class variable is clearly not enough, since we need to have subject, verb, and object. In the language of the unified formulation of learning that we did last time, when we looked at learning from the perspective of f(x), we have many y's in the (Y, X) data, if you remember that. In addition, one needs to deal with positional order, so we can use a different graphical model, such as hierarchical Markov models or other types of models that look like this. We need to know the probability of any particular word X at position i occurring, given the previous word, the class of the previous word, and the class of the present word. For example, the probability that "kill" following "antibiotics" is a verb will depend on whether "antibiotics" is the subject. The situation is probably more apparent for the example "person gains weight", where the word "gains" can be a verb or a noun. So whether or not "gains" is a verb would depend on whether or not "person" is classified as a subject.

Now remember, this is all supervised learning. So what we have is a whole bunch of text labeled as subject, verb, object, from which we compute these likelihoods or probabilities. Using those, we figure out the posterior probabilities of subject, verb, and object for every word in the text. And wherever we have a high-scoring combination of S, V, and O, we can assert that the fact subject, verb, object has been learned from this piece of text. We also have to allow for holes, so that we have to deal with things like a subject at i minus k, a verb at i, and an object at i plus p. Now, this gets very complicated, especially since we can have a large number of words and a large number of possible ways of including holes, so Bayesian network models are not necessarily the most efficient. Other models, such as conditional random fields or Markov networks, turn out to be more efficient in this kind of situation.

Once we have found many facts, like "Obama is President of the USA", etc., and many instances of such facts, we cull from all these facts using support and confidence. So we'll disregard facts which we learn only once or twice and keep those facts which we have learned many times from different, independent pieces of text. And this is how large volumes of facts are, in fact, learned from the many millions and billions of documents on the web.

This whole exercise of learning from the web is called information extraction, or open information extraction, and there are many examples of such efforts. One of the oldest efforts is called Cyc. It's a semi-automated technique and has so far accumulated about two billion such facts. Yago is more recent and is the largest to date. It's run out of the Max Planck Institute in Germany and has uncovered more than six billion facts, and they're all linked together as a graph. So now "Obama is president of the USA", "the USA lies in North America", etc., are all linked together. "Albert Einstein was born in Ulm", for example, is a fact that Watson could have learned from a database like Yago. Watson actually uses facts culled from the web internally; it doesn't use Yago or Cyc, but it uses many, many webpages and textual documents and rules, and this is another example of open information extraction.
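To make the two steps just described concrete (scoring subject/verb/object labels with learned likelihoods, then culling facts by support), here is a toy sketch. It is not the actual system described in the lecture: the probability values and the minimum-support threshold below are illustrative placeholders, not numbers learned from any real corpus.

```python
# Toy sketch: score S/V/O labelings of a short sentence using likelihoods of
# the form P(word_i | prev_word, prev_class, curr_class), then keep only
# facts seen often enough (support). All numbers are illustrative.
from itertools import product
from collections import Counter

# Illustrative likelihood table: (prev_word, prev_class, curr_class, curr_word) -> P
likelihood = {
    (None,          None, "S", "antibiotics"): 0.20,
    ("antibiotics", "S",  "V", "kill"):        0.30,  # "kill" is likely a verb if "antibiotics" is the subject
    ("antibiotics", "O",  "V", "kill"):        0.05,  # ...and less so otherwise
    ("kill",        "V",  "O", "bacteria"):    0.25,
    ("kill",        "V",  "S", "bacteria"):    0.02,
}

def score(words, classes):
    """Product of the likelihoods for one candidate labeling."""
    p = 1.0
    prev_w, prev_c = None, None
    for w, c in zip(words, classes):
        p *= likelihood.get((prev_w, prev_c, c, w), 1e-6)  # small default for unseen combinations
        prev_w, prev_c = w, c
    return p

words = ["antibiotics", "kill", "bacteria"]
best = max(product(["S", "V", "O"], repeat=len(words)), key=lambda cs: score(words, cs))
print(best)  # ('S', 'V', 'O') scores highest with these toy numbers

# Step 2: keep only facts extracted from enough independent pieces of text.
extracted = [("antibiotics", "kill", "bacteria")] * 5 + [("person", "gains", "weight")]
MIN_SUPPORT = 3
facts = [f for f, n in Counter(extracted).items() if n >= MIN_SUPPORT]
print(facts)  # only the frequently observed fact survives
```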
ReVerb is by the same group as Yago. It's more recent, it's lightweight, and it has only got fifteen million S, V, O triples so far. For example, it has things like "potatoes are also rich in Vitamin C", so the verbs are also verb phrases. That's somewhat less useful than Yago in this sense, but it's much more diverse in terms of the kinds of phrases that it actually includes. The way ReVerb works, just to give you a flavor of how such systems work, is that it first tags each piece of text using natural language processing classifiers to say which is a noun phrase, which is a verb phrase, which is a preposition, etc. Then it focuses only on the verb phrases and figures out what the nearby noun phrases are, using classifiers just as we have discussed. It prefers proper nouns, especially if they occur often in other facts, so that words like "Einstein" are preferred over "person" or "scientist". And wherever possible, it manages to extract more than one fact from a piece of text. So a text like "Mozart was born in Salzburg, but moved to Vienna in 1781" yields two facts: "Mozart moved to Vienna" in addition to "Mozart was born in Salzburg" (a much-simplified sketch of this verb-centered extraction appears at the end of this section).

Now, I admit we have gone through this section fairly quickly. The point I wanted to make is that the ability to extract facts is significantly enhanced by the fact that we have so many, many documents available on the web, using a combination of supervised learning and unsupervised extraction like ReVerb, which is unsupervised in the sense that one is not actually labeling any small fraction of text as S, V, O; we're just using lower-level classifiers for part-of-speech tagging. By using a combination of these supervised and unsupervised learning techniques, one can actually extract large volumes of facts and rules from text, and then use the reasoning techniques that we started with in this lecture this week to actually move towards the semantic web vision. Along the way, of course, we have to deal with the limits of logic, which are fundamental, as well as those limits which come from uncertainty.
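As promised above, here is a much-simplified sketch of the verb-centered extraction idea, assuming the part-of-speech tags already come from a lower-level classifier. The hand-written tags, the tiny copula filter, and the nearest-noun helper are all illustrative simplifications, not ReVerb's actual pipeline or code.

```python
# Simplified sketch: anchor on verbs, then look for the nearest nouns on
# either side. (word, tag) pairs are written by hand for illustration.
# Tags: NNP = proper noun, VBD/VBN = verb forms, IN = preposition,
# CC = conjunction, CD = number.
tagged = [("Mozart", "NNP"), ("was", "VBD"), ("born", "VBN"), ("in", "IN"),
          ("Salzburg", "NNP"), (",", ","), ("but", "CC"), ("moved", "VBD"),
          ("to", "IN"), ("Vienna", "NNP"), ("in", "IN"), ("1781", "CD")]

def nearest_noun(tagged, start, step):
    """Walk left (step=-1) or right (step=+1) from a verb to the nearest noun.
    A real system would prefer proper nouns and whole noun phrases."""
    i = start + step
    while 0 <= i < len(tagged):
        word, tag = tagged[i]
        if tag.startswith("NN"):
            return word
        i += step
    return None

triples = []
for i, (word, tag) in enumerate(tagged):
    # Skip bare copulas; a crude stand-in for ReVerb's verb-phrase rules.
    if tag.startswith("VB") and word not in ("was", "is", "are"):
        subj = nearest_noun(tagged, i, -1)
        obj = nearest_noun(tagged, i, +1)
        if subj and obj:
            triples.append((subj, word, obj))

print(triples)
# [('Mozart', 'born', 'Salzburg'), ('Salzburg', 'moved', 'Vienna')]
# Note: a real extractor would resolve the subject of "moved" back to
# "Mozart" across the conjunction; this naive nearest-noun sketch does not.
```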