Let's recall our Bayesian classifiers for figuring out whether a query or a comment had positive or negative sentiment. Our Bayesian classifier looks something like this: we wanted to figure out whether somebody was a buyer or a browser depending on the words that they used in their query.

The trouble is that the words "cheap" and "gift" may not be independent; that is, the probability that somebody uses the word "gift" depends on whether they use the word "cheap". That means the probability of using the word "gift", given that they use the word "cheap" and that they are a buyer, is not the same as the probability of "gift" on its own. As a result, we have a dependence between "cheap" and "gift", which is shown in this network by an arc between the two nodes. In the language of Bayesian networks, this means that to compute the posterior probability of a buyer given any of these occurrences, our expansion needs to include the probability of "gift" given C and B, or alternatively the probability of "cheap" given G and B, depending on the order in which we expand the joint probability of G, C, and B. We saw this in the last segment.

Another example is judging sentiment from comments. The comments "I don't like the course" and "I like the course, don't complain" both contain the words "don't" and "like", but they clearly express different sentiments. At first we might include "don't" in our list of features, along with other negatives like "not", but that doesn't quite do the job, because we also need to deal with the positional order in which these words occur. Just including negatives doesn't allow us to disambiguate between "I don't like the course" and "I like the course, don't complain".

So the graphical model that might help us here has a class variable S_i, the sentiment having observed the first i words of the comment, and a class variable S_{i+1}, the sentiment after observing i+1 words. For every position i we have the likelihood of X_{i+1} given X_i and the sentiment variable; in one place the appropriate sentiment is S_{i+1}, in another it is S_i, and the appropriate S is used for each likelihood. If we use these likelihoods and try to compute the most likely estimate for the probability of S being yes or no at the n-th position, we get what is called a hidden Markov model, which is another type of Bayesian network.
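To make this concrete, here is a minimal sketch in Python of the forward computation for such a sentiment model. It is not the lecture's actual model: the states, the transition table, and the bigram-style emission probabilities P(X_{i+1} | X_i, S) are all invented toy numbers, chosen only so that word order distinguishes the two example comments.

```python
# Toy forward pass for the sentiment HMM sketched above.
# Every probability below is an invented illustrative number,
# not an estimate from any real corpus.

STATES = ["pos", "neg"]              # hidden sentiment S_i at each position
PRIOR = {"pos": 0.5, "neg": 0.5}     # P(S_0)

# Transition probabilities P(S_{i+1} | S_i): sentiment tends to persist.
TRANS = {
    ("pos", "pos"): 0.9, ("pos", "neg"): 0.1,
    ("neg", "neg"): 0.9, ("neg", "pos"): 0.1,
}

# Emission probabilities P(X_{i+1} | X_i, S_{i+1}) for a few word bigrams;
# unseen bigrams fall back to a small constant.
EMIT = {
    ("don't", "like", "neg"): 0.30, ("don't", "like", "pos"): 0.02,
    ("like", "the", "neg"): 0.10,   ("like", "the", "pos"): 0.20,
    ("course", "don't", "pos"): 0.15, ("course", "don't", "neg"): 0.05,
}
FALLBACK = 0.01

def emit(prev_word, word, state):
    return EMIT.get((prev_word, word, state), FALLBACK)

def sentiment_posterior(words):
    """Forward recursion: returns P(S_n | words), normalised over the states."""
    alpha = {s: PRIOR[s] for s in STATES}
    for prev, cur in zip(words, words[1:]):
        alpha = {
            s2: sum(alpha[s1] * TRANS[(s1, s2)] * emit(prev, cur, s2)
                    for s1 in STATES)
            for s2 in STATES
        }
    total = sum(alpha.values())
    return {s: a / total for s, a in alpha.items()}

print(sentiment_posterior("i don't like the course".split()))
print(sentiment_posterior("i like the course don't complain".split()))
```

Under these toy numbers the first comment comes out mostly negative and the second mostly positive, precisely because the bigram emissions capture whether "don't" appears before or after "like".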
A hidden Markov model is especially useful when dealing with sequences, like sentences or spoken words: figuring out what text one is actually trying to say, extracting phonemes from speech, and other sequence problems. In such situations we may also need to accommodate holes, for example the probability of X_{i+k} given X_i. Depending on whatever lies in between, a "don't" before a "like" might indicate either a positive or a negative sentiment, whereas if the "don't" comes after the "like", a positive sentiment may be more likely. So this is one example of how Bayesian networks allow us to go beyond independent features while building classifiers.

There is another application of probabilistic networks, that is, of Bayesian networks and graphical models in general. We ask the question: how do facts like "Obama is the president of the USA" or "Manmohan Singh is the leader of India" arise from learning over large volumes of text? How do we get those individual facts and rules that are so critical to the semantic web vision?

Suppose we want to learn facts of the form subject-verb-object from text. We might use a Bayesian network of the following form, with subject, verb, and object variables, which processes triples in the text and might come up with something like "antibiotics kill bacteria". A single class variable is clearly not enough, since we need subject, verb, and object: in the language of the unified formulation of learning that we did last time, we have many y's in the (Y, X) data, when we looked at learning from the perspective of f(x), if you remember that. In addition, one needs to deal with positional order, so we can use a different graphical model, like hierarchical Markov models or other types of models that look like this.

We need to know the probability of X_i, that is, of any particular word occurring, given the previous word, the class of the previous word, and the class of the present word. For example, the probability that "kill", following "antibiotics", is a verb will depend on whether "antibiotics" is the subject. The situation is probably more apparent for the example "person gains weight", where the word "gains" can be a verb or a noun; whether or not "gains" is a verb depends on whether or not "person" is classified as a subject. Now remember, this is all supervised learning.
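As a rough illustration of how such a chain model assigns subject/verb/object tags, here is a small sketch that scores every possible tagging of a short sentence under P(c_i | c_{i-1}) and P(x_i | x_{i-1}, c_{i-1}, c_i). All the probability tables are invented for illustration; in practice they would be estimated from the labeled text described next.

```python
import itertools

# Toy scoring of subject/verb/object taggings for a short sentence under
# the chain model P(c_i | c_{i-1}) * P(x_i | x_{i-1}, c_{i-1}, c_i).
# All numbers are invented purely for illustration.

TAGS = ["S", "V", "O"]

# Tag-transition probabilities P(c_i | c_{i-1}); "^" marks the start.
TAG_TRANS = {
    ("^", "S"): 0.7, ("^", "V"): 0.2, ("^", "O"): 0.1,
    ("S", "V"): 0.8, ("S", "O"): 0.1, ("S", "S"): 0.1,
    ("V", "O"): 0.8, ("V", "V"): 0.1, ("V", "S"): 0.1,
    ("O", "O"): 0.4, ("O", "S"): 0.3, ("O", "V"): 0.3,
}

# Word likelihoods P(x_i | x_{i-1}, c_{i-1}, c_i) for a few tuples; anything
# unseen gets a small fallback.  Note that "gains" is far more likely to be
# a verb when the previous word has been tagged as the subject.
WORD_PROB = {
    ("^", "person", "^", "S"): 0.10,
    ("person", "gains", "S", "V"): 0.20,
    ("person", "gains", "S", "O"): 0.02,
    ("gains", "weight", "V", "O"): 0.15,
}
FALLBACK = 0.001

def score(words, tags):
    p, prev_word, prev_tag = 1.0, "^", "^"
    for word, tag in zip(words, tags):
        p *= TAG_TRANS[(prev_tag, tag)]
        p *= WORD_PROB.get((prev_word, word, prev_tag, tag), FALLBACK)
        prev_word, prev_tag = word, tag
    return p

words = "person gains weight".split()
best = max(itertools.product(TAGS, repeat=len(words)),
           key=lambda tags: score(words, tags))
print(best)   # ('S', 'V', 'O') under these toy numbers
```

In practice one would not enumerate all taggings but use dynamic programming (Viterbi-style decoding), with probabilities estimated from a labeled corpus, which is exactly the supervised step described next.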
So what we have is a whole bunch of text labeled as subject, verb, and object, from which we compute these likelihoods or probabilities. Using those, we figure out the a posteriori probabilities of subject, verb, and object for every word in the text, and wherever we have a high-scoring combination of S, V, and O, we can assert that the fact subject-verb-object has been learned from that piece of text. We also have to allow for holes, so that we deal with things like a subject at position i-k, a verb at i, and an object at i+p. This gets very complicated, especially since we can have a large number of words and a large number of possible ways of including holes, so Bayesian network models are not necessarily the most efficient; other models, such as conditional random fields or Markov networks, turn out to be more efficient in this kind of situation.

Once we have found many facts, like "Obama is president of the USA", and many instances of such facts, we cull from all these facts using support and confidence. That is, we disregard facts which we learn only once or twice and keep those facts which we have learned many times from different, independent pieces of text (see the small sketch below). This is how large volumes of facts are, in fact, learned from the many millions and billions of documents on the web.

This whole exercise of learning from the web is called information extraction, or open information extraction, and there are many examples of such efforts. One of the oldest is called Cyc; it is a semi-automated technique and has so far accumulated about two billion such facts. Yago is more recent and is the largest to date; it is run out of the Max Planck Institute in Germany and has uncovered more than six billion facts, all linked together as a graph. So now "Obama is president of the USA", "the USA lies in North America", and so on are all linked together. "Albert Einstein was born in Ulm", for example, is a fact that Watson could have learned from a database like Yago. Watson actually uses facts culled from the web internally; it doesn't use Yago or Cyc, but it uses many, many web pages, textual documents, and rules, and this is another example of open information extraction.
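Here is the small culling sketch referred to above: it simply counts how many independent extractions support each candidate triple and keeps only the well-supported ones. The triples and the support threshold are, of course, just illustrative.

```python
from collections import Counter

# Minimal sketch of the support-based culling step described above.
# `extractions` stands in for the (subject, verb, object) triples pulled
# from many independent documents; the triples and threshold are invented.
extractions = [
    ("Obama", "is president of", "USA"),
    ("Obama", "is president of", "USA"),
    ("Obama", "is president of", "USA"),
    ("antibiotics", "kill", "bacteria"),
    ("antibiotics", "kill", "bacteria"),
    ("Mozart", "moved to", "Vienna"),
]

MIN_SUPPORT = 3   # disregard facts seen only once or twice

counts = Counter(extractions)
kept = {fact: n for fact, n in counts.items() if n >= MIN_SUPPORT}
print(kept)       # only ('Obama', 'is president of', 'USA') survives here
```

A confidence-style filter works the same way, for example keeping a (subject, verb) pair's most frequent object only if it accounts for a large enough fraction of that pair's extractions.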
Reverb, by the same group as Yago, is more recent and lightweight; it has only got fifteen million S-V-O triples so far. For example, it has things like "potatoes are also rich in Vitamin C", so the verbs are really verb phrases. That is in some sense less useful than Yago, but it is much more diverse in terms of the kinds of phrases it actually includes.

The way REVERB works, just to give you a flavor of how such systems work, is that it first tags each piece of text using natural language processing classifiers to say which is a noun phrase, which is a verb phrase, which is a preposition, and so on. It then focuses only on the verb phrases and figures out which noun phrases are nearby, using classifiers just as we have discussed. It prefers proper nouns, especially ones that occur often in other facts, so that words like "Einstein" are preferred over "person" or "scientist". And wherever possible, it manages to extract more than one fact from a piece of text: a sentence like "Mozart was born in Salzburg, but moved to Vienna in 1781" yields two facts, "Mozart moved to Vienna" in addition to "Mozart was born in Salzburg". (A toy sketch of this kind of pipeline appears at the end of this section.)

Now, I admit we have gone through this section fairly quickly. The point I wanted to make is that the ability to extract facts is significantly enhanced by the fact that we have so many documents available on the web. We use a combination of supervised learning and unsupervised extraction like REVERB, which is unsupervised in the sense that one is not actually labeling any fraction of the text as S, V, O; we are just using lower-level classifiers for part-of-speech tagging. By using a combination of these supervised and unsupervised learning techniques, one can actually extract large volumes of facts and rules from text, and then use the reasoning techniques that we started with in this lecture this week to move towards the semantic web vision. Along the way, of course, we have to deal with the limits of logic, which are fundamental, as well as those limits which come from uncertainty.
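Finally, here is the toy sketch of a ReVerb-style extraction step promised earlier. It is not ReVerb's actual algorithm or code: a tiny hand-written part-of-speech lexicon replaces the real tagging classifiers, and the rules for picking nearby noun phrases are crude stand-ins for ReVerb's heuristics, kept only to show the flow from tags, to verb phrases, to nearby arguments, to triples.

```python
import re

# Very rough sketch of a ReVerb-style extraction step.  The POS lexicon and
# the argument-picking rules below are invented simplifications.

POS = {
    "mozart": "NNP", "salzburg": "NNP", "vienna": "NNP",
    "was": "VBD", "born": "VBN", "moved": "VBD",
    "in": "IN", "to": "TO", "but": "CC", "1781": "CD",
}

def tag(sentence):
    tokens = re.findall(r"\w+", sentence.lower())
    return [(t, POS.get(t, "NN")) for t in tokens]

def extract_triples(sentence):
    """Anchor on verb phrases and pair each with nearby noun phrases."""
    tagged = tag(sentence)
    triples, i = [], 0
    while i < len(tagged):
        if not tagged[i][1].startswith("VB"):
            i += 1
            continue
        # grow the verb phrase rightwards over verbs and prepositions
        j = i
        while j + 1 < len(tagged) and tagged[j + 1][1] in ("VBD", "VBN", "IN", "TO"):
            j += 1
        verb_phrase = " ".join(t for t, _ in tagged[i:j + 1])
        # subject: nearest proper noun to the left that is not itself the
        # object of a preposition (a crude preference for proper nouns)
        subj = next((t for k, (t, p) in reversed(list(enumerate(tagged[:i])))
                     if p == "NNP" and (k == 0 or tagged[k - 1][1] not in ("IN", "TO"))),
                    None)
        # object: nearest noun-like token to the right of the verb phrase
        obj = next((t for t, p in tagged[j + 1:] if p in ("NNP", "NN", "CD")), None)
        if subj and obj:
            triples.append((subj, verb_phrase, obj))
        i = j + 1
    return triples

print(extract_triples("Mozart was born in Salzburg, but moved to Vienna in 1781"))
# [('mozart', 'was born in', 'salzburg'), ('mozart', 'moved to', 'vienna')]
```

A real open-information-extraction system would of course use trained taggers and chunkers, plus the support and confidence filtering discussed earlier, but the overall shape of the pipeline is the same.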