Let's return to our Naive Bayes classifier now, with rain as the class, either yes or no, and the grass being wet as the only feature that we're going to measure.

Suppose we have the conditional probability that the grass is wet given that there is rain. Obviously, if there is rain, the chance that the grass is wet is very high, and the chance that it is not wet is small but still there. And if there's no rain, there is a very low chance that the grass is wet, and of course a very high chance that it is not wet. Notice again that this is a conditional probability table, so the values for a particular value of rain, say rain being yes, have to add up to one.

The prior, or a priori, probability that it rains, regardless of whether the grass is wet or not, is that twenty percent of the time it rains in this locality, and 80% of the time it doesn't.

Now, if we just have this one feature, Bayes' rule gives us the fact that the joint probability of R and W can be factored in two ways: as the probability of R given W times the probability of W, or as the conditional probability of W given R times the prior probability of R.

Now, suppose we are given some evidence: we actually observe that the grass is wet, that is, W = yes. We can condition this joint probability by restricting it to the case where W is yes. So on one side we get the probability of R given W = yes times the probability that W = yes, and on the other side the probability of W = yes given R, which is the likelihood, times the prior, which doesn't have anything to do with W, so we don't have to condition it.

The posterior probability of R given W = yes can now be written as the right-hand side divided by the probability of the evidence. The inverse of the probability that W = yes we'll just write as sigma; we'll use sigma wherever we need to refer to one over the probability of the evidence that we're observing. Pardon the use of sigma for this purpose: earlier we used sigma as the selection operator, but from now on we will use sigma only as the inverse of the evidence probability.

Study this carefully: what we want is the a posteriori probability of R given the evidence, which is simply proportional to the likelihood multiplied by the prior.
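To keep the notation straight, here is the derivation just described, written out as formulas (a sketch of what is said verbally above):

\[
P(R, W) \;=\; P(R \mid W)\,P(W) \;=\; P(W \mid R)\,P(R)
\]

Fixing the evidence W = yes and dividing through by P(W = yes):

\[
P(R \mid W{=}\text{yes}) \;=\; \sigma \, P(W{=}\text{yes} \mid R)\,P(R),
\qquad \sigma \;=\; \frac{1}{P(W{=}\text{yes})}
\]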
In SQL, we can write this as: select the sum of P times P from these two tables, where W equals yes and the R values are equal, because we have to make sure that we're joining these tables on their common attribute R, and finally grouping by R. This is what we've stated before about how to multiply two probability tables.

So the result is what we get by restricting our cases to the rows where W is yes, multiplying the P values, and adding them up. In fact there is nothing to add up here, since only two rows remain, each with a distinct value of R. So we simply multiply 0.9 by 0.2 to get the first row, R = yes, and 0.2 by 0.8 to get the second row. This is the product of these two potentials, or probability tables.

Now, we need to normalize so that the total probability of R is one; the sum has to be one. So essentially, we take the probability of R being yes as 0.18 divided by the sum of these two values, 0.18 plus 0.16, and we get 0.53, or 53%. This is the chance, or our belief, that it's raining once we see the grass being wet.

The reason it's so small, when one might expect it to be a little higher, is that there are cases where the grass can be wet without there being any rain; in fact, twenty percent of the time the grass is wet when there's no rain. In addition, it hardly ever rains. Combining these two things together gives us only a 53% chance of saying that it's actually raining if the grass is wet.

Now, let's see what happens when we have more than one feature, as we normally do in a Bayesian classifier. We're going to have another feature called thunder, which will say whether or not we are hearing thunder. And let's, for the moment, assume that we don't really know whether we heard thunder at all. But the probability of hearing thunder, given that it's raining, is 0.8, of not hearing it is 0.2, and so on. So we have a conditional probability table even for this variable thunder, but in this case let us assume that we haven't actually observed thunder. We could be asking our neighbor over the phone whether the grass is wet, and then trying to conclude whether or not it's raining there. But we didn't ask our neighbor whether they're hearing thunder.
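Before we bring thunder in, here is a concrete sketch of the two-table query described above. The table and column names (PW_R, PR, W, R, P) are assumptions for illustration; the lecture just calls them "these two tables":

```sql
-- P(W | R): conditional probability table for wet grass given rain
CREATE TABLE PW_R (W TEXT, R TEXT, P REAL);
-- P(R): prior probability of rain
CREATE TABLE PR (R TEXT, P REAL);

INSERT INTO PW_R VALUES ('yes','yes',0.9), ('no','yes',0.1),
                        ('yes','no', 0.2), ('no','no', 0.8);
INSERT INTO PR   VALUES ('yes',0.2), ('no',0.8);

-- Multiply the two tables, restricted to the evidence W = 'yes',
-- joining on the common attribute R and grouping by R:
SELECT t1.R, SUM(t1.P * t2.P) AS P
FROM PW_R t1, PR t2
WHERE t1.W = 'yes' AND t1.R = t2.R
GROUP BY t1.R;
-- R = 'yes': 0.9 * 0.2 = 0.18;  R = 'no': 0.2 * 0.8 = 0.16
-- Normalizing: 0.18 / (0.18 + 0.16) = 0.53, the 53% computed above.
```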
So now, the probability of R and T given W = yes, because that's all we know, is, by the same equation that we had earlier, the probability of W = yes given R, times the probability of T given R, times the probability of R. This is our Bayesian formula. Again, we have the probability of the evidence over here, but this time we don't have anything that says T equals yes or no. So we need to sum out T in this product if we want to know only the probability of rain.

The way we sum that out is that we, again, do SQL. We select R and the sum of the products of all three of these P columns from these three tables, where W is yes and all the common variables, which here means only R, are equal; the only common variable between these tables is R. And then we group by R. This effectively sums out T, because we're only selecting R and summing up the values for the different values of T.

Now, you can verify that you could do this by first joining T1 and T3, that is, the P(W given R) table and the P(R) table, just like we did earlier, and then joining the result of that with the P(T given R) table. So we just take the result we had earlier and join it with this new table. But notice that we get the same result: for R = yes, we multiply the entry by 0.8 for T = yes and by 0.2 for T = no, and when we sum them up, we get the same value back, because 0.8 plus 0.2 is one. Similarly for R = no with 0.1 and 0.9, so it doesn't change anything. This is to be expected, since there is no new evidence compared to earlier; just by including something new in our diagram, we couldn't expect to change our belief in R.

Another important point: if you remember, in one of the homeworks we asked whether ignoring some of the features changes our belief. Well, it does change our belief compared to the case where we had actually observed the features. But it's, in some sense, equally correct, because suppose we simply didn't observe the feature. This is actually the same as a Bayesian classifier with two features, of which only one is being observed. So you could have millions of features and observe only one or two, and you'd still get the same result just by putting them all in your classifier. The summing-out process makes sure that this always works and you don't get any wrong results.
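The summing-out query might look like this in concrete SQL, reusing the sketch tables from before and adding an assumed PT_R table for P(T given R):

```sql
-- P(T | R): conditional probability table for thunder given rain,
-- laid out by analogy with PW_R (an assumption for illustration)
CREATE TABLE PT_R (T TEXT, R TEXT, P REAL);
INSERT INTO PT_R VALUES ('yes','yes',0.8), ('no','yes',0.2),
                        ('yes','no', 0.1), ('no','no', 0.9);

-- Join all three tables on the common attribute R. Grouping by R
-- alone sums out T, since both T rows for each R land in one group:
SELECT t1.R, SUM(t1.P * t2.P * t3.P) AS P
FROM PW_R t1, PT_R t2, PR t3
WHERE t1.W = 'yes' AND t1.R = t2.R AND t2.R = t3.R
GROUP BY t1.R;
-- R = 'yes': 0.18 * (0.8 + 0.2) = 0.18;  R = 'no': 0.16 * (0.1 + 0.9) = 0.16
-- Unchanged, as expected: no new evidence, so still 53% after normalizing.
```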
Go back to that example where we asked whether partial evidence changes things: in fact, it doesn't. It is simply as if the feature didn't exist. Of course, if we are observing the feature, it's better to include it. But by ignoring it, we're not saying that we had a wrong result; it's just that we didn't have that feature to measure, that's all.

Now, let's see what happens if we actually do have evidence about T, say T = yes. So now we're looking for the probability of rain given that W and T are both yes. In this case, we restrict our multiplication by the P(T given R) table to the rows where T = yes only. We use the same SQL, only with another condition in the select statement that restricts us to only those rows of this table, again joining with the prior join that we had of T1 and T3.

But with the restriction that T = yes, we now get a different result, because we multiply the first row by 0.8 and the second row by 0.1, and the other rows don't count. So we get a different result, and normalizing gives us the probability of rain being yes, given the evidence, as 90%. In a sense, our belief has undergone a revision from the earlier value of 53% to 90%. New evidence has changed our belief.

In classical logic, once you've asserted that, say, rain occurs or doesn't occur, you can't change that belief. But in probabilistic reasoning, belief can be revised. This is very important.
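For completeness, here is a sketch of the evidence-restricted query just described, using the same assumed tables as before, with the extra condition on T:

```sql
-- Same three-way join, but the thunder table is now restricted to
-- the observed evidence T = 'yes', so nothing is summed out:
SELECT t1.R, SUM(t1.P * t2.P * t3.P) AS P
FROM PW_R t1, PT_R t2, PR t3
WHERE t1.W = 'yes' AND t2.T = 'yes'
  AND t1.R = t2.R AND t2.R = t3.R
GROUP BY t1.R;
-- R = 'yes': 0.18 * 0.8 = 0.144;  R = 'no': 0.16 * 0.1 = 0.016
-- Normalizing: 0.144 / (0.144 + 0.016) = 0.90, the 90% belief above.
```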