So, in order to compute these conditional probabilities when there are a large number of possible words or features, we need to do a little more work. Let's look at the simple case of just one word, say "red". There are two possibilities: red can be present or absent. Suppose red is present in R of our cases; out of these, I cases have a B and the rest don't. The total number of cases that are B situations is, of course, K, as before.

So let's see what the simple conditional probabilities are in this case. The probability of a B given that there is an R is just I over R. On the other hand, the probability of an R overall is just R over the total number N. And the probability that both R occurs and there is a B is I over N. Notice this is the joint probability, so you divide by the total number of instances N, rather than by either of the conditional counts K or R.

Bayes' rule is actually just simple arithmetic. In particular, if you write I over N as I over R times R over N, let's see what you get. Well, we already know what I over R is: it's just the conditional probability of a B given R. And R over N is the probability of R itself. So simple arithmetic tells us that the joint probability of R and B is the product of the conditional probability and the a priori probability. This is just Bayes' rule: the probability of B and R is the conditional probability of B given R times P of R, which is also the same thing as the probability of R given B times the probability of B. We can see that by rewriting I over N, not by introducing an R, but by introducing a K: you get I over K, which is just the probability of R out of all those that are B, times K over N, which is just the probability of B. Bayes' rule, which some of you may or may not remember, turns out to be just simple arithmetic. As we shall soon see, Bayes' rule is critical to machine learning because it allows us to compute any of those many, many joint probabilities even if there is no data for a particular combination.

Before we can do machine learning using Bayesian techniques, we need one more important concept, and that is independence. Think about two words like "red" and "cheap". As before, we have red equal to yes for R queries, cheap equal to yes for C cases, and in I cases both keywords are present. The probability of red occurring is clearly R over N, and the probability of cheap occurring is C over N, as before. Similarly, the conditional probability of red given that cheap occurs can be computed. Independence says that the probability of red does not depend on whether or not the word cheap is already present in the query; in other words, the probability of R should be the same as the probability of R given C. Similarly, the probability that cheap occurs should not depend on whether or not red occurs. In such situations, these two features are independent. Of course, that might not necessarily always be the case. For example, somebody searching for "big data" might actually search for "MapReduce" at the same time, rather than something else, like "red" or "flower".
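To make the counting argument concrete, here is a minimal Python sketch of the arithmetic behind Bayes' rule. The particular counts N, K, R, and I are made up purely for illustration (they are not from the lecture); the point is only that the two factorizations of the joint count I over N agree.

```python
# Illustrative counts (assumptions, not lecture data):
# N queries in total, K of them end in a buy (B), R contain the word "red",
# and I contain "red" AND end in a buy.
N, K, R, I = 100, 20, 30, 12

p_B_given_R = I / R   # P(B | R): of the R "red" queries, I are buys
p_R = R / N           # P(R): fraction of all queries containing "red"
p_R_given_B = I / K   # P(R | B): of the K buys, I contain "red"
p_B = K / N           # P(B): fraction of all queries that are buys
p_R_and_B = I / N     # joint probability P(R, B)

# Bayes' rule is "just arithmetic": both factorizations give the same joint.
assert abs(p_R_and_B - p_B_given_R * p_R) < 1e-12
assert abs(p_R_and_B - p_R_given_B * p_B) < 1e-12

# Which also gives the usual statement of Bayes' rule:
# P(B | R) = P(R | B) * P(B) / P(R)
assert abs(p_B_given_R - p_R_given_B * p_B / p_R) < 1e-12

print(p_R_and_B, p_B_given_R * p_R, p_R_given_B * p_B)  # all 0.12
```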
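And here is a similarly small sketch of the independence check. The counts R, C, and I_rc are again assumptions, chosen so that "red" and "cheap" come out independent; changing the overlap count I_rc would break the equalities.

```python
# Illustrative counts (assumptions): out of N queries, R contain "red",
# C contain "cheap", and I_rc contain both words.
N, R, C, I_rc = 100, 30, 20, 6

p_red = R / N                  # P(red)
p_cheap = C / N                # P(cheap)
p_red_given_cheap = I_rc / C   # P(red | cheap)
p_cheap_given_red = I_rc / R   # P(cheap | red)

# Independence: knowing that "cheap" is in the query tells us nothing
# about whether "red" is, and vice versa.
print(p_red, p_red_given_cheap)      # 0.3 vs 0.3 -> same
print(p_cheap, p_cheap_given_red)    # 0.2 vs 0.2 -> same

# Equivalently, the joint probability factors into the product of marginals.
assert abs(I_rc / N - p_red * p_cheap) < 1e-12
```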