The simplest and most fundamental technique of machine learning is called the Bayesian classifier. It is also called a Naive Bayes classifier, because it makes the assumption that all the features one uses to learn from historical data are independent of one another, given a particular value of the behavior one is trying to predict. So, for example, suppose one is trying to predict whether a given user is shopping or just surfing, using the query words that they type. We assume that the words themselves all occur independently, each just as likely to appear regardless of what other words are used in the query, once we know whether the user is a buyer or a browser. This is not necessarily a valid assumption in most cases, as we argued a while back, but that is what is assumed, and that is why these are called Naive Bayes classifiers.

Anyway, let's try to figure this out from first principles. Consider the probability of B, that is, whether some user is a buyer or just a browser, given the combination of R and C (the words 'red' and 'cheap') that they use, times the probability of R and C itself. By Bayes' rule, this is nothing but the probability of R and C given B, times the probability of B. This is easy to see just by thinking of (R, C) as one random variable and applying Bayes' rule. Next, we further expand the first term, the probability of R and C given B. Applying the same rule again while carrying B along in the conditional, this is the probability of R given C and B (the B is there simply because it was already in the conditional), times the probability of C given B (which again keeps the B for the same reason). Now comes the independence assumption: the probability of R given C and B, or given anything else alongside B, does not depend on C at all, so it is the same as the probability of R given B alone. Because of this independence we can separately compute the terms P(R | B = yes) and P(C | B = yes), and then simply multiply them by the a priori, or overall, probability of B = yes. We can do the same for B = no.

So, given some values of R and C, that is, whether the word 'red' occurs and whether the word 'cheap' occurs, we compute the probability of R taking its observed value given B = yes, the probability of C taking its observed value given B = yes, and the a priori probability of B = yes, and divide by the same expression for the other possibility, B = no. Since each such expression is proportional to the conditional probability we are looking for, that is, whether the user is more likely to be a buyer or a browser given the query words they used, we simply compute this ratio for the two possibilities, buyer versus browser, and if the ratio is greater than one (or, in general, greater than some threshold) we choose B = yes as the most likely estimate; otherwise we decide B = no.

So, congratulations: we have just derived the most fundamental classification mechanism in machine learning, the Naive Bayes classifier, from first principles. And as you can see, the derivation is simple arithmetic and basic probability. Given any combination of words and the past history, say the four features we had earlier (red, flower, gift, and cheap), or in general N features for arbitrarily long queries, each taking particular values yes, no, yes, no, and so on, we can merely compute the same ratio as before.
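To keep the steps straight, the same derivation can be written out in symbols as follows; this is just the argument above, with R and C denoting occurrence of the words 'red' and 'cheap' and B denoting buyer versus browser.

```latex
% Bayes' rule, treating (R, C) as a single random variable:
%   P(B | R, C) P(R, C) = P(R, C | B) P(B)
\begin{align*}
P(B \mid R, C) &\propto P(R, C \mid B)\,P(B) \\
  &= P(R \mid C, B)\,P(C \mid B)\,P(B)
       && \text{(chain rule)} \\
  &= P(R \mid B)\,P(C \mid B)\,P(B)
       && \text{(naive assumption: } P(R \mid C, B) = P(R \mid B)\text{)}
\end{align*}

% Decision rule: compare the two possibilities and threshold the ratio.
\[
\frac{P(R \mid B{=}\mathrm{yes})\,P(C \mid B{=}\mathrm{yes})\,P(B{=}\mathrm{yes})}
     {P(R \mid B{=}\mathrm{no})\,P(C \mid B{=}\mathrm{no})\,P(B{=}\mathrm{no})}
> 1
\;\Longrightarrow\; \text{predict } B = \mathrm{yes},
\quad \text{otherwise } B = \mathrm{no}.
\]
```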
The only difference is that now we have N terms in the product instead of just two and, at the end, the a priori probabilities of the two different types of behavior that we are trying to predict. This ratio is called a likelihood ratio, because each term in the product is a likelihood in the sense that if x1 = yes, then the term is the probability that x1 = yes when people are buying, versus the probability that x1 = yes when people are just browsing. We do that for all the possible words; many of them will not appear in the query, so their xi values will be no, but we still have to take all the possible words into account. We then choose yes if this likelihood ratio is greater than a threshold, and no otherwise.

We also normally take logarithms to turn the multiplications into additions, which is why you will frequently hear the term log likelihood. The reason is that with so many probabilities, all less than one, being multiplied together, you would quickly run out of decimal places and end up with all zeros. By taking logarithms you reduce the chance of such underflow occurring and make the computation numerically stable.

Note in particular that you really need to consider all possible features, even those that do not occur in the query, as the sketch at the end of this section illustrates. So if you have a million words and your query consists of four words, you need to multiply the probabilities for a million words. For the four words that do appear, you use the probability that the word occurs for buyers in the numerator and the probability that it occurs for browsers in the denominator. For the rest, you have to include the probability that the word does not occur for buyers, and likewise, in the denominator, the probability that it does not occur for browsers. This makes it an extremely expensive method, but nevertheless quite a useful one. We will figure out how to deal with the expense using parallel computing next week. But for the moment, let's take a look at an example of computing sentiment using machine learning and the Naive Bayes classifier.
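Before turning to that example, here is a minimal sketch of the computation just described, assuming per-word occurrence probabilities have already been estimated from historical data; the vocabulary, probability values, and names below are made-up placeholders for illustration, not part of the lecture.

```python
import math

# Hypothetical per-word occurrence probabilities, estimated from historical data.
# Each entry is (P(word occurs | buyer), P(word occurs | browser)); the values
# here are placeholders for illustration only.
word_probs = {
    "red":    (0.30, 0.10),
    "flower": (0.20, 0.15),
    "gift":   (0.25, 0.05),
    "cheap":  (0.40, 0.20),
    # ... in practice, one entry per word in the vocabulary (possibly millions)
}

# Hypothetical a priori class probabilities P(buyer) and P(browser).
p_buyer, p_browser = 0.2, 0.8


def log_likelihood_ratio(query_words):
    """Return log[P(features | buyer) P(buyer)] - log[P(features | browser) P(browser)].

    Every word in the vocabulary contributes one factor: P(occurs | class) if the
    word is in the query, and P(does not occur | class) = 1 - P(occurs | class)
    otherwise. Logarithms turn the long product into a sum and avoid underflow.
    """
    score = math.log(p_buyer) - math.log(p_browser)
    query = set(query_words)
    for word, (p_buy, p_browse) in word_probs.items():
        if word in query:
            score += math.log(p_buy) - math.log(p_browse)
        else:
            score += math.log(1 - p_buy) - math.log(1 - p_browse)
    return score


# Classify a query: predict "buyer" if the log-likelihood ratio exceeds the
# threshold (log 1 = 0 corresponds to the plain ratio-greater-than-one rule).
query = ["cheap", "red"]
prediction = "buyer" if log_likelihood_ratio(query) > 0 else "browser"
print(prediction)
```

In practice the per-word probabilities would be estimated by counting occurrences in historical buyer and browser queries, which is the part of the computation that becomes expensive at scale.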