The simplest and most fundamental technique of machine learning is called the Bayesian classifier. It is also called a Naive Bayes classifier, because it makes the assumption that all the features one uses to learn from historical data are independent of one another, given a particular value of the behavior one is trying to predict. So, for example, suppose one is trying to predict whether a given user is shopping or just surfing, using the query words that they type. We assume that the words themselves all occur independently, each just as likely to appear regardless of what other words are used in the query, once we know whether the user is a buyer or a browser. This is not necessarily a valid assumption in most cases, as we argued a while back, but that is what is assumed, and that is why these are called Naive Bayes classifiers.

Anyway, let's try to figure this out from first principles. Consider the probability of B, that is, whether some user is a buyer or just a browser, given the combination of R and C (the words 'red' and 'cheap') that they use, times the probability of R and C itself. By Bayes' rule, this is nothing but the probability of R and C given B, times the probability of B. This is easy to see just by thinking of (R, C) as one random variable and applying Bayes' rule. Next, we further expand the first term, the probability of R and C given B. Applying the same rule again while carrying B along in the conditional, this is the probability of R given C and B (the B is there simply because it was already in the conditional), times the probability of C given B (which again keeps the B for the same reason). Now comes the independence assumption: the probability of R given C and B, or given anything else alongside B, does not depend on C at all, so it is the same as the probability of R given B alone. Because of this independence we can separately compute the terms P(R | B = yes) and P(C | B = yes), and then simply multiply them by the a priori, or overall, probability of B = yes. We can do the same for B = no.

So, given some values of R and C, that is, whether the word 'red' occurs and whether the word 'cheap' occurs, we compute the probability of R taking its observed value given B = yes, the probability of C taking its observed value given B = yes, and the a priori probability of B = yes, and divide by the same expression for the other possibility, B = no. Since each such expression is proportional to the conditional probability we are looking for, that is, whether the user is more likely to be a buyer or a browser given the query words they used, we simply compute this ratio for the two possibilities, buyer versus browser, and if the ratio is greater than one (or, in general, greater than some threshold) we choose B = yes as the most likely estimate; otherwise we decide B = no.

So, congratulations: we have just derived the most fundamental classification mechanism in machine learning, the Naive Bayes classifier, from first principles. And as you can see, the derivation is simple arithmetic and basic probability. Given any combination of words and the past history, say the four features we had earlier (red, flower, gift, and cheap), or in general N features for arbitrarily long queries, each taking particular values yes, no, yes, no, and so on, we can merely compute the same ratio as before.
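To keep the steps straight, the same derivation can be written out in symbols as follows; this is just the argument above, with R and C denoting occurrence of the words 'red' and 'cheap' and B denoting buyer versus browser.

```latex
% Bayes' rule, treating (R, C) as a single random variable:
%   P(B | R, C) P(R, C) = P(R, C | B) P(B)
\begin{align*}
P(B \mid R, C) &\propto P(R, C \mid B)\,P(B) \\
  &= P(R \mid C, B)\,P(C \mid B)\,P(B)
       && \text{(chain rule)} \\
  &= P(R \mid B)\,P(C \mid B)\,P(B)
       && \text{(naive assumption: } P(R \mid C, B) = P(R \mid B)\text{)}
\end{align*}

% Decision rule: compare the two possibilities and threshold the ratio.
\[
\frac{P(R \mid B{=}\mathrm{yes})\,P(C \mid B{=}\mathrm{yes})\,P(B{=}\mathrm{yes})}
     {P(R \mid B{=}\mathrm{no})\,P(C \mid B{=}\mathrm{no})\,P(B{=}\mathrm{no})}
> 1
\;\Longrightarrow\; \text{predict } B = \mathrm{yes},
\quad \text{otherwise } B = \mathrm{no}.
\]
```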
The only difference is that now we have N terms in the product instead of just two and, at the end, the a priori probabilities of the two different types of behavior that we are trying to predict. This ratio is called a likelihood ratio, because each term in the product is a likelihood in the sense that if x1 = yes, then the term is the probability that x1 = yes when people are buying, versus the probability that x1 = yes when people are just browsing. We do that for all the possible words; many of them will not appear in the query, so their xi values will be no, but we still have to take all the possible words into account. We then choose yes if this likelihood ratio is greater than a threshold, and no otherwise.

We also normally take logarithms to turn the multiplications into additions, which is why you will frequently hear the term log likelihood. The reason is that with so many probabilities, all less than one, being multiplied together, you would quickly run out of decimal places and end up with all zeros. By taking logarithms you reduce the chance of such underflow occurring and make the computation numerically stable.

Note in particular that you really need to consider all possible features, even those that do not occur in the query, as the sketch at the end of this section illustrates. So if you have a million words and your query consists of four words, you need to multiply the probabilities for a million words. For the four words that do appear, you use the probability that the word occurs for buyers in the numerator and the probability that it occurs for browsers in the denominator. For the rest, you have to include the probability that the word does not occur for buyers, and likewise, in the denominator, the probability that it does not occur for browsers. This makes it an extremely expensive method, but nevertheless quite a useful one. We will figure out how to deal with the expense using parallel computing next week. But for the moment, let's take a look at an example of computing sentiment using machine learning and the Naive Bayes classifier.
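Before turning to that example, here is a minimal sketch of the computation just described, assuming per-word occurrence probabilities have already been estimated from historical data; the vocabulary, probability values, and names below are made-up placeholders for illustration, not part of the lecture.

```python
import math

# Hypothetical per-word occurrence probabilities, estimated from historical data.
# Each entry is (P(word occurs | buyer), P(word occurs | browser)); the values
# here are placeholders for illustration only.
word_probs = {
    "red":    (0.30, 0.10),
    "flower": (0.20, 0.15),
    "gift":   (0.25, 0.05),
    "cheap":  (0.40, 0.20),
    # ... in practice, one entry per word in the vocabulary (possibly millions)
}

# Hypothetical a priori class probabilities P(buyer) and P(browser).
p_buyer, p_browser = 0.2, 0.8


def log_likelihood_ratio(query_words):
    """Return log[P(features | buyer) P(buyer)] - log[P(features | browser) P(browser)].

    Every word in the vocabulary contributes one factor: P(occurs | class) if the
    word is in the query, and P(does not occur | class) = 1 - P(occurs | class)
    otherwise. Logarithms turn the long product into a sum and avoid underflow.
    """
    score = math.log(p_buyer) - math.log(p_browser)
    query = set(query_words)
    for word, (p_buy, p_browse) in word_probs.items():
        if word in query:
            score += math.log(p_buy) - math.log(p_browse)
        else:
            score += math.log(1 - p_buy) - math.log(1 - p_browse)
    return score


# Classify a query: predict "buyer" if the log-likelihood ratio exceeds the
# threshold (log 1 = 0 corresponds to the plain ratio-greater-than-one rule).
query = ["cheap", "red"]
prediction = "buyer" if log_likelihood_ratio(query) > 0 else "browser"
print(prediction)
```

In practice the per-word probabilities would be estimated by counting occurrences in historical buyer and browser queries, which is the part of the computation that becomes expensive at scale.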