The most simple and fundamental technique of machine learning is called the Bayesian classifier. It is also called a Naive Bayesian classifier, because it makes the assumption that all the features one uses to learn from historical data are independent for a particular choice of the behavior one is trying to learn. So, for example, if one is trying to predict whether a given user is shopping or just surfing, using the query words that they use to search, we assume that the words themselves are all independent and equally likely to occur regardless of what other words are used in the query. This is not necessarily a valid assumption in most cases, as we argued a while back, but that is what is assumed, and that is why it is called a Naive Bayes classifier.

Anyway, let's try to figure this out from first principles. The probability of B, that is, whether some user is a buyer or just a browser, given the combination of R and C that they use, times the probability of R and C itself, is, by Bayes' rule, nothing but the probability of R and C given B, times the probability of B. This is easy to see just by thinking of R, C as one random variable and applying Bayes' rule.

Next, we further expand the first term over here, again using Bayes' rule and forgetting about B for the time being, noting that the probability of R and C given B is the probability of R given C and B (the B is there just because it was there earlier), times the probability of C given B (which again is there because it was there above). So, by applying Bayes' rule once more, we get this expansion.

Now comes the independence assumption. Because R and C are independent, the probability of R given C and B (or C and anything else; it doesn't matter) is simply the probability of R given whatever else was on the conditioning side.

Because of this independence we can essentially compute these terms separately: the probability of R given B equal to yes, and the probability of C given B equal to yes. We then simply multiply them by the a priori, or overall, probability of B equal to yes. We can do the same for B equal to no.

So, given some values of R and C, that is, whether or not the word "red" occurs and whether or not the word "cheap" occurs, we compute the probability of R being there given that B is yes, and the probability of C being there (or not there) given B equal to yes.
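To keep track of the steps just described, here they are written out explicitly, with R, C, and B denoting the "red", "cheap", and buyer-or-browser variables from the lecture (the equations are reconstructed from the narration rather than quoted from a slide):

```latex
\begin{align*}
P(B \mid R, C)\, P(R, C) &= P(R, C \mid B)\, P(B)
  && \text{Bayes' rule, treating $(R, C)$ as one random variable} \\
P(R, C \mid B) &= P(R \mid C, B)\, P(C \mid B)
  && \text{expanding the first term, again by the same rule} \\
P(R \mid C, B) &= P(R \mid B)
  && \text{the ``naive'' independence assumption} \\
\Rightarrow\; P(B{=}\mathrm{yes} \mid R, C) &\propto
  P(R \mid B{=}\mathrm{yes})\, P(C \mid B{=}\mathrm{yes})\, P(B{=}\mathrm{yes})
\end{align*}
```

and the analogous expression with B = no throughout.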
We multiply these by the a priori probability of B equal to yes, and divide by the same expression for the other possibility, that B equals no. Now, since this expression is proportional to the conditional probability that we are looking for, that is, whether the user is more likely to be a buyer or a browser given the query words that they used, we simply compute the expression for both possibilities, buyer and browser, and take their ratio. If the ratio is greater than one, or in general greater than some threshold, we choose B equal to yes as the most likely estimate. Otherwise, we decide that B equals no.

So, congratulations: we have just derived the most fundamental classification mechanism in machine learning, the Naive Bayes classifier, from first principles. And, as you can see, the derivation is simple arithmetic and basic probability.

Given any combination of words and the past history, say the four features that we had earlier, "red", "flower", "gift", and "cheap", and in general N features for arbitrarily long queries, taking particular values yes, no, yes, no, et cetera, we can merely compute the ratio that we had earlier, except that now we have N terms in the product as opposed to just two, plus, at the end, the a priori probabilities of the two different types of behavior that we are trying to predict.

This ratio is called a likelihood ratio, because each of the terms in the product is a likelihood, in the sense that if x1 = yes, then the probability here is the probability that x1 = yes for the situation when people are buying, versus the probability that x1 = yes when people are just browsing. And we do that for all the possible words. Many of them will not occur, so the xi's for them will be no, but we still have to take all the possible words into account. We choose yes if this likelihood ratio is greater than a threshold, and no otherwise.

We also normally take logarithms to turn multiplications into additions, so you will frequently hear the term log likelihood. The reason is that with these many, many probabilities, all less than one, being multiplied together, you will probably run out of decimal places and end up with all zeros. By taking logarithms you reduce the chance of such underflow occurring and make the computation stable.

In particular, note that you really need to consider all possible features, as the sketch below illustrates.
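Here is a minimal sketch of that computation in Python. It is not the lecture's own code: the feature names come from the lecture's example, but the conditional probabilities and priors are made-up placeholder values that would, in practice, be estimated from historical data.

```python
import math

# P(feature = yes | class) for each query-word feature (placeholder values).
p_yes_given_buyer   = {"red": 0.40, "flower": 0.30, "gift": 0.35, "cheap": 0.50}
p_yes_given_browser = {"red": 0.20, "flower": 0.25, "gift": 0.10, "cheap": 0.15}

# A priori (overall) probabilities of the two behaviors (also placeholders).
prior_buyer, prior_browser = 0.3, 0.7

def log_likelihood_ratio(x):
    """x maps each feature name to True ("yes") or False ("no").

    Sums log P(x_i | buyer) - log P(x_i | browser) over every feature,
    plus the log of the ratio of priors; working in logs turns the big
    product into a sum and avoids numerical underflow."""
    llr = math.log(prior_buyer) - math.log(prior_browser)
    for feature, present in x.items():
        p_b = p_yes_given_buyer[feature] if present else 1 - p_yes_given_buyer[feature]
        p_r = p_yes_given_browser[feature] if present else 1 - p_yes_given_browser[feature]
        llr += math.log(p_b) - math.log(p_r)
    return llr

# Example: the query contains "red" and "cheap" but not "flower" or "gift".
x = {"red": True, "flower": False, "gift": False, "cheap": True}
threshold = 0.0  # log of a likelihood-ratio threshold of 1
print("buyer" if log_likelihood_ratio(x) > threshold else "browser")
```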
That includes even the features that don't occur in the query. So, if you have a million words and your query consists of four words, you need to multiply probabilities for a million words. For the four words that you have, you will use the probability that the word occurs for buyers and the probability that the word occurs for browsers. For all the rest, you will have to include the probabilities that the word does not occur, for buyers in the numerator and for browsers in the denominator.

Now this makes it an extremely expensive method, but nevertheless quite useful. We will figure out how to deal with the expense using parallel computing next week. But for the moment, let's take a look at an example of trying to compute sentiment using machine learning and the Naive Bayes classifier.
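To make the point about absent words concrete before moving on, here is a small continuation of the assumed sketch above, with a six-word toy vocabulary standing in for the lecture's million words (the extra vocabulary words are made up for illustration):

```python
# Every vocabulary word is a feature, so even a two-word query produces one
# yes/no value per word, and each class's product gets one factor per word.
vocabulary = ["red", "flower", "gift", "cheap", "blue", "expensive"]
query = {"red", "cheap"}

# True ("yes") for words in the query, False ("no") for everything else.
feature_vector = {word: (word in query) for word in vocabulary}

present = [w for w, v in feature_vector.items() if v]
absent  = [w for w, v in feature_vector.items() if not v]
print(present)  # each contributes P(word present | class) to the product
print(absent)   # each contributes P(word absent | class) = 1 - P(word present | class)
```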