Let's return to our Naive Bayes classifier now, with rain as the class, either yes or no, and the grass being wet as the only feature that we're going to measure.

Suppose we have the conditional probability that the grass is wet given that there is rain. Obviously, if there is rain, the chance that the grass is wet is very high, and the chance that it is not wet is small but still there. And if there's no rain, there is a very low chance that the grass is wet, and of course a very high chance that it is not wet. Notice again that this is a conditional probability table, so the values for a particular value of rain, say rain being yes, have to add up to one.

The prior, or a priori, probability that it rains, regardless of whether the grass is wet or not, is that twenty percent of the time it rains in this locality, and 80% of the time it doesn't.

Now, if we just have this one feature, Bayes' rule gives us the fact that the joint probability of R and W can be factored in two ways: as the probability of R given W times the probability of W, or as the conditional probability of W given R times the prior probability of R.

Now, suppose we are given some evidence: we actually observe that the grass is wet, that is, W = yes. We can condition this joint probability by restricting it to the case where W is yes. So on one side we get the probability of R given W = yes times the probability that W = yes, and on the other side the probability of W = yes given R, which is the likelihood, times the prior, which doesn't have anything to do with W, so we don't have to condition it.

The posterior probability of R given W = yes can now be written as the right-hand side divided by the probability of the evidence. The inverse of the probability that W = yes we'll just write as sigma; we'll use sigma wherever we need to refer to one over the probability of the evidence that we're observing. Pardon the use of sigma for this purpose: earlier we used sigma as the selection operator, but from now on we will use sigma only as the inverse of the evidence probability.

Study this carefully: what we want is the a posteriori probability of R given the evidence, which is simply proportional to the likelihood multiplied by the prior.
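To keep the notation straight, here is the derivation just described, written out as formulas (a sketch of what is said verbally above):

\[
P(R, W) \;=\; P(R \mid W)\,P(W) \;=\; P(W \mid R)\,P(R)
\]

Fixing the evidence W = yes and dividing through by P(W = yes):

\[
P(R \mid W{=}\text{yes}) \;=\; \sigma \, P(W{=}\text{yes} \mid R)\,P(R),
\qquad \sigma \;=\; \frac{1}{P(W{=}\text{yes})}
\]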
In SQL, we can write this as: select the sum of P times P from these two tables, where W equals yes and the R values are equal, because we have to make sure that we're joining these tables on their common attribute R, and finally grouping by R. This is what we've stated before about how to multiply two probability tables.

So the result is what we get by restricting our cases to the rows where W is yes, multiplying the P values, and adding them up. In fact there is nothing to add up here, since only two rows remain, each with a distinct value of R. So we simply multiply 0.9 by 0.2 to get the first row, R = yes, and 0.2 by 0.8 to get the second row. This is the product of these two potentials, or probability tables.

Now, we need to normalize so that the total probability of R is one; the sum has to be one. So essentially, we take the probability of R being yes as 0.18 divided by the sum of these two values, 0.18 plus 0.16, and we get 0.53, or 53%. This is the chance, or our belief, that it's raining once we see the grass being wet.

The reason it's so small, when one might expect it to be a little higher, is that there are cases where the grass can be wet without there being any rain; in fact, twenty percent of the time the grass is wet when there's no rain. In addition, it hardly ever rains. Combining these two things together gives us only a 53% chance of saying that it's actually raining if the grass is wet.

Now, let's see what happens when we have more than one feature, as we normally do in a Bayesian classifier. We're going to have another feature called thunder, which will say whether or not we are hearing thunder. And let's, for the moment, assume that we don't really know whether we heard thunder at all. But the probability of hearing thunder, given that it's raining, is 0.8, of not hearing it is 0.2, and so on. So we have a conditional probability table even for this variable thunder, but in this case let us assume that we haven't actually observed thunder. We could be asking our neighbor over the phone whether the grass is wet, and then trying to conclude whether or not it's raining there. But we didn't ask our neighbor whether they're hearing thunder.
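Before we bring thunder in, here is a concrete sketch of the two-table query described above. The table and column names (PW_R, PR, W, R, P) are assumptions for illustration; the lecture just calls them "these two tables":

```sql
-- P(W | R): conditional probability table for wet grass given rain
CREATE TABLE PW_R (W TEXT, R TEXT, P REAL);
-- P(R): prior probability of rain
CREATE TABLE PR (R TEXT, P REAL);

INSERT INTO PW_R VALUES ('yes','yes',0.9), ('no','yes',0.1),
                        ('yes','no', 0.2), ('no','no', 0.8);
INSERT INTO PR   VALUES ('yes',0.2), ('no',0.8);

-- Multiply the two tables, restricted to the evidence W = 'yes',
-- joining on the common attribute R and grouping by R:
SELECT t1.R, SUM(t1.P * t2.P) AS P
FROM PW_R t1, PR t2
WHERE t1.W = 'yes' AND t1.R = t2.R
GROUP BY t1.R;
-- R = 'yes': 0.9 * 0.2 = 0.18;  R = 'no': 0.2 * 0.8 = 0.16
-- Normalizing: 0.18 / (0.18 + 0.16) = 0.53, the 53% computed above.
```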
So now, the probability of R and T given W = yes, because that's all we know, is, by the same equation that we had earlier, the probability of W = yes given R, times the probability of T given R, times the probability of R. This is our Bayesian formula. Again, we have the probability of the evidence over here, but this time we don't have anything that says T equals yes or no. So we need to sum out T in this product if we want to know only the probability of rain.

The way we sum that out is that we, again, do SQL. We select R and the sum of the products of all three of these P columns from these three tables, where W is yes and all the common variables, which here means only R, are equal; the only common variable between these tables is R. And then we group by R. This effectively sums out T, because we're only selecting R and summing up the values for the different values of T.

Now, you can verify that you could do this by first joining T1 and T3, that is, the P(W given R) table and the P(R) table, just like we did earlier, and then joining the result of that with the P(T given R) table. So we just take the result we had earlier and join it with this new table. But notice that we get the same result: for R = yes, we multiply the entry by 0.8 for T = yes and by 0.2 for T = no, and when we sum them up, we get the same value back, because 0.8 plus 0.2 is one. Similarly for R = no with 0.1 and 0.9, so it doesn't change anything. This is to be expected, since there is no new evidence compared to earlier; just by including something new in our diagram, we couldn't expect to change our belief in R.

Another important point: if you remember, in one of the homeworks we asked whether ignoring some of the features changes our belief. Well, it does change our belief compared to the case where we had actually observed the features. But it's, in some sense, equally correct, because suppose we simply didn't observe the feature. This is actually the same as a Bayesian classifier with two features, of which only one is being observed. So you could have millions of features and observe only one or two, and you'd still get the same result just by putting them all in your classifier. The summing-out process makes sure that this always works and you don't get any wrong results.
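The summing-out query might look like this in concrete SQL, reusing the sketch tables from before and adding an assumed PT_R table for P(T given R):

```sql
-- P(T | R): conditional probability table for thunder given rain,
-- laid out by analogy with PW_R (an assumption for illustration)
CREATE TABLE PT_R (T TEXT, R TEXT, P REAL);
INSERT INTO PT_R VALUES ('yes','yes',0.8), ('no','yes',0.2),
                        ('yes','no', 0.1), ('no','no', 0.9);

-- Join all three tables on the common attribute R. Grouping by R
-- alone sums out T, since both T rows for each R land in one group:
SELECT t1.R, SUM(t1.P * t2.P * t3.P) AS P
FROM PW_R t1, PT_R t2, PR t3
WHERE t1.W = 'yes' AND t1.R = t2.R AND t2.R = t3.R
GROUP BY t1.R;
-- R = 'yes': 0.18 * (0.8 + 0.2) = 0.18;  R = 'no': 0.16 * (0.1 + 0.9) = 0.16
-- Unchanged, as expected: no new evidence, so still 53% after normalizing.
```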
Go back to that example where we asked whether partial evidence changes things: in fact, it doesn't. It is simply as if the feature didn't exist. Of course, if we are observing the feature, it's better to include it. But by ignoring it, we're not saying that we had a wrong result; it's just that we didn't have that feature to measure, that's all.

Now, let's see what happens if we actually do have evidence about T, say T = yes. So now we're looking for the probability of rain given that W and T are both yes. In this case, we restrict our multiplication by the P(T given R) table to the rows where T = yes only. We use the same SQL, only with another condition in the select statement that restricts us to only those rows of this table, again joining with the prior join that we had of T1 and T3.

But with the restriction that T = yes, we now get a different result, because we multiply the first row by 0.8 and the second row by 0.1, and the other rows don't count. So we get a different result, and normalizing gives us the probability of rain being yes, given the evidence, as 90%. In a sense, our belief has undergone a revision from the earlier value of 53% to 90%. New evidence has changed our belief.

In classical logic, once you've asserted that, say, rain occurs or doesn't occur, you can't change that belief. But in probabilistic reasoning, belief can be revised. This is very important.
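For completeness, here is a sketch of the evidence-restricted query just described, using the same assumed tables as before, with the extra condition on T:

```sql
-- Same three-way join, but the thunder table is now restricted to
-- the observed evidence T = 'yes', so nothing is summed out:
SELECT t1.R, SUM(t1.P * t2.P * t3.P) AS P
FROM PW_R t1, PT_R t2, PR t3
WHERE t1.W = 'yes' AND t2.T = 'yes'
  AND t1.R = t2.R AND t2.R = t3.R
GROUP BY t1.R;
-- R = 'yes': 0.18 * 0.8 = 0.144;  R = 'no': 0.16 * 0.1 = 0.016
-- Normalizing: 0.144 / (0.144 + 0.016) = 0.90, the 90% belief above.
```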