So, in order to compute these conditional probabilities when there are a large number of possible words or features, we need to do a little more work. Let's look at the simple case of just one word, say "red". There are two possibilities: red can be present or absent. Suppose red is present in R of our cases; out of these, I cases have a B and the rest don't. The total number of cases that are B situations is, of course, K, as before.

So let's see what the simple conditional probabilities are in this case. The probability of a B given that there is an R is just I over R. On the other hand, the probability of an R overall is just R over the total number N. And the probability that both R occurs and there is a B is I over N. Notice this is the joint probability, so you divide by the total number of instances N, rather than by either of the conditional counts K or R.

Bayes' rule is actually just simple arithmetic. In particular, if you write I over N as I over R times R over N, let's see what you get. Well, we already know what I over R is: it's just the conditional probability of a B given R. And R over N is the probability of R itself. So simple arithmetic tells us that the joint probability of R and B is the product of the conditional probability and the a priori probability. This is just Bayes' rule: the probability of B and R is the conditional probability of B given R times P of R, which is also the same thing as the probability of R given B times the probability of B. We can see that by rewriting I over N, not by introducing an R, but by introducing a K: you get I over K, which is just the probability of R out of all those that are B, times K over N, which is just the probability of B. Bayes' rule, which some of you may or may not remember, turns out to be just simple arithmetic. As we shall soon see, Bayes' rule is critical to machine learning because it allows us to compute any of those many, many joint probabilities even if there is no data for a particular combination.

Before we can do machine learning using Bayesian techniques, we need one more important concept, and that is independence. Think about two words like "red" and "cheap". As before, we have red equal to yes for R queries, cheap equal to yes for C cases, and in I cases both keywords are present. The probability of red occurring is clearly R over N, and the probability of cheap occurring is C over N, as before. Similarly, the conditional probability of red given that cheap occurs can be computed. Independence says that the probability of red does not depend on whether or not the word cheap is already present in the query; in other words, the probability of R should be the same as the probability of R given C. Similarly, the probability that cheap occurs should not depend on whether or not red occurs. In such situations, these two features are independent. Of course, that might not necessarily always be the case. For example, somebody searching for "big data" might actually search for "MapReduce" at the same time, rather than something else, like "red" or "flower".
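To make the counting argument concrete, here is a minimal Python sketch of the arithmetic behind Bayes' rule. The particular counts N, K, R, and I are made up purely for illustration (they are not from the lecture); the point is only that the two factorizations of the joint count I over N agree.

```python
# Illustrative counts (assumptions, not lecture data):
# N queries in total, K of them end in a buy (B), R contain the word "red",
# and I contain "red" AND end in a buy.
N, K, R, I = 100, 20, 30, 12

p_B_given_R = I / R   # P(B | R): of the R "red" queries, I are buys
p_R = R / N           # P(R): fraction of all queries containing "red"
p_R_given_B = I / K   # P(R | B): of the K buys, I contain "red"
p_B = K / N           # P(B): fraction of all queries that are buys
p_R_and_B = I / N     # joint probability P(R, B)

# Bayes' rule is "just arithmetic": both factorizations give the same joint.
assert abs(p_R_and_B - p_B_given_R * p_R) < 1e-12
assert abs(p_R_and_B - p_R_given_B * p_B) < 1e-12

# Which also gives the usual statement of Bayes' rule:
# P(B | R) = P(R | B) * P(B) / P(R)
assert abs(p_B_given_R - p_R_given_B * p_B / p_R) < 1e-12

print(p_R_and_B, p_B_given_R * p_R, p_R_given_B * p_B)  # all 0.12
```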
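And here is a similarly small sketch of the independence check. The counts R, C, and I_rc are again assumptions, chosen so that "red" and "cheap" come out independent; changing the overlap count I_rc would break the equalities.

```python
# Illustrative counts (assumptions): out of N queries, R contain "red",
# C contain "cheap", and I_rc contain both words.
N, R, C, I_rc = 100, 30, 20, 6

p_red = R / N                  # P(red)
p_cheap = C / N                # P(cheap)
p_red_given_cheap = I_rc / C   # P(red | cheap)
p_cheap_given_red = I_rc / R   # P(cheap | red)

# Independence: knowing that "cheap" is in the query tells us nothing
# about whether "red" is, and vice versa.
print(p_red, p_red_given_cheap)      # 0.3 vs 0.3 -> same
print(p_cheap, p_cheap_given_red)    # 0.2 vs 0.2 -> same

# Equivalently, the joint probability factors into the product of marginals.
assert abs(I_rc / N - p_red * p_cheap) < 1e-12
```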