Let us now ask how to compute such association rules from the data. There can be very many combinations of features: if there are N features, each taking two values, we immediately have 2^N combinations, and even for moderate values of N that is far too many to explore exhaustively. The key observation leading to an efficient algorithm for association rule mining was made by Agrawal and Srikant in the 1990s: if a combination A and B has support greater than S, that is, it occurs more than S times in the data, then A and B must each individually have support greater than S as well, because otherwise the combination A and B could not occur more than S times. So, to find combinations with high support, one can proceed iteratively, starting with all single features that have support greater than S. This set necessarily contains the ingredients of any pair, triple or quadruple that has high support, so we do not have to check all combinations; we can go step by step.

This is how the algorithm works. We first scan all the records and note which feature values have support greater than S. For example, diapers might have high support, beer might have high support, nuts might have high support, but other items do not. Then we scan only the reduced data and throw away all the rest. Remember, we have discarded a lot of data in this very first step: all records in which none of the high-support items occur can simply be dropped, because they can never contain pairs or triples with high support. We scan this subset of the data for all pairs with high support, discard some more records, scan the reduced data for triples, and so on, until we find no more sets with high support. This considerably reduces the complexity of finding rules and results in an efficient algorithm.

Now think about the complexity of rule mining once we have all the singletons, pairs, triples, and so on with high support. We can check each such set for confidence and interestingness. In practice you would not get that many candidate rules, so one can simply go set by set, compute the confidence, again by counting, and the interestingness for each possible choice of antecedent and consequent from the features in the set, and decide which ones to declare as the rules learned from the data.

Note that this is all just counting: you are only scanning the data again and again and counting things, whether it is to find the sets with high support or, once you have found them, to compute the confidences for the different choices of antecedent and consequent and to compute the interestingness. Because it is just counting, MapReduce-like paradigms are ideal for such tasks.

The association rule mining algorithm we just described is called the Apriori algorithm. It is a very popular algorithm and is used in many settings, not just market basket analysis.
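To make the counting concrete, here is a minimal Python sketch of the level-wise idea, assuming transactions are simply lists of item names. The function names and the representation of itemsets as sorted tuples are illustrative choices, not part of the original presentation; this is a sketch of the Apriori principle, not a tuned implementation.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Find all itemsets occurring in at least min_support transactions."""
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in set(t):
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)

    # Discard items and records that can no longer contribute.
    keep = {s[0] for s in frequent}
    data = [set(t) & keep for t in transactions]
    data = [t for t in data if len(t) >= 2]

    k = 2
    while frequent:
        # Candidate k-itemsets: unions of frequent (k-1)-itemsets,
        # pruned so that every (k-1)-subset is itself frequent.
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = tuple(sorted(set(prev[i]) | set(prev[j])))
                if len(union) == k and all(
                    sub in frequent for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Count candidates by scanning the reduced data.
        counts = {c: 0 for c in candidates}
        for t in data:
            for c in candidates:
                if set(c) <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

def rules_from_itemsets(freq, min_confidence):
    """Enumerate rules antecedent -> consequent and keep the confident ones."""
    rules = []
    for itemset, support in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                ante_support = freq.get(tuple(sorted(antecedent)))
                if ante_support:
                    confidence = support / ante_support
                    if confidence >= min_confidence:
                        consequent = tuple(sorted(set(itemset) - set(antecedent)))
                        rules.append((antecedent, consequent, confidence))
    return rules

baskets = [["milk", "diapers", "beer"], ["diapers", "beer"],
           ["milk", "diapers"], ["milk", "beer"], ["diapers", "beer", "nuts"]]
freq = apriori_frequent_itemsets(baskets, min_support=3)
print(rules_from_itemsets(freq, min_confidence=0.7))
```

Everything here is counting over repeated scans of the data, which is why it parallelizes so naturally in a MapReduce-style setting.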
But there are some problems with it. First of all, it depends on why one is trying to find association rules in the first place. One reason to look for rules is to characterize a class: what constitutes a buyer, and which kinds of words do buyers use? What constitutes an elephant, and which features does an elephant have? What constitutes a positive sentiment, and which words characterize positivity? The trouble is that association rules only work on high support, so small classes that do not enjoy high support, for which there simply are not enough instances, are left out. For such situations one does not normally use association rules; one uses decision trees instead. We will not go into decision trees in detail, but suffice it to say that they are based on mutual information: they compute the mutual information between each feature, or set of features, and the class in question, very much like the calculations we did earlier of the mutual information between different features and the class being decided. Of course, computing mutual information is costly, because it requires the joint probabilities to be known, and it is not really feasible for large numbers of features.

The other reason one might want to learn rules is to figure out some structure in the world: which combinations of features actually imply others, learning structure without presupposing any classes. Association rules are great for such tasks, except that, because they rely on high support, negative rules such as "milk and not diapers implies not beer" are lost. We have no way of representing "not diapers": we can only look for the value diapers, because "not diapers" could be any other value, such as cereal, butter or cola, and there are just too many possible combinations to check. In such situations one does something a little different, and techniques called interesting subgroup discovery are used instead. They are very similar to association rules, except that instead of relying only on high support they look for correlations in the data. One of the earliest papers on this idea was published in 1997 by Sergey Brin and co-authors; Sergey Brin, of course, is one of the founders of Google, and this was before Google.

So learning rules from data is quite possible, and association rules are simple: high-support situations are easily covered, but figuring out low-support situations or negative rules requires more sophisticated techniques.
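To give a flavor of what "looking for correlations" could mean in practice, here is a small sketch, again assuming transactions are sets of items. It builds the 2x2 presence/absence table for a pair of items and computes a chi-squared statistic, so absence ("not diapers") contributes just as much as presence, and negatively correlated pairs can surface. The function name and the chi-squared choice are illustrative, not the exact method of the 1997 paper.

```python
def chi_squared_2x2(transactions, item_a, item_b):
    """Chi-squared statistic for the 2x2 presence/absence table of two items."""
    n = len(transactions)
    a = sum(1 for t in transactions if item_a in t and item_b in t)      # both present
    b = sum(1 for t in transactions if item_a in t and item_b not in t)  # only item_a
    c = sum(1 for t in transactions if item_a not in t and item_b in t)  # only item_b
    d = n - a - b - c                                                    # both absent
    observed = [a, b, c, d]
    # Expected counts if the two items occurred independently.
    pa, pb = (a + b) / n, (a + c) / n
    expected = [n * pa * pb, n * pa * (1 - pb), n * (1 - pa) * pb, n * (1 - pa) * (1 - pb)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

baskets = [{"milk", "diapers"}, {"milk", "beer"}, {"diapers"}, {"milk", "beer"}, {"beer"}]
# A large value signals correlation, positive or negative, regardless of raw support.
print(chi_squared_2x2(baskets, "diapers", "beer"))
```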