Let us now ask how to compute such association rules from the data. There might be many combinations of features: for example, if there are N features, each taking two values, then we immediately have two to the N combinations, and even for moderate values of N you would have far too many combinations to explore.

The key observation leading to an efficient algorithm for association rule mining was made by Agrawal and Srikant in the 90s. The observation was that if a combination A and B has support greater than S, meaning it occurs more than, say, S times in the data, then so do A and B individually, because otherwise the combination A and B could not occur more than S times.

So, in some sense, to find combinations with high support one can proceed iteratively. Start with all single-feature combinations that have high support, that is, support greater than S. Any pair, triplet, or quadruplet with high support must be built entirely from features in this set, so we do not have to check all combinations; we can go step by step.

This is how the algorithm works. We first scan all the records and note down which features have support greater than S, that is, which particular values of data items occur often enough. For example, diapers might have high support, beer might have high support, nuts might have high support, but other items do not.

Then we scan only this subset of the data, throwing away all the rest, and look for pairs that have high support. Remember, we have already discarded a lot of data in the very first step: all those records in which none of the high-support features occurred can simply be dropped, because they can never contain pairs or triplets with high support. We then scan the reduced data for all the pairs with high support, discard some more data, scan the reduced data set for triples, and so on, until finally we do not find any more sets with high support. This considerably reduces the complexity of finding rules and results in an efficient algorithm.

Now think about the complexity of rule mining. Once we have all the singletons, pairs, triplets, and so on with high support, we can check each such set for confidence and interestingness.
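To make these two counting steps concrete, here is a minimal sketch in Python. The transaction list, the function names, the support threshold, and the use of lift as the interestingness measure are illustrative assumptions for this sketch, not details given in the lecture.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Levelwise (Apriori-style) support counting: keep only itemsets that
    occur at least min_support times, growing them one item per pass."""
    # Pass 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Build level-k candidates only from items that survived level k-1,
        # and prune any candidate with an infrequent (k-1)-subset.
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))]
        # One more scan of the data to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            ts = set(t)
            for c in candidates:
                if c <= ts:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

def rules_from_itemsets(itemset_counts, n_transactions,
                        min_confidence=0.6, min_interest=1.0):
    """Split each frequent itemset into antecedent -> consequent and keep
    rules whose confidence and interest (lift) clear the thresholds.
    Every quantity needed is already in the support counts."""
    rules = []
    for itemset, count in itemset_counts.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                confidence = count / itemset_counts[antecedent]  # P(consequent | antecedent)
                interest = confidence / (itemset_counts[consequent] / n_transactions)  # lift
                if confidence >= min_confidence and interest >= min_interest:
                    rules.append((antecedent, consequent, confidence, interest))
    return rules

# Hypothetical toy data to show the calling convention.
baskets = [{"beer", "diapers", "nuts"}, {"beer", "diapers"},
           {"milk", "diapers"}, {"beer", "nuts"}]
counts = frequent_itemsets(baskets, min_support=2)
print(rules_from_itemsets(counts, len(baskets)))
```

The point of the sketch is that every quantity is obtained by repeated scans and counts over a shrinking candidate set.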
In practice, you would not get that many rules, so one can simply go rule by rule, compute the confidence, again by counting, and the interestingness of each rule for each possible choice of antecedent and consequent from the features in the rule, and decide which ones to declare as the rules learned from the data.

Note that this is all just counting. You are only scanning the data again and again and counting things, whether it is to find the sets with high support or, once you have found them, to compute the confidences for the different positions of antecedent and consequent and to compute the interestingness. Since it is just counting, map-reduce-like paradigms are ideal for such tasks.

The association rule mining algorithm we just described is called the Apriori algorithm. The Apriori algorithm is very popular and is used in many settings, not just market basket analysis. But there are some problems with it. First of all, it depends on why one is trying to find association rules. One of the reasons one tries to find rules is to characterize a class: what constitutes a buyer, and which kinds of words do they use? What constitutes an elephant, and what features does an elephant have? Or what constitutes a positive sentiment, and which words characterize positivity? The trouble is, association rules only work with high support, so small classes that do not enjoy high support, simply because there are not enough instances of them, get left out.

For such situations one does not normally use association rules, but instead uses what are called decision trees. We will not go into decision trees in detail, but suffice it to say that they are based on mutual information: they compute the mutual information between each feature, or each set of features, and the class in question, very much like the calculations we did earlier of mutual information between different features and the class being decided. Of course, computing mutual information is costly, because it requires all the joint probabilities to be known, and that is not really feasible for large numbers of features.
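Since the lecture only sketches this idea, here is a small illustrative example of how the mutual information between one feature and the class could be estimated from joint counts; the record layout, the function name, and the use of simple frequency estimates are my own assumptions.

```python
import math

def mutual_information(records, feature, label):
    """Estimate I(feature; class) from joint counts over the records.
    Each record is assumed to be a dict mapping feature names (and the
    class key `label`) to values; probabilities are plain frequencies."""
    n = len(records)
    joint, p_f, p_c = {}, {}, {}
    for r in records:
        f, c = r[feature], r[label]
        joint[(f, c)] = joint.get((f, c), 0) + 1
        p_f[f] = p_f.get(f, 0) + 1
        p_c[c] = p_c.get(c, 0) + 1
    mi = 0.0
    for (f, c), count in joint.items():
        p_fc = count / n
        mi += p_fc * math.log2(p_fc / ((p_f[f] / n) * (p_c[c] / n)))
    return mi
```

For sets of features, the same formula runs over tuples of feature values, which is exactly why the number of joint probabilities to estimate blows up as the number of features grows.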
The other reason why one might want to learn rules is to figure out some structure in the world: to figure out which combinations of features actually imply others, and essentially learn some structure of the world without presupposing any classes.

Association rules are great for such tasks, except that, because they rely on high support, negative rules, such as "milk and not diapers implies not beer", are lost. We have no way of figuring out "not diapers": we can only look for the value diapers, because "not diapers" could be any other value, like cereal, butter, or cola, and there are just too many possible combinations hiding inside "not diapers" to check.

In such situations one does something a little different, and techniques called interesting subgroup discovery are used instead. They are very similar to association rules; however, instead of relying only on high support, they look for correlations in the data. One of the earliest papers that talked about this concept was a 1997 paper by Sergey Brin and [inaudible]. Sergey Brin, of course, is one of the founders of Google; this was before Google.

So learning rules from data is quite feasible, and association rules are simple: high-support situations are easily covered, but figuring out small-support situations or negative rules requires more sophisticated techniques.
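As a closing illustration of why negative rules need a correlation-style check rather than raw support counting, here is a small sketch; the function name, the dictionary encoding of negated antecedents, and the use of lift as the correlation measure are assumptions made for this example, not the specific measure used in Brin's paper.

```python
def negated_rule_stats(transactions, antecedent, consequent_item):
    """Confidence and lift of a rule whose antecedent may require items to be
    ABSENT, e.g. {"milk": True, "diapers": False} -> not beer, by direct
    counting.  Support-based mining never surfaces "not diapers" as an item,
    so a correlation-style check like this is needed instead."""
    n = len(transactions)
    def holds(t):
        return all((item in t) == present for item, present in antecedent.items())
    a = sum(1 for t in transactions if holds(t))
    c = sum(1 for t in transactions if consequent_item not in t)   # "not beer"
    both = sum(1 for t in transactions if holds(t) and consequent_item not in t)
    if a == 0 or c == 0:
        return None
    confidence = both / a
    lift = confidence / (c / n)   # lift > 1 suggests a genuine (negative) association
    return confidence, lift

# Hypothetical usage with the toy baskets from the first sketch:
# negated_rule_stats(baskets, {"milk": True, "diapers": False}, "beer")
```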