We now turn to finding rules from data. Rules are essentially correlations between different features. As we've seen earlier, features need not be independent, so finding rules is essentially finding out which features, or which sets of features, are related to each other, or correlated. In some sense we are trying to cluster features rather than cluster the data.

For example, we might want to discover rules which say that if you have "like" and "lot" in a comment, then it's very likely to be positive, and if you have "not" and "like" in a comment, it's very likely to be negative. Similarly, one might like to find a rule which says that searching for "flowers" means that one is searching for a cheap gift. Or, in the case of the animal example: if an animal is a bird, then it chirps or squeals; if an animal chirps and has two legs, then it's a bird. In the case of people buying items, we might want to discover the interesting fact that those who buy diapers and milk also buy beer.

In each case, what one is trying to do is not look at clustering of objects or data items, but at clustering the features, and seeing which features co-occur in the data very often. So what one is trying to do is find statistical rules based on the frequency of co-occurrence of features.

In our unified framework, what this means is that we're trying to find regions of X that indicate correlation of features, that is, regions that have more data items than would be expected if all the features were independent. Think about it again: if features were completely independent, you wouldn't see that many instances of birds chirping or squealing, because those features wouldn't necessarily co-occur. But if the features are correlated, that is, there actually are birds that chirp and squeal, you'll see many such instances, and those regions will be more populated than, say, the region of animals that chirp and have four legs.

So this time, instead of comparing our data to random data, we're comparing our data to data which has the same features, but where the features are independent. P0 is now the distribution assuming independent features, so it's just the product of the distributions of each feature. These are not uniform distributions; they are the independent marginal distributions of every feature in the data that we actually observe.
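To make "more data items than would be expected under independence" concrete, here is a minimal sketch that compares the observed frequency of a feature combination against the frequency predicted by the product of the marginal feature frequencies (the P0 just described). The animal records and feature names are made up purely for illustration.

```python
# Toy, made-up binary feature data: each record lists the features an animal has.
records = [
    {"bird", "chirps", "two_legs"},
    {"bird", "chirps", "squeals", "two_legs"},
    {"bird", "squeals", "two_legs"},
    {"dog", "barks", "four_legs"},
    {"cat", "meows", "four_legs"},
    {"bird", "chirps", "two_legs"},
]
n = len(records)

def marginal(feature):
    """Fraction of records containing a single feature (its marginal frequency)."""
    return sum(feature in r for r in records) / n

def observed(features):
    """Fraction of records containing all of the given features together."""
    return sum(set(features) <= r for r in records) / n

for combo in [("bird", "chirps"), ("chirps", "four_legs")]:
    expected = 1.0
    for f in combo:
        expected *= marginal(f)  # frequency predicted if the features were independent (P0)
    print(combo, "observed:", round(observed(combo), 3),
          "expected under independence:", round(expected, 3))
```

On this toy data, "bird" and "chirps" co-occur more often than independence predicts, while "chirps" and "four_legs" co-occur less often; regions of the first kind are where rules can be found.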
So we conduct our thought experiment again and set y equal to one for all the real data, that is, the data that actually exists, and add y equal to zero points, that is, extra points, picked so that each feature is chosen not uniformly at random, but independently from the data. So the probability of choosing "chirps" depends on the number of times "chirps" occurs in the data, independently of any other feature. Similarly, the probability of choosing "four legs" depends only on how many times "four legs" occurs in the data, rather than on any other feature occurring alongside it. So instead of comparing the actual data to random data, one compares it to artificially generated data where each feature is chosen independently.

Now we again choose f(x) to be the expected value of Y given X, where Y is one for the real data and zero for these newly added points. This again estimates R/(1 + R), where R is P(x)/P0(x). This time P0(x) is not the uniform random distribution, but the distribution that assumes the features are independent. Again, the extreme regions of f(x) indicate regions that have high support (we'll explain what support means in a minute), and they are therefore regions where we can potentially find rules like the ones shown above.
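Here is a minimal sketch of that thought experiment, assuming a small binary feature matrix and using scikit-learn's logistic regression as the classifier (any probabilistic classifier that estimates E[Y|X] would do); the data is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy, made-up binary feature matrix: rows are animals, columns are features
# (say: chirps, two_legs, four_legs).
X_real = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
])
n, d = X_real.shape

# Synthetic y = 0 data: each feature is drawn independently from its own
# marginal frequency in the real data, i.e. a sample from P0.
marginals = X_real.mean(axis=0)
X_indep = (rng.random((n, d)) < marginals).astype(int)

X = np.vstack([X_real, X_indep])
y = np.array([1] * n + [0] * n)  # 1 = real data, 0 = independent-feature data

clf = LogisticRegression().fit(X, y)

# The predicted probability of class 1 estimates f(x) = E[Y | X = x] = R / (1 + R),
# so the ratio R = P(x) / P0(x) can be recovered as f / (1 - f).
f = clf.predict_proba(X_real)[:, 1]
R = f / (1 - f)
print("estimated density ratio R for each real record:", np.round(R, 2))
```

Records whose estimated ratio R is well above 1 lie in regions that are over-represented relative to the independent-feature baseline, which is exactly where we will look for rules.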
Let's see how such rules are found. The most popular technique for finding rules in data is called association rule mining, and it works on data consisting of instances that have features. For example, you could have animals that have features, or shopping transactions where the features are the items that people buy. One wants to infer rules of the form "a, b, and c implies d", where all four are just features. We don't really know up front which ones belong on the left and which on the right; we decide based on certain principles.

The first principle is that the combination a, b, c, and d has high support. What this means is that the probability of finding the combination a, b, c, d in the data is reasonably high. Typically we choose a support threshold of 20, 30, or 40%, which are considered extremely high; the point is that the combination occurs often enough to warrant being considered as a potential rule.

The next principle is that the rule one infers from this combination of four features is one where the confidence is high. All that means is that, of all the cases where you have a, b, and c, a large number of them also have d. So now you are considering more instances than just the ones containing a, b, c, and d: you are also considering those that have only a, b, and c, but of those, a large number also have d. So the conditional probability of d given a, b, and c is high, and there is high confidence that a, b, and c result in d actually occurring.

And lastly, the rule should be interesting, in the sense that the confidence that d occurs given a, b, and c is significantly higher than the probability of d occurring just by itself. So, for example, if d always occurred in the data, that is, everybody always bought, say, milk, then any rule you came up with, with whatever confidence, would not be interesting, because the probability of d given a, b, and c would be the same as the probability of d. On the other hand, suppose one found that those who bought diapers also bought beer more often than people bought beer in general. That means the propensity to buy beer is higher if the person also buys diapers. Now, that's interesting, because it tells us something about how one might want to place items on the shelves in a store. This is actually the classical example of a correlation between beer and diapers that a large retail chain found way back in the 80s, and it sparked all the interest in what is called market basket analysis, resulting in algorithms for association rule mining.

But association rule mining is about more than just transactions. If one thinks about objects in the real world, one might come to conclusions like "birds chirp", "squirrels squeal", or "lions roar", which is quite interesting, since these are rules we consider to be common sense. A technique like association rule mining might actually allow us to discover such rules amongst the features, beyond just knowing that features are correlated with each other.
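To tie the three principles together, here is a minimal sketch that scores a hypothetical candidate rule {diapers, milk} implies beer on a small made-up transaction list, computing its support, its confidence, and the ratio of confidence to the baseline probability of the consequent (commonly called lift), which captures the interestingness condition above.

```python
# Toy, made-up market-basket transactions.
transactions = [
    {"diapers", "milk", "beer"},
    {"diapers", "milk", "beer", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"chips", "bread"},
    {"diapers", "beer"},
]
n = len(transactions)

def freq(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

antecedent = {"diapers", "milk"}   # the a, b, c side of the rule
consequent = {"beer"}              # the d side of the rule

support = freq(antecedent | consequent)               # how often the whole combination occurs
confidence = support / freq(antecedent)               # P(d | a, b, c)
lift = confidence / freq(consequent)                  # confidence compared with P(d) alone

print(f"support    = {support:.2f}")
print(f"confidence = {confidence:.2f}")
print(f"lift       = {lift:.2f}")  # > 1 means the rule is 'interesting'
```

A candidate rule is kept only if its support and confidence clear the chosen thresholds and its lift is meaningfully above 1; otherwise, as in the always-bought-milk example, the rule tells us nothing beyond the baseline frequency of the consequent.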