We now turn to finding rules from data. Rules are essentially correlations between features: as we've seen earlier, features need not be independent, so finding rules means finding out which features, or which sets of features, are related to each other, or correlated. In some sense we are trying to cluster features rather than cluster the data.

For example, we might want to discover rules which say that if you have "like" and "lot" in a comment, then it's very likely to be positive, and if you have "not" and "like" in a comment, it's very likely to be negative. Similarly, one might like to find a rule which says that searching for flowers means that one is searching for a cheap gift. Or, in the case of the animal example: if an animal is a bird, then it chirps or squeals; if an animal chirps and has two legs, then it is a bird. In the case of people buying items, we might want to discover the interesting fact that those who buy diapers and milk also buy beer. In each case, one is not clustering objects or data items, but clustering the features, and seeing which features co-occur in the data very often. So what one is trying to do is find statistical rules based on the frequency of co-occurrence of features.

In our unified framework, what this means is that we are trying to find regions of X that indicate correlation of features, that is, regions that contain more data items than would be expected if all the features were independent. Think about it again: if the features were completely independent, you wouldn't see many instances of animals that both chirp and squeal, because these features would not systematically co-occur. But if these features are correlated, that is, there actually are birds that chirp and squeal, you will see many such instances, and those regions will be more populated than, say, the region of animals that chirp and have four legs.

So this time, instead of comparing our data to random data, we compare our data to data which has the same features, but where the features are independent. P0 is now the distribution assuming independent features, so it is just the product of the distributions of each feature. Note that these are not uniform distributions; they are simply the marginal distributions of each feature that we actually observe in the data.

We can now conduct our thought experiment again: set y = 1 for all the real data, that is, the data that actually exists, and add y = 0 points, extra points, where each feature is chosen not uniformly at random but from the data. So the probability of choosing "chirps" depends on the number of times "chirps" occurs in the data, independently of any other feature. Similarly, the probability of choosing "four legs" depends only on how many times "four legs" occurs in the data, rather than on any other feature occurring alongside it. So instead of comparing the actual data to random data, one compares it to artificially generated data where each feature is chosen independently.

Now, if we fit f(x), it is again the expected value of Y given X, where Y is one for the real data and zero for these newly added points. This again estimates R/(1 + R), where R = P(x)/P0(x); this time P0(x) is not the random, uniform distribution but the distribution that assumes the features are independent. Again, the extreme regions of f(x) indicate regions that have high support (we will explain what support means in a minute), and these are therefore regions where we can potentially find rules like the ones shown above.
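To make the thought experiment concrete, here is a minimal sketch in Python. The function name, the choice of random forest as the classifier, and the parameter values are illustrative assumptions, not something prescribed in the lecture; the essential steps are building the y = 0 contrast set by sampling each feature independently from its own empirical marginal, and reading f(x) off the classifier's predicted probability:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def contrast_with_independent_features(X, n_fake=None, seed=0):
    """Contrast real data (y = 1) against synthetic data (y = 0) whose
    features are drawn independently from each feature's empirical marginal."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_fake = n_fake or n
    # Sample each feature column on its own: this preserves every marginal
    # distribution but destroys all correlation between features (this is P0).
    X0 = np.column_stack([rng.choice(X[:, j], size=n_fake) for j in range(d)])
    X_all = np.vstack([X, X0])
    y_all = np.concatenate([np.ones(n), np.zeros(n_fake)])
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    clf.fit(X_all, y_all)
    # f(x) = P(y = 1 | x) estimates R / (1 + R) with R = P(x) / P0(x):
    # values near 1 mark regions where features co-occur more often
    # than independence would predict.
    return clf.predict_proba(X)[:, 1]
```

The rows of X where this estimated f(x) is close to one are exactly the over-populated regions, that is, the candidate regions from which rules can be extracted.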
The most popular technique for finding rules in data is called association rule mining, and it works on data consisting of instances that have features. For example, you could have animals that have features, or shopping transactions where the features are the items that people buy. One wants to infer rules of the form a, b, and c implies d, where all four are just features. We don't really know up front which ones belong on the left and which on the right, but we'll decide based on certain principles.

The first principle is that the combination a, b, c, and d has high support. What this means is that the probability of finding the combination a, b, c, d in the data is reasonably high. Typically we choose a support threshold of 20, 30, or 40 percent, which is quite high; the point is that the combination must occur often enough to warrant being considered as a potential rule.

The next principle is that the rule we infer from this combination of four features is one where the confidence is high. All that means is that of all the cases where you have a, b, and c, a large number of them actually have d as well. So now you are considering more instances than just the ones with a, b, c, and d together; you are also considering those with only a, b, and c. But of those that have a, b, and c, a large number also have d. In other words, the conditional probability of d given a, b, and c is high, so there is high confidence that a, b, and c result in d actually occurring.

Lastly, the rule should be interesting, in the sense that the confidence that d occurs given a, b, and c is significantly higher than the probability of d occurring just by itself. For example, if d always occurred in the data, say everybody always bought milk, then any rule you came up with, with whatever confidence, would not be interesting, because the probability of d given a, b, and c would be the same as the probability of d. On the other hand, if one found that those who bought diapers also bought beer more often than beer buyers in general, that is, the propensity to buy beer is higher if the person also buys diapers, then that is interesting, because it tells us something about how one might want to place items on the shelves in a store.

This is actually the classical example of a correlation between beer and diapers that a large retail chain found way back in the 80s. It sparked all the interest in what is called market basket analysis, and it resulted in algorithms for association rule mining. But association rule mining is useful for more than just transactions. If one thinks about objects in the real world, one might come to conclusions like "birds chirp", "squirrels squeal", or "lions roar", which is quite interesting, since these are rules we consider to be common sense. A technique like association rule mining might actually allow us to discover such rules among the features, beyond just knowing that features are correlated with each other.
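To make the three principles concrete, here is a minimal brute-force sketch in Python. The function name and thresholds are illustrative assumptions; practical systems use algorithms such as Apriori or FP-Growth rather than enumerating all itemsets:

```python
from itertools import combinations

def mine_rules(transactions, min_support=0.2, min_confidence=0.6, min_lift=1.1):
    """Brute-force rules of the form antecedent -> single item,
    checking support, confidence, and interestingness (lift)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def support(itemset):
        # Fraction of transactions containing every item in itemset.
        return sum(itemset <= t for t in transactions) / n

    rules = []
    for k in (2, 3, 4):  # itemsets like {a, b, c, d}
        for itemset in combinations(items, k):
            s = support(set(itemset))
            if s < min_support:              # principle 1: high support
                continue
            for d in itemset:                # try each item as the consequent
                lhs = set(itemset) - {d}
                conf = s / support(lhs)      # principle 2: P(d | a, b, c) high
                lift = conf / support({d})   # principle 3: beats P(d) alone
                if conf >= min_confidence and lift >= min_lift:
                    rules.append((lhs, d, s, conf, lift))
    return rules

baskets = [{"diapers", "milk", "beer"}, {"diapers", "beer"},
           {"milk", "bread"}, {"diapers", "milk", "beer", "bread"}]
for lhs, rhs, s, conf, lift in mine_rules(baskets):
    print(f"{sorted(lhs)} -> {rhs}  support={s:.2f} conf={conf:.2f} lift={lift:.2f}")
```

On these toy baskets the sketch recovers a rule like {diapers} -> beer with high confidence and lift above one, while a candidate such as {diapers} -> milk is filtered out by the interestingness test: its confidence is no better than the base rate of milk alone.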