Let us now ask how to compute such association rules from the data. There can be very many combinations of features: if there are N features, each taking two values, we immediately have 2^N combinations, and even for moderate values of N that is far too many to explore exhaustively. The key observation leading to an efficient algorithm for association rule mining was made by Agrawal and Srikant in the 1990s: if a combination A and B has support greater than S, that is, it occurs more than S times in the data, then A and B must each individually have support greater than S as well, because otherwise the combination A and B could not occur more than S times. So, to find combinations with high support, one can proceed iteratively, starting with all single features that have support greater than S. This set necessarily contains the ingredients of any pair, triple or quadruple that has high support, so we do not have to check all combinations; we can go step by step.

This is how the algorithm works. We first scan all the records and note which feature values have support greater than S. For example, diapers might have high support, beer might have high support, nuts might have high support, but other items do not. Then we scan only the reduced data and throw away all the rest. Remember, we have discarded a lot of data in this very first step: all records in which none of the high-support items occur can simply be dropped, because they can never contain pairs or triples with high support. We scan this subset of the data for all pairs with high support, discard some more records, scan the reduced data for triples, and so on, until we find no more sets with high support. This considerably reduces the complexity of finding rules and results in an efficient algorithm.

Now think about the complexity of rule mining once we have all the singletons, pairs, triples, and so on with high support. We can check each such set for confidence and interestingness. In practice you would not get that many candidate rules, so one can simply go set by set, compute the confidence, again by counting, and the interestingness for each possible choice of antecedent and consequent from the features in the set, and decide which ones to declare as the rules learned from the data.

Note that this is all just counting: you are only scanning the data again and again and counting things, whether it is to find the sets with high support or, once you have found them, to compute the confidences for the different choices of antecedent and consequent and to compute the interestingness. Because it is just counting, MapReduce-like paradigms are ideal for such tasks.

The association rule mining algorithm we just described is called the Apriori algorithm. It is a very popular algorithm and is used in many settings, not just market basket analysis.
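To make the counting concrete, here is a minimal Python sketch of the level-wise idea, assuming transactions are simply lists of item names. The function names and the representation of itemsets as sorted tuples are illustrative choices, not part of the original presentation; this is a sketch of the Apriori principle, not a tuned implementation.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Find all itemsets occurring in at least min_support transactions."""
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in set(t):
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)

    # Discard items and records that can no longer contribute.
    keep = {s[0] for s in frequent}
    data = [set(t) & keep for t in transactions]
    data = [t for t in data if len(t) >= 2]

    k = 2
    while frequent:
        # Candidate k-itemsets: unions of frequent (k-1)-itemsets,
        # pruned so that every (k-1)-subset is itself frequent.
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = tuple(sorted(set(prev[i]) | set(prev[j])))
                if len(union) == k and all(
                    sub in frequent for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Count candidates by scanning the reduced data.
        counts = {c: 0 for c in candidates}
        for t in data:
            for c in candidates:
                if set(c) <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

def rules_from_itemsets(freq, min_confidence):
    """Enumerate rules antecedent -> consequent and keep the confident ones."""
    rules = []
    for itemset, support in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                ante_support = freq.get(tuple(sorted(antecedent)))
                if ante_support:
                    confidence = support / ante_support
                    if confidence >= min_confidence:
                        consequent = tuple(sorted(set(itemset) - set(antecedent)))
                        rules.append((antecedent, consequent, confidence))
    return rules

baskets = [["milk", "diapers", "beer"], ["diapers", "beer"],
           ["milk", "diapers"], ["milk", "beer"], ["diapers", "beer", "nuts"]]
freq = apriori_frequent_itemsets(baskets, min_support=3)
print(rules_from_itemsets(freq, min_confidence=0.7))
```

Everything here is counting over repeated scans of the data, which is why it parallelizes so naturally in a MapReduce-style setting.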
But there are some problems with it. First of all, it depends on why one is trying to find association rules in the first place. One reason to look for rules is to characterize a class: what constitutes a buyer, and which kinds of words do buyers use? What constitutes an elephant, and which features does an elephant have? What constitutes a positive sentiment, and which words characterize positivity? The trouble is that association rules only work on high support, so small classes that do not enjoy high support, for which there simply are not enough instances, are left out. For such situations one does not normally use association rules; one uses decision trees instead. We will not go into decision trees in detail, but suffice it to say that they are based on mutual information: they compute the mutual information between each feature, or set of features, and the class in question, very much like the calculations we did earlier of the mutual information between different features and the class being decided. Of course, computing mutual information is costly, because it requires the joint probabilities to be known, and it is not really feasible for large numbers of features.

The other reason one might want to learn rules is to figure out some structure in the world: which combinations of features actually imply others, learning structure without presupposing any classes. Association rules are great for such tasks, except that, because they rely on high support, negative rules such as "milk and not diapers implies not beer" are lost. We have no way of representing "not diapers": we can only look for the value diapers, because "not diapers" could be any other value, such as cereal, butter or cola, and there are just too many possible combinations to check. In such situations one does something a little different, and techniques called interesting subgroup discovery are used instead. They are very similar to association rules, except that instead of relying only on high support they look for correlations in the data. One of the earliest papers on this idea was published in 1997 by Sergey Brin and co-authors; Sergey Brin, of course, is one of the founders of Google, and this was before Google.

So learning rules from data is quite possible, and association rules are simple: high-support situations are easily covered, but figuring out low-support situations or negative rules requires more sophisticated techniques.
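To give a flavor of what "looking for correlations" could mean in practice, here is a small sketch, again assuming transactions are sets of items. It builds the 2x2 presence/absence table for a pair of items and computes a chi-squared statistic, so absence ("not diapers") contributes just as much as presence, and negatively correlated pairs can surface. The function name and the chi-squared choice are illustrative, not the exact method of the 1997 paper.

```python
def chi_squared_2x2(transactions, item_a, item_b):
    """Chi-squared statistic for the 2x2 presence/absence table of two items."""
    n = len(transactions)
    a = sum(1 for t in transactions if item_a in t and item_b in t)      # both present
    b = sum(1 for t in transactions if item_a in t and item_b not in t)  # only item_a
    c = sum(1 for t in transactions if item_a not in t and item_b in t)  # only item_b
    d = n - a - b - c                                                    # both absent
    observed = [a, b, c, d]
    # Expected counts if the two items occurred independently.
    pa, pb = (a + b) / n, (a + c) / n
    expected = [n * pa * pb, n * pa * (1 - pb), n * (1 - pa) * pb, n * (1 - pa) * (1 - pb)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

baskets = [{"milk", "diapers"}, {"milk", "beer"}, {"diapers"}, {"milk", "beer"}, {"beer"}]
# A large value signals correlation, positive or negative, regardless of raw support.
print(chi_squared_2x2(baskets, "diapers", "beer"))
```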