We now turn to finding rules from data. Rules are essentially correlations between features: as we've seen earlier, features need not be independent, so finding rules means finding out which features, or which sets of features, are related to each other, or correlated. In some sense we are trying to cluster features rather than cluster the data.

For example, we might want to discover rules which say that if you have "like" and "lot" in a comment, then it's very likely to be positive, and if you have "not" and "like" in a comment, it's very likely to be negative. Similarly, one might like to find a rule which says that searching for flowers means that one is searching for a cheap gift. Or, in the case of the animal example: if an animal is a bird, then it chirps or squeals; if an animal chirps and has two legs, then it is a bird. In the case of people buying items, we might want to discover the interesting fact that those who buy diapers and milk also buy beer. In each case, one is not clustering objects or data items, but clustering the features, and seeing which features co-occur in the data very often. So what one is trying to do is find statistical rules based on the frequency of co-occurrence of features.

In our unified framework, what this means is that we are trying to find regions of X that indicate correlation of features, that is, regions that contain more data items than would be expected if all the features were independent. Think about it again: if the features were completely independent, you wouldn't see many instances of animals that both chirp and squeal, because these features would not systematically co-occur. But if these features are correlated, that is, there actually are birds that chirp and squeal, you will see many such instances, and those regions will be more populated than, say, the region of animals that chirp and have four legs.

So this time, instead of comparing our data to random data, we compare our data to data which has the same features, but where the features are independent. P0 is now the distribution assuming independent features, so it is just the product of the distributions of each feature. Note that these are not uniform distributions; they are simply the marginal distributions of each feature that we actually observe in the data.

We can now conduct our thought experiment again: set y = 1 for all the real data, that is, the data that actually exists, and add y = 0 points, extra points, where each feature is chosen not uniformly at random but from the data. So the probability of choosing "chirps" depends on the number of times "chirps" occurs in the data, independently of any other feature. Similarly, the probability of choosing "four legs" depends only on how many times "four legs" occurs in the data, rather than on any other feature occurring alongside it. So instead of comparing the actual data to random data, one compares it to artificially generated data where each feature is chosen independently.

Now, if we fit f(x), it is again the expected value of Y given X, where Y is one for the real data and zero for these newly added points. This again estimates R/(1 + R), where R = P(x)/P0(x); this time P0(x) is not the random, uniform distribution but the distribution that assumes the features are independent. Again, the extreme regions of f(x) indicate regions that have high support (we will explain what support means in a minute), and these are therefore regions where we can potentially find rules like the ones shown above.
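To make the thought experiment concrete, here is a minimal sketch in Python. The function name, the choice of random forest as the classifier, and the parameter values are illustrative assumptions, not something prescribed in the lecture; the essential steps are building the y = 0 contrast set by sampling each feature independently from its own empirical marginal, and reading f(x) off the classifier's predicted probability:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def contrast_with_independent_features(X, n_fake=None, seed=0):
    """Contrast real data (y = 1) against synthetic data (y = 0) whose
    features are drawn independently from each feature's empirical marginal."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_fake = n_fake or n
    # Sample each feature column on its own: this preserves every marginal
    # distribution but destroys all correlation between features (this is P0).
    X0 = np.column_stack([rng.choice(X[:, j], size=n_fake) for j in range(d)])
    X_all = np.vstack([X, X0])
    y_all = np.concatenate([np.ones(n), np.zeros(n_fake)])
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    clf.fit(X_all, y_all)
    # f(x) = P(y = 1 | x) estimates R / (1 + R) with R = P(x) / P0(x):
    # values near 1 mark regions where features co-occur more often
    # than independence would predict.
    return clf.predict_proba(X)[:, 1]
```

The rows of X where this estimated f(x) is close to one are exactly the over-populated regions, that is, the candidate regions from which rules can be extracted.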
The most popular technique for finding rules in data is called association rule mining, and it works on data consisting of instances that have features. For example, you could have animals that have features, or shopping transactions where the features are the items that people buy. One wants to infer rules of the form a, b, and c implies d, where all four are just features. We don't really know up front which ones belong on the left and which on the right, but we'll decide based on certain principles.

The first principle is that the combination a, b, c, and d has high support. What this means is that the probability of finding the combination a, b, c, d in the data is reasonably high. Typically we choose a support threshold of 20, 30, or 40 percent, which is quite high; the point is that the combination must occur often enough to warrant being considered as a potential rule.

The next principle is that the rule we infer from this combination of four features is one where the confidence is high. All that means is that of all the cases where you have a, b, and c, a large number of them actually have d as well. So now you are considering more instances than just the ones with a, b, c, and d together; you are also considering those with only a, b, and c. But of those that have a, b, and c, a large number also have d. In other words, the conditional probability of d given a, b, and c is high, so there is high confidence that a, b, and c result in d actually occurring.

Lastly, the rule should be interesting, in the sense that the confidence that d occurs given a, b, and c is significantly higher than the probability of d occurring just by itself. For example, if d always occurred in the data, say everybody always bought milk, then any rule you came up with, with whatever confidence, would not be interesting, because the probability of d given a, b, and c would be the same as the probability of d. On the other hand, if one found that those who bought diapers also bought beer more often than beer buyers in general, that is, the propensity to buy beer is higher if the person also buys diapers, then that is interesting, because it tells us something about how one might want to place items on the shelves in a store.

This is actually the classical example of a correlation between beer and diapers that a large retail chain found way back in the 80s. It sparked all the interest in what is called market basket analysis, and it resulted in algorithms for association rule mining. But association rule mining is useful for more than just transactions. If one thinks about objects in the real world, one might come to conclusions like "birds chirp", "squirrels squeal", or "lions roar", which is quite interesting, since these are rules we consider to be common sense. A technique like association rule mining might actually allow us to discover such rules among the features, beyond just knowing that features are correlated with each other.
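To make the three principles concrete, here is a minimal brute-force sketch in Python. The function name and thresholds are illustrative assumptions; practical systems use algorithms such as Apriori or FP-Growth rather than enumerating all itemsets:

```python
from itertools import combinations

def mine_rules(transactions, min_support=0.2, min_confidence=0.6, min_lift=1.1):
    """Brute-force rules of the form antecedent -> single item,
    checking support, confidence, and interestingness (lift)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def support(itemset):
        # Fraction of transactions containing every item in itemset.
        return sum(itemset <= t for t in transactions) / n

    rules = []
    for k in (2, 3, 4):  # itemsets like {a, b, c, d}
        for itemset in combinations(items, k):
            s = support(set(itemset))
            if s < min_support:              # principle 1: high support
                continue
            for d in itemset:                # try each item as the consequent
                lhs = set(itemset) - {d}
                conf = s / support(lhs)      # principle 2: P(d | a, b, c) high
                lift = conf / support({d})   # principle 3: beats P(d) alone
                if conf >= min_confidence and lift >= min_lift:
                    rules.append((lhs, d, s, conf, lift))
    return rules

baskets = [{"diapers", "milk", "beer"}, {"diapers", "beer"},
           {"milk", "bread"}, {"diapers", "milk", "beer", "bread"}]
for lhs, rhs, s, conf, lift in mine_rules(baskets):
    print(f"{sorted(lhs)} -> {rhs}  support={s:.2f} conf={conf:.2f} lift={lift:.2f}")
```

On these toy baskets the sketch recovers a rule like {diapers} -> beer with high confidence and lift above one, while a candidate such as {diapers} -> milk is filtered out by the interestingness test: its confidence is no better than the base rate of milk alone.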