Let's now recap our unified formal framework, which handles classification, clustering, and rule mining under the same umbrella, and see what it has to do with big data, and whether it gives any additional insight, as we have alluded to a couple of times in the past few minutes.

Recall that we defined a function f(x), where x is a set of features taking values in a large space, as the expected value of the output y, which could be zero or one for classification, on an appropriate data set. Importantly, we defined f not only on the original data but also on modified data. For the classification case, the output variable is zero or one, and the problem becomes estimating this function, or at least estimating where its value is less than a half or more than a half, which gives the decision boundary. For clustering, we added random data to the data set. For rule mining, we added not purely random data but independent data, that is, data in which all the features are independent of each other. Our problem then became that of finding regions where f is large on this new data set.

Now, suppose we really have big data, meaning lots and lots of examples. It is long data, in the sense that there are many examples, but the number of features is not too large, so it is not wide data. This is typical of real-world data outside the web world: with words, images, and videos the number of features can be huge, but in things like transactions we have the small number of features typically found in traditional data sets. In such situations the data is very long, and if you store lots and lots of transactions you end up with very big data sets.

Problem A now reduces to simply querying the data: for a particular combination of x, what is the expected value of y? If you remember, there was a question in an earlier problem set where I asked whether, if we had lots of data, we could simply estimate the joint probability directly. Indeed, if you really have enough data that for every possible combination of x you can compute this expected value with some degree of accuracy, then finding which class an instance belongs to is simply a matter of querying: for that particular combination, how many positive and how many negative instances have you seen, and then deciding based on the expected value.

Problem B, on the other hand, reduces to finding regions of high support. Let's see how that happens. Suppose we have added new data to our data set in this manner and computed the value of f for every data point. We keep the data points with high values of f and discard the rest. Once we have only the high-f points, the question is: what regions characterize these high values of f? Are there particular combinations of features that are typical of the high-f regions? These would be our interesting rules or interesting clusters. Again, all we need to do is find high-support regions, in the sense of asking which combinations of features have high support among the points with high values of f. Of course, we still have the problem of dealing with negative rules even with this high-support technique, so it does not solve the entire problem.
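As a concrete illustration of problem A reducing to counting, here is a minimal sketch (my own, not the lecture's code) that estimates f(x) = E[y | x] for a feature combination simply by tallying positive and negative examples, and classifies by comparing the estimate with a half. The function names and the toy data are assumptions for illustration only.

```python
from collections import defaultdict

def fit_counts(records):
    """records: iterable of (features_tuple, label) pairs with label in {0, 1}."""
    pos = defaultdict(int)   # positive examples seen per feature combination
    tot = defaultdict(int)   # total examples seen per feature combination
    for x, y in records:
        tot[x] += 1
        pos[x] += y
    return pos, tot

def f_hat(x, pos, tot):
    """Estimate f(x) = E[y | x] by direct counting; None if x was never observed."""
    if tot[x] == 0:
        return None
    return pos[x] / tot[x]

def predict(x, pos, tot):
    """Classify by checking whether the estimated f(x) is above a half."""
    p = f_hat(x, pos, tot)
    return None if p is None else int(p > 0.5)

# Toy "long but not wide" data: few features, (in practice) very many rows.
data = [(("urban", "young"), 1), (("urban", "young"), 1),
        (("urban", "young"), 0), (("urban", "old"), 0),
        (("rural", "old"), 0)]
pos, tot = fit_counts(data)
print(f_hat(("urban", "young"), pos, tot))    # 0.666... -> above a half
print(predict(("urban", "young"), pos, tot))  # 1
```

With genuinely long data, every feature combination is seen often enough that this direct count is a reliable estimate of the expected value, which is exactly the point being made here.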
Nevertheless, the high-support approach is certainly a route toward solving the problem. Finally, remember that even though in principle we added random data for clustering and independent data for rule mining, that was a thought experiment; we don't actually add any data. We can compute the value of f by taking the actual counts and dividing by the probability of that particular combination occurring, assuming a uniform density of data, or by the probability under the independence assumption, by direct calculation rather than by adding random data. I hope that is clear: you don't actually have to add data. It is a way of imagining what is really going on, but in practice we simply count the real data and divide by the appropriate quantity. In the case of the independent baseline P0, you are effectively computing the information gain between features when you do this, using that value for f, filtering the data based on the value of f, and then finding regions of high support.

Note that we are talking about finding high-support regions, not just high-support combinations of features as in the association rule mining case. Finding regions of high support is a little more difficult and is often referred to as bump hunting. We won't go into it in more detail here, but the unified formulation we have used is particularly useful in tackling bump hunting problems.

The important point is that such techniques, where we simply query the data or find high-support regions, perhaps not in the original data but in data filtered by high values of f, are in the end just counting. Techniques like MapReduce, or Dremel for querying very large data sets, work by brute force. So big data, in the sense of having lots of data, can actually be handled with fairly simple techniques of just counting. This is something we need to understand: big data can change the way one does statistical analysis, because counting now works much better than it did when one didn't have enough data. Of course, wide data, where the number of features is very large, that is, high-dimensional data, is still a problem, as we shall see very soon.
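The following sketch, again my own illustration rather than anything from the lecture, shows the thought experiment done by direct calculation: count each observed feature combination, divide by the count expected under the independent baseline P0 (the product of the marginal probabilities), and keep the combinations whose ratio, playing the role of f, is high. The thresholds, names, and toy data are assumptions, and a full treatment would look for high-support regions (bump hunting) rather than individual combinations.

```python
from collections import Counter

def high_f_combinations(rows, min_ratio=1.5, min_support=2):
    """rows: list of tuples of feature values (one tuple per record)."""
    n = len(rows)
    combo_counts = Counter(rows)
    # Marginal counts per feature position, used for the independence baseline P0.
    marginals = [Counter(r[i] for r in rows) for i in range(len(rows[0]))]
    kept = {}
    for combo, count in combo_counts.items():
        if count < min_support:
            continue
        # Count expected if all features were independent of each other.
        expected = n
        for i, value in enumerate(combo):
            expected *= marginals[i][value] / n
        ratio = count / expected   # plays the role of f against the P0 baseline
        if ratio >= min_ratio:
            kept[combo] = ratio
    return kept

# Toy transactions: two combinations co-occur far more often than independence predicts.
rows = [("milk", "bread")] * 4 + [("tea", "eggs")] * 4 + [("milk", "eggs")]
print(high_f_combinations(rows))   # roughly {('milk', 'bread'): 1.8, ('tea', 'eggs'): 1.8}
```

No random or independent data is ever generated; the baseline enters only as the divisor, which is the point of computing f by direct calculation rather than by actually augmenting the data set.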