Let's now recap our unified formal framework, which handles classification, clustering, and rule mining under the same umbrella, and see what it has to do with big data, and whether it gives any additional insight, as we have alluded to a couple of times in the past few minutes.

Recall that we defined a function f(x), where x is a set of features taking values in a large space, as the expected value of the output y, which could be zero or one for classification, on an appropriate data set. Importantly, we defined f not only on the original data but also on modified data. For the classification case, the output variable is zero or one, and the problem becomes estimating this function, or at least estimating where its value is less than a half or more than a half, which gives the decision boundary. For clustering, we added random data to the data set. For rule mining, we added not purely random data but independent data, that is, data in which all the features are independent of each other. Our problem then became that of finding regions where f is large on this new data set.

Now, suppose we really have big data, meaning lots and lots of examples. It is long data, in the sense that there are many examples, but the number of features is not too large, so it is not wide data. This is typical of real-world data outside the web world: with words, images, and videos the number of features can be huge, but in things like transactions we have the small number of features typically found in traditional data sets. In such situations the data is very long, and if you store lots and lots of transactions you end up with very big data sets.

Problem A now reduces to simply querying the data: for a particular combination of x, what is the expected value of y? If you remember, there was a question in an earlier problem set where I asked whether, if we had lots of data, we could simply estimate the joint probability directly. Indeed, if you really have enough data that for every possible combination of x you can compute this expected value with some degree of accuracy, then finding which class an instance belongs to is simply a matter of querying: for that particular combination, how many positive and how many negative instances have you seen, and then deciding based on the expected value.

Problem B, on the other hand, reduces to finding regions of high support. Let's see how that happens. Suppose we have added new data to our data set in this manner and computed the value of f for every data point. We keep the data points with high values of f and discard the rest. Once we have only the high-f points, the question is: what regions characterize these high values of f? Are there particular combinations of features that are typical of the high-f regions? These would be our interesting rules or interesting clusters. Again, all we need to do is find high-support regions, in the sense of asking which combinations of features have high support among the points with high values of f. Of course, we still have the problem of dealing with negative rules even with this high-support technique, so it does not solve the entire problem.
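As a concrete illustration of problem A reducing to counting, here is a minimal sketch (my own, not the lecture's code) that estimates f(x) = E[y | x] for a feature combination simply by tallying positive and negative examples, and classifies by comparing the estimate with a half. The function names and the toy data are assumptions for illustration only.

```python
from collections import defaultdict

def fit_counts(records):
    """records: iterable of (features_tuple, label) pairs with label in {0, 1}."""
    pos = defaultdict(int)   # positive examples seen per feature combination
    tot = defaultdict(int)   # total examples seen per feature combination
    for x, y in records:
        tot[x] += 1
        pos[x] += y
    return pos, tot

def f_hat(x, pos, tot):
    """Estimate f(x) = E[y | x] by direct counting; None if x was never observed."""
    if tot[x] == 0:
        return None
    return pos[x] / tot[x]

def predict(x, pos, tot):
    """Classify by checking whether the estimated f(x) is above a half."""
    p = f_hat(x, pos, tot)
    return None if p is None else int(p > 0.5)

# Toy "long but not wide" data: few features, (in practice) very many rows.
data = [(("urban", "young"), 1), (("urban", "young"), 1),
        (("urban", "young"), 0), (("urban", "old"), 0),
        (("rural", "old"), 0)]
pos, tot = fit_counts(data)
print(f_hat(("urban", "young"), pos, tot))    # 0.666... -> above a half
print(predict(("urban", "young"), pos, tot))  # 1
```

With genuinely long data, every feature combination is seen often enough that this direct count is a reliable estimate of the expected value, which is exactly the point being made here.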
Nevertheless, the high-support approach is certainly a route toward solving the problem. Finally, remember that even though in principle we added random data for clustering and independent data for rule mining, that was a thought experiment; we don't actually add any data. We can compute the value of f by taking the actual counts and dividing by the probability of that particular combination occurring, assuming a uniform density of data, or by the probability under the independence assumption, by direct calculation rather than by adding random data. I hope that is clear: you don't actually have to add data. It is a way of imagining what is really going on, but in practice we simply count the real data and divide by the appropriate quantity. In the case of the independent baseline P0, you are effectively computing the information gain between features when you do this, using that value for f, filtering the data based on the value of f, and then finding regions of high support.

Note that we are talking about finding high-support regions, not just high-support combinations of features as in the association rule mining case. Finding regions of high support is a little more difficult and is often referred to as bump hunting. We won't go into it in more detail here, but the unified formulation we have used is particularly useful in tackling bump hunting problems.

The important point is that such techniques, where we simply query the data or find high-support regions, perhaps not in the original data but in data filtered by high values of f, are in the end just counting. Techniques like MapReduce, or Dremel for querying very large data sets, work by brute force. So big data, in the sense of having lots of data, can actually be handled with fairly simple techniques of just counting. This is something we need to understand: big data can change the way one does statistical analysis, because counting now works much better than it did when one didn't have enough data. Of course, wide data, where the number of features is very large, that is, high-dimensional data, is still a problem, as we shall see very soon.
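The following sketch, again my own illustration rather than anything from the lecture, shows the thought experiment done by direct calculation: count each observed feature combination, divide by the count expected under the independent baseline P0 (the product of the marginal probabilities), and keep the combinations whose ratio, playing the role of f, is high. The thresholds, names, and toy data are assumptions, and a full treatment would look for high-support regions (bump hunting) rather than individual combinations.

```python
from collections import Counter

def high_f_combinations(rows, min_ratio=1.5, min_support=2):
    """rows: list of tuples of feature values (one tuple per record)."""
    n = len(rows)
    combo_counts = Counter(rows)
    # Marginal counts per feature position, used for the independence baseline P0.
    marginals = [Counter(r[i] for r in rows) for i in range(len(rows[0]))]
    kept = {}
    for combo, count in combo_counts.items():
        if count < min_support:
            continue
        # Count expected if all features were independent of each other.
        expected = n
        for i, value in enumerate(combo):
            expected *= marginals[i][value] / n
        ratio = count / expected   # plays the role of f against the P0 baseline
        if ratio >= min_ratio:
            kept[combo] = ratio
    return kept

# Toy transactions: two combinations co-occur far more often than independence predicts.
rows = [("milk", "bread")] * 4 + [("tea", "eggs")] * 4 + [("milk", "eggs")]
print(high_f_combinations(rows))   # roughly {('milk', 'bread'): 1.8, ('tea', 'eggs'): 1.8}
```

No random or independent data is ever generated; the baseline enters only as the divisor, which is the point of computing f by direct calculation rather than by actually augmenting the data set.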