Let's recap now our unified formal framework for dealing with classification, clustering, and rule mining, all under the same umbrella, and see what it has to do with big data, and whether it gives any additional insight, as we have alluded to a couple of times in the past few minutes.

Recall that we defined this function f of x, where x is the set of features, which can take various values in a large space, as the expected value of the output, which could be zero or one for classification, and so on, for appropriate data sets. So we define this not on the original data but on other types of data. For example, for the classification case, we had the output variable take the values zero and one, and the problem becomes estimating this function, or estimating where the values of this function are less than a half or more than a half, to decide the decision boundary.

Then we added random data, which means we added more data to the data set, to deal with clustering. In the case of rule mining, we added independent data: not random data, but data where all the features were independent of each other. And then our problem became that of finding regions where this function f is large on this new data set.

Now, suppose we really have big data, meaning lots and lots of examples. It's long data, in the sense that there are lots of examples, but the number of features isn't too large, so it's not wide data. This is typical of real-world data outside of the web world, where you have words, and even images and videos, where the number of features can be huge. But in things like transactions, there is only a small number of features, which is what you typically find in traditional data sets. This is a typical situation where the data is very long, and if you start storing lots and lots of transactions, you can end up with big data sets.

So, problem A now reduces to just querying the data and figuring out, for a particular combination of x, what is the expected value of y? And if you remember, there was a question in an earlier homework or problem set where I asked whether, if we had lots of data, we could simply estimate the joint probability directly. And indeed, if you really have enough data for every possible combination of x, you can compute this expected value with some degree of accuracy, because you have enough data.
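To make that concrete, here is a minimal sketch, in Python with pandas, of estimating f(x) = E[y | x] by plain counting on a long, narrow data set. The tiny data frame and its column names are invented purely for illustration; the point is that the estimate is just a group-by average, and classification is a lookup against the one-half threshold.

```python
import pandas as pd

# Hypothetical "long" data set: many rows, few categorical features, binary label y.
data = pd.DataFrame({
    "feature_a": ["u", "u", "v", "v", "u", "v", "u", "v"],
    "feature_b": ["p", "q", "p", "q", "p", "p", "q", "q"],
    "y":         [1,   0,   1,   1,   0,   1,   0,   0],
})

# f(x) = E[y | x]: with enough rows per feature combination, just average y per group.
f_hat = data.groupby(["feature_a", "feature_b"])["y"].mean()

# Classification is then a lookup: predict class 1 where the estimate exceeds 1/2
# (this assumes the combination was actually seen in the data).
def predict(feature_a, feature_b):
    return int(f_hat.loc[(feature_a, feature_b)] > 0.5)

print(f_hat)
print(predict("u", "p"))
```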
Then, finding out which class an instance belongs to is simply a matter of querying: figuring out, for that particular combination, how many positive instances you have seen and how many negative, and deciding based on the expected value.

Problem B, on the other hand, reduces to finding regions of high support. Let's see how that happens. Suppose we have added new data to our data set in this manner, and computed the value of f for every data point. Now, data points which have high values of f we keep, and those that don't, we discard. Let's look at it that way. Once we have only points which have high values of f, the question is: what regions characterize these high values of f? Are there particular combinations of features which we can say are typical of the high regions? These would be our interesting rules or interesting clusters. And again, all we need to do is find high-support regions, in the sense of asking which combinations of features have high support among those which have a high value of f. Of course, we still have the problem of dealing with negative rules even in this high-support technique, so that doesn't solve the entire problem, but it certainly is a route to solving it.

Finally, remember that even though, in principle, we added random data for clustering and for rule mining, it was a thought experiment, so we don't actually add data. We can compute the value of f by dividing by the probability of that particular combination occurring, assuming a uniform density of data, or by the particular probability assuming independent data, by direct calculation rather than by adding random sets of data.

I hope that's clear: you don't actually have to add data. It's a way of imagining what's really going on, but we simply count the actual, real data and divide by the appropriate quantity. In the case of, say, the independence assumption, that is, P0 being the independent baseline, you're actually computing the information gain between features when you do this.

And using that value for f, we filter the data based on the value of f, and then find regions of high support. Note here that we're talking about finding high-support regions and not just high-support combinations of features, as in the association rule mining case. Finding regions of high support is a little bit more difficult and is often referred to as bump hunting.
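As a sketch of the rule-mining side of this, the snippet below compares the observed probability of each feature combination against the independence baseline P0 built from the marginals, and keeps combinations where this ratio f and the support are both large. The toy transactions, the column names, and the two thresholds are all made up for illustration; it assumes pandas is available.

```python
import pandas as pd

# Hypothetical transactional data: a few categorical features, many rows.
data = pd.DataFrame({
    "item_a": ["bread", "bread", "milk", "bread", "milk", "bread"],
    "item_b": ["butter", "butter", "eggs", "eggs", "eggs", "butter"],
})

n = len(data)

# Observed probability of each feature combination (its support).
joint = data.groupby(["item_a", "item_b"]).size() / n

# Independence baseline P0: product of the marginal probabilities,
# computed directly rather than by adding independently sampled data.
p_a = data["item_a"].value_counts() / n
p_b = data["item_b"].value_counts() / n
p0 = pd.Series(
    [p_a[a] * p_b[b] for a, b in joint.index],
    index=joint.index,
)

# f compares what we saw against what independence would predict.
f = joint / p0

# Keep combinations where f is large AND the support is non-trivial.
interesting = joint[(f > 1.2) & (joint > 0.2)]
print(f)
print(interesting)
```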
We won't go into this in more detail here, but the unified formulation that we have used is particularly useful in tackling bump hunting problems.

The important point is that such techniques, where we simply query the data or find high-support regions, maybe not in the original data but in some slightly modified data that we filter by high f, are in the end just counting. So techniques like MapReduce, or, for querying very large data sets, Dremel, work by brute force. And so big data, where there is a lot of data, can actually be handled with fairly simple techniques of just counting.

And this is something we need to understand: big data can change the way one does statistical analysis, because counting now seems to work much better than it did when one didn't have enough data. Of course, wide data, that is, when the number of features is very large, or high-dimensional data, is still a problem, as we shall see very soon.
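Just to illustrate why this kind of counting parallelizes so naturally, here is a toy map/reduce-style sketch in Python: each partition of a long data set counts its own (feature combination, label) pairs, the partial counts are merged by addition, and E[y | x] falls out of the merged totals. The partitions and records are invented; this is not MapReduce or Dremel itself, just the counting pattern such systems exploit.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions of a very long data set: (feature combination, label) records.
partitions = [
    [(("u", "p"), 1), (("u", "p"), 0), (("v", "q"), 1)],
    [(("u", "p"), 1), (("v", "q"), 1), (("v", "q"), 0)],
]

# "Map": each partition independently counts its (combination, label) pairs.
def count_partition(records):
    return Counter(records)

# "Reduce": partial counts are merged by addition; that is all the aggregation needed.
total = reduce(lambda a, b: a + b, (count_partition(p) for p in partitions))

# E[y | x] again falls out of the merged counts.
for combo in {key for key, _ in total}:
    pos, neg = total[(combo, 1)], total[(combo, 0)]
    print(combo, pos / (pos + neg))
```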