Let's recap now our unified formal framework for dealing with classification, clustering, and rule mining, all under the same umbrella, and see what it has to do with big data, and whether it gives any additional insight, as we have alluded to a couple of times in the past few minutes.

Recall that we defined this function f of x, where x is the set of features, which can take various values in a large space, as the expected value of the output, which could be zero or one for classification, and so on, for appropriate data sets. So we define this not on the original data but on other types of data. For example, for the classification case, we had the output variable take the values zero and one, and the problem becomes estimating this function, or estimating where the values of this function are less than a half or more than a half, to decide the decision boundary.

Then we added random data, which means we added more data to the data set, to deal with clustering. In the case of rule mining, we added independent data: not random data, but data where all the features were independent of each other. And then our problem became that of finding regions where this function f is large on this new data set.

Now, suppose we really have big data, meaning lots and lots of examples. It's long data, in the sense that there are lots of examples, but the number of features isn't too large, so it's not wide data. This is typical of real-world data outside of the web world, where you have words, and even images and videos, where the number of features can be huge. But in things like transactions, there is only a small number of features, which is what you typically find in traditional data sets. This is a typical situation where the data is very long, and if you start storing lots and lots of transactions, you can end up with big data sets.

So, problem A now reduces to just querying the data and figuring out, for a particular combination of x, what is the expected value of y? And if you remember, there was a question in an earlier homework or problem set where I asked whether, if we had lots of data, we could simply estimate the joint probability directly. And indeed, if you really have enough data for every possible combination of x, you can compute this expected value with some degree of accuracy, because you have enough data.
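To make that concrete, here is a minimal sketch, in Python with pandas, of estimating f(x) = E[y | x] by plain counting on a long, narrow data set. The tiny data frame and its column names are invented purely for illustration; the point is that the estimate is just a group-by average, and classification is a lookup against the one-half threshold.

```python
import pandas as pd

# Hypothetical "long" data set: many rows, few categorical features, binary label y.
data = pd.DataFrame({
    "feature_a": ["u", "u", "v", "v", "u", "v", "u", "v"],
    "feature_b": ["p", "q", "p", "q", "p", "p", "q", "q"],
    "y":         [1,   0,   1,   1,   0,   1,   0,   0],
})

# f(x) = E[y | x]: with enough rows per feature combination, just average y per group.
f_hat = data.groupby(["feature_a", "feature_b"])["y"].mean()

# Classification is then a lookup: predict class 1 where the estimate exceeds 1/2
# (this assumes the combination was actually seen in the data).
def predict(feature_a, feature_b):
    return int(f_hat.loc[(feature_a, feature_b)] > 0.5)

print(f_hat)
print(predict("u", "p"))
```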
Then, finding out which class an instance belongs to is simply a matter of querying: figuring out, for that particular combination, how many positive instances you have seen and how many negative, and deciding based on the expected value.

Problem B, on the other hand, reduces to finding regions of high support. Let's see how that happens. Suppose we have added new data to our data set in this manner, and computed the value of f for every data point. Now, data points which have high values of f we keep, and those that don't, we discard. Let's look at it that way. Once we have only points which have high values of f, the question is: what regions characterize these high values of f? Are there particular combinations of features which we can say are typical of the high regions? These would be our interesting rules or interesting clusters. And again, all we need to do is find high-support regions, in the sense of asking which combinations of features have high support among those which have a high value of f. Of course, we still have the problem of dealing with negative rules even in this high-support technique, so that doesn't solve the entire problem, but it certainly is a route to solving it.

Finally, remember that even though, in principle, we added random data for clustering and for rule mining, it was a thought experiment, so we don't actually add data. We can compute the value of f by dividing by the probability of that particular combination occurring, assuming a uniform density of data, or by the particular probability assuming independent data, by direct calculation rather than by adding random sets of data.

I hope that's clear: you don't actually have to add data. It's a way of imagining what's really going on, but we simply count the actual, real data and divide by the appropriate quantity. In the case of, say, the independence assumption, that is, P0 being the independent baseline, you're actually computing the information gain between features when you do this.

And using that value for f, we filter the data based on the value of f, and then find regions of high support. Note here that we're talking about finding high-support regions and not just high-support combinations of features, as in the association rule mining case. Finding regions of high support is a little bit more difficult and is often referred to as bump hunting.
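As a sketch of the rule-mining side of this, the snippet below compares the observed probability of each feature combination against the independence baseline P0 built from the marginals, and keeps combinations where this ratio f and the support are both large. The toy transactions, the column names, and the two thresholds are all made up for illustration; it assumes pandas is available.

```python
import pandas as pd

# Hypothetical transactional data: a few categorical features, many rows.
data = pd.DataFrame({
    "item_a": ["bread", "bread", "milk", "bread", "milk", "bread"],
    "item_b": ["butter", "butter", "eggs", "eggs", "eggs", "butter"],
})

n = len(data)

# Observed probability of each feature combination (its support).
joint = data.groupby(["item_a", "item_b"]).size() / n

# Independence baseline P0: product of the marginal probabilities,
# computed directly rather than by adding independently sampled data.
p_a = data["item_a"].value_counts() / n
p_b = data["item_b"].value_counts() / n
p0 = pd.Series(
    [p_a[a] * p_b[b] for a, b in joint.index],
    index=joint.index,
)

# f compares what we saw against what independence would predict.
f = joint / p0

# Keep combinations where f is large AND the support is non-trivial.
interesting = joint[(f > 1.2) & (joint > 0.2)]
print(f)
print(interesting)
```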
We won't go into this in more detail here, but the unified formulation that we have used is particularly useful in tackling bump hunting problems.

The important point is that such techniques, where we simply query the data or find high-support regions, maybe not in the original data but in some slightly modified data that we filter by high f, are in the end just counting. So techniques like MapReduce, or, for querying very large data sets, Dremel, work by brute force. And so big data, where there is a lot of data, can actually be handled with fairly simple techniques of just counting.

And this is something we need to understand: big data can change the way one does statistical analysis, because counting now seems to work much better than it did when one didn't have enough data. Of course, wide data, that is, when the number of features is very large, or high-dimensional data, is still a problem, as we shall see very soon.
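Just to illustrate why this kind of counting parallelizes so naturally, here is a toy map/reduce-style sketch in Python: each partition of a long data set counts its own (feature combination, label) pairs, the partial counts are merged by addition, and E[y | x] falls out of the merged totals. The partitions and records are invented; this is not MapReduce or Dremel itself, just the counting pattern such systems exploit.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions of a very long data set: (feature combination, label) records.
partitions = [
    [(("u", "p"), 1), (("u", "p"), 0), (("v", "q"), 1)],
    [(("u", "p"), 1), (("v", "q"), 1), (("v", "q"), 0)],
]

# "Map": each partition independently counts its (combination, label) pairs.
def count_partition(records):
    return Counter(records)

# "Reduce": partial counts are merged by addition; that is all the aggregation needed.
total = reduce(lambda a, b: a + b, (count_partition(p) for p in partitions))

# E[y | x] again falls out of the merged counts.
for combo in {key for key, _ in total}:
    pos, neg = total[(combo, 1)], total[(combo, 0)]
    print(combo, pos / (pos + neg))
```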