We now turn to finding rules from data. Rules are essentially correlations between different features. As we've seen earlier, features need not be independent, so finding rules is essentially finding out which features, or which sets of features, are related to each other, or correlated. In some sense we are trying to cluster features rather than cluster the data.

For example, we might want to discover rules which say that if you have "like" and "lot" in a comment, then it's very likely to be positive, and if you have "not" and "like" in a comment, it's very likely to be negative. Similarly, one might like to find a rule which says that searching for "flowers" means that one is searching for a cheap gift. Or, in the case of the animal example: if an animal is a bird, then it chirps or squeals; if an animal chirps and has two legs, then it's a bird. In the case of people buying items, we might want to discover the interesting fact that those who buy diapers and milk also buy beer.

In each case, what one is trying to do is not look at clustering of objects or data items, but at clustering the features, and seeing which features co-occur in the data very often. So what one is trying to do is find statistical rules based on the frequency of co-occurrence of features.

In our unified framework, what this means is that we're trying to find regions of X that indicate correlation of features, that is, regions that have more data items than would be expected if all the features were independent. Think about it again: if features were completely independent, you wouldn't see that many instances of birds chirping or squealing, because those features wouldn't necessarily co-occur. But if the features are correlated, that is, there actually are birds that chirp and squeal, you'll see many such instances, and those regions will be more populated than, say, the region of animals that chirp and have four legs.

So this time, instead of comparing our data to random data, we're comparing our data to data which has the same features, but where the features are independent. P0 is now the distribution assuming independent features, so it's just the product of the distributions of each feature. These are not uniform distributions; they are the independent marginal distributions of every feature in the data that we actually observe.
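To make "more data items than would be expected under independence" concrete, here is a minimal sketch that compares the observed frequency of a feature combination against the frequency predicted by the product of the marginal feature frequencies (the P0 just described). The animal records and feature names are made up purely for illustration.

```python
# Toy, made-up binary feature data: each record lists the features an animal has.
records = [
    {"bird", "chirps", "two_legs"},
    {"bird", "chirps", "squeals", "two_legs"},
    {"bird", "squeals", "two_legs"},
    {"dog", "barks", "four_legs"},
    {"cat", "meows", "four_legs"},
    {"bird", "chirps", "two_legs"},
]
n = len(records)

def marginal(feature):
    """Fraction of records containing a single feature (its marginal frequency)."""
    return sum(feature in r for r in records) / n

def observed(features):
    """Fraction of records containing all of the given features together."""
    return sum(set(features) <= r for r in records) / n

for combo in [("bird", "chirps"), ("chirps", "four_legs")]:
    expected = 1.0
    for f in combo:
        expected *= marginal(f)  # frequency predicted if the features were independent (P0)
    print(combo, "observed:", round(observed(combo), 3),
          "expected under independence:", round(expected, 3))
```

On this toy data, "bird" and "chirps" co-occur more often than independence predicts, while "chirps" and "four_legs" co-occur less often; regions of the first kind are where rules can be found.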
So we conduct our thought experiment again and set y equal to one for all the real data, that is, the data that actually exists, and add y equal to zero points, that is, extra points, picked so that each feature is chosen not uniformly at random, but independently from the data. So the probability of choosing "chirps" depends on the number of times "chirps" occurs in the data, independently of any other feature. Similarly, the probability of choosing "four legs" depends only on how many times "four legs" occurs in the data, rather than on any other feature occurring alongside it. So instead of comparing the actual data to random data, one compares it to artificially generated data where each feature is chosen independently.

Now we again choose f(x) to be the expected value of Y given X, where Y is one for the real data and zero for these newly added points. This again estimates R/(1 + R), where R is P(x)/P0(x). This time P0(x) is not the uniform random distribution, but the distribution that assumes the features are independent. Again, the extreme regions of f(x) indicate regions that have high support (we'll explain what support means in a minute), and they are therefore regions where we can potentially find rules like the ones shown above.
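Here is a minimal sketch of that thought experiment, assuming a small binary feature matrix and using scikit-learn's logistic regression as the classifier (any probabilistic classifier that estimates E[Y|X] would do); the data is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy, made-up binary feature matrix: rows are animals, columns are features
# (say: chirps, two_legs, four_legs).
X_real = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
])
n, d = X_real.shape

# Synthetic y = 0 data: each feature is drawn independently from its own
# marginal frequency in the real data, i.e. a sample from P0.
marginals = X_real.mean(axis=0)
X_indep = (rng.random((n, d)) < marginals).astype(int)

X = np.vstack([X_real, X_indep])
y = np.array([1] * n + [0] * n)  # 1 = real data, 0 = independent-feature data

clf = LogisticRegression().fit(X, y)

# The predicted probability of class 1 estimates f(x) = E[Y | X = x] = R / (1 + R),
# so the ratio R = P(x) / P0(x) can be recovered as f / (1 - f).
f = clf.predict_proba(X_real)[:, 1]
R = f / (1 - f)
print("estimated density ratio R for each real record:", np.round(R, 2))
```

Records whose estimated ratio R is well above 1 lie in regions that are over-represented relative to the independent-feature baseline, which is exactly where we will look for rules.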
Let's see how such rules are found. The most popular technique for finding rules in data is called association rule mining, and it works on data consisting of instances that have features. For example, you could have animals that have features, or shopping transactions where the features are the items that people buy. One wants to infer rules of the form "a, b, and c implies d", where all four are just features. We don't really know up front which ones belong on the left and which on the right; we decide based on certain principles.

The first principle is that the combination a, b, c, and d has high support. What this means is that the probability of finding the combination a, b, c, d in the data is reasonably high. Typically we choose a support threshold of 20, 30, or 40%, which are considered extremely high; the point is that the combination occurs often enough to warrant being considered as a potential rule.

The next principle is that the rule one infers from this combination of four features is one where the confidence is high. All that means is that, of all the cases where you have a, b, and c, a large number of them also have d. So now you are considering more instances than just the ones containing a, b, c, and d: you are also considering those that have only a, b, and c, but of those, a large number also have d. So the conditional probability of d given a, b, and c is high, and there is high confidence that a, b, and c result in d actually occurring.

And lastly, the rule should be interesting, in the sense that the confidence that d occurs given a, b, and c is significantly higher than the probability of d occurring just by itself. So, for example, if d always occurred in the data, that is, everybody always bought, say, milk, then any rule you came up with, with whatever confidence, would not be interesting, because the probability of d given a, b, and c would be the same as the probability of d. On the other hand, suppose one found that those who bought diapers also bought beer more often than people bought beer in general. That means the propensity to buy beer is higher if the person also buys diapers. Now, that's interesting, because it tells us something about how one might want to place items on the shelves in a store. This is actually the classical example of a correlation between beer and diapers that a large retail chain found way back in the 80s, and it sparked all the interest in what is called market basket analysis, resulting in algorithms for association rule mining.

But association rule mining is about more than just transactions. If one thinks about objects in the real world, one might come to conclusions like "birds chirp", "squirrels squeal", or "lions roar", which is quite interesting, since these are rules we consider to be common sense. A technique like association rule mining might actually allow us to discover such rules amongst the features, beyond just knowing that features are correlated with each other.
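To tie the three principles together, here is a minimal sketch that scores a hypothetical candidate rule {diapers, milk} implies beer on a small made-up transaction list, computing its support, its confidence, and the ratio of confidence to the baseline probability of the consequent (commonly called lift), which captures the interestingness condition above.

```python
# Toy, made-up market-basket transactions.
transactions = [
    {"diapers", "milk", "beer"},
    {"diapers", "milk", "beer", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"chips", "bread"},
    {"diapers", "beer"},
]
n = len(transactions)

def freq(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

antecedent = {"diapers", "milk"}   # the a, b, c side of the rule
consequent = {"beer"}              # the d side of the rule

support = freq(antecedent | consequent)               # how often the whole combination occurs
confidence = support / freq(antecedent)               # P(d | a, b, c)
lift = confidence / freq(consequent)                  # confidence compared with P(d) alone

print(f"support    = {support:.2f}")
print(f"confidence = {confidence:.2f}")
print(f"lift       = {lift:.2f}")  # > 1 means the rule is 'interesting'
```

A candidate rule is kept only if its support and confidence clear the chosen thresholds and its lift is meaningfully above 1; otherwise, as in the always-bought-milk example, the rule tells us nothing beyond the baseline frequency of the consequent.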