Let us now ask how to compute such association rules from the data. There might be many combinations of features: for example, if there are N features, each taking two values, then we immediately have two to the N combinations, and even for moderate values of N you would have far too many combinations to explore.

The key observation leading to an efficient algorithm for association rule mining was made by Agrawal and Srikant in the 90s. The observation was that if a combination A and B has support greater than S, meaning it occurs more than, say, S times in the data, then so do A and B individually, because otherwise the combination A and B could not occur more than S times.

So, in some sense, to find combinations with high support one can proceed iteratively. Start with all single-feature combinations that have high support, that is, support greater than S. Any pair, triplet, or quadruplet with high support must be built entirely from features in this set, so we do not have to check all combinations; we can go step by step.

This is how the algorithm works. We first scan all the records and note down which features have support greater than S, that is, which particular values of data items occur often enough. For example, diapers might have high support, beer might have high support, nuts might have high support, but other items do not.

Then we scan only this subset of the data, throwing away all the rest, and look for pairs that have high support. Remember, we have already discarded a lot of data in the very first step: all those records in which none of the high-support features occurred can simply be dropped, because they can never contain pairs or triplets with high support. We then scan the reduced data for all the pairs with high support, discard some more data, scan the reduced data set for triples, and so on, until finally we do not find any more sets with high support. This considerably reduces the complexity of finding rules and results in an efficient algorithm.

Now think about the complexity of rule mining. Once we have all the singletons, pairs, triplets, and so on with high support, we can check each such set for confidence and interestingness.
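To make these two counting steps concrete, here is a minimal sketch in Python. The transaction list, the function names, the support threshold, and the use of lift as the interestingness measure are illustrative assumptions for this sketch, not details given in the lecture.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Levelwise (Apriori-style) support counting: keep only itemsets that
    occur at least min_support times, growing them one item per pass."""
    # Pass 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Build level-k candidates only from items that survived level k-1,
        # and prune any candidate with an infrequent (k-1)-subset.
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))]
        # One more scan of the data to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            ts = set(t)
            for c in candidates:
                if c <= ts:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

def rules_from_itemsets(itemset_counts, n_transactions,
                        min_confidence=0.6, min_interest=1.0):
    """Split each frequent itemset into antecedent -> consequent and keep
    rules whose confidence and interest (lift) clear the thresholds.
    Every quantity needed is already in the support counts."""
    rules = []
    for itemset, count in itemset_counts.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                confidence = count / itemset_counts[antecedent]  # P(consequent | antecedent)
                interest = confidence / (itemset_counts[consequent] / n_transactions)  # lift
                if confidence >= min_confidence and interest >= min_interest:
                    rules.append((antecedent, consequent, confidence, interest))
    return rules

# Hypothetical toy data to show the calling convention.
baskets = [{"beer", "diapers", "nuts"}, {"beer", "diapers"},
           {"milk", "diapers"}, {"beer", "nuts"}]
counts = frequent_itemsets(baskets, min_support=2)
print(rules_from_itemsets(counts, len(baskets)))
```

The point of the sketch is that every quantity is obtained by repeated scans and counts over a shrinking candidate set.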
In practice, you would not get that many rules, so one can simply go rule by rule, compute the confidence, again by counting, and the interestingness of each rule for each possible choice of antecedent and consequent from the features in the rule, and decide which ones to declare as the rules learned from the data.

Note that this is all just counting. You are only scanning the data again and again and counting things, whether it is to find the sets with high support or, once you have found them, to compute the confidences for the different positions of antecedent and consequent and to compute the interestingness. Since it is just counting, map-reduce-like paradigms are ideal for such tasks.

The association rule mining algorithm we just described is called the Apriori algorithm. The Apriori algorithm is very popular and is used in many settings, not just market basket analysis. But there are some problems with it. First of all, it depends on why one is trying to find association rules. One of the reasons one tries to find rules is to characterize a class: what constitutes a buyer, and which kinds of words do they use? What constitutes an elephant, and what features does an elephant have? Or what constitutes a positive sentiment, and which words characterize positivity? The trouble is, association rules only work with high support, so small classes that do not enjoy high support, simply because there are not enough instances of them, get left out.

For such situations one does not normally use association rules, but instead uses what are called decision trees. We will not go into decision trees in detail, but suffice it to say that they are based on mutual information: they compute the mutual information between each feature, or each set of features, and the class in question, very much like the calculations we did earlier of mutual information between different features and the class being decided. Of course, computing mutual information is costly, because it requires all the joint probabilities to be known, and that is not really feasible for large numbers of features.
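Since the lecture only sketches this idea, here is a small illustrative example of how the mutual information between one feature and the class could be estimated from joint counts; the record layout, the function name, and the use of simple frequency estimates are my own assumptions.

```python
import math

def mutual_information(records, feature, label):
    """Estimate I(feature; class) from joint counts over the records.
    Each record is assumed to be a dict mapping feature names (and the
    class key `label`) to values; probabilities are plain frequencies."""
    n = len(records)
    joint, p_f, p_c = {}, {}, {}
    for r in records:
        f, c = r[feature], r[label]
        joint[(f, c)] = joint.get((f, c), 0) + 1
        p_f[f] = p_f.get(f, 0) + 1
        p_c[c] = p_c.get(c, 0) + 1
    mi = 0.0
    for (f, c), count in joint.items():
        p_fc = count / n
        mi += p_fc * math.log2(p_fc / ((p_f[f] / n) * (p_c[c] / n)))
    return mi
```

For sets of features, the same formula runs over tuples of feature values, which is exactly why the number of joint probabilities to estimate blows up as the number of features grows.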
The other reason why one might want to learn rules is to figure out some structure in the world: to figure out which combinations of features actually imply others, and essentially learn some structure of the world without presupposing any classes.

Association rules are great for such tasks, except that, because they rely on high support, negative rules, such as "milk and not diapers implies not beer", are lost. We have no way of figuring out "not diapers": we can only look for the value diapers, because "not diapers" could be any other value, like cereal, butter, or cola, and there are just too many possible combinations hiding inside "not diapers" to check.

In such situations one does something a little different, and techniques called interesting subgroup discovery are used instead. They are very similar to association rules; however, instead of relying only on high support, they look for correlations in the data. One of the earliest papers that talked about this concept was a 1997 paper by Sergey Brin and [inaudible]. Sergey Brin, of course, is one of the founders of Google; this was before Google.

So learning rules from data is quite feasible, and association rules are simple: high-support situations are easily covered, but figuring out small-support situations or negative rules requires more sophisticated techniques.
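As a closing illustration of why negative rules need a correlation-style check rather than raw support counting, here is a small sketch; the function name, the dictionary encoding of negated antecedents, and the use of lift as the correlation measure are assumptions made for this example, not the specific measure used in Brin's paper.

```python
def negated_rule_stats(transactions, antecedent, consequent_item):
    """Confidence and lift of a rule whose antecedent may require items to be
    ABSENT, e.g. {"milk": True, "diapers": False} -> not beer, by direct
    counting.  Support-based mining never surfaces "not diapers" as an item,
    so a correlation-style check like this is needed instead."""
    n = len(transactions)
    def holds(t):
        return all((item in t) == present for item, present in antecedent.items())
    a = sum(1 for t in transactions if holds(t))
    c = sum(1 for t in transactions if consequent_item not in t)   # "not beer"
    both = sum(1 for t in transactions if holds(t) and consequent_item not in t)
    if a == 0 or c == 0:
        return None
    confidence = both / a
    lift = confidence / (c / n)   # lift > 1 suggests a genuine (negative) association
    return confidence, lift

# Hypothetical usage with the toy baskets from the first sketch:
# negated_rule_stats(baskets, {"milk": True, "diapers": False}, "beer")
```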