Association Rules

Description
Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

Association Rules
Lecture 4/DMBI/IKI83403T/MTI/UI
Yudho Giri Sucahyo, Ph.D, CISA ([email protected])
Faculty of Computer Science, University of Indonesia
Objectives
Introduction
What is Association Mining?
Mining Association Rules
Algorithms for Association Rule Mining
Visualization
Introduction
You sell more if customers can see the product.
Customers that purchase one type of product are likely to be interested in other particular products.
Market-basket analysis: studying the composition of the shopping basket of products purchased during a single shopping event.
Market-basket data: the transactional list of purchases by customer. It is challenging to analyze, because of:
Very large numbers of records (often millions of transactions per day)
Sparseness (each market basket contains only a small portion of the items carried)
Heterogeneity (customers with different tastes tend to purchase specific subsets of items)
Introduction (2)
Product presentations can be more intelligently planned for specific times of day, days of the week, or holidays.
Can also involve sequential relationships.
Market-basket analysis is an undirected DM operation (along with clustering), seeking patterns that were previously unknown.
Cross-selling
The propensity of the purchaser of a specific item to purchase a different item.
Can be maximized by locating products that tend to be purchased by the same consumer in places where both products can be seen.
What is Association Mining?
Association rule mining (ARM):
Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Frequent pattern: a pattern that occurs frequently in a database.
Motivation: finding regularities in data
What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Why is Frequent Pattern or Association Mining an Essential Task in DM?
Foundation for many essential data mining tasks:
Association, correlation, causality
Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
Broad applications:
Basket data analysis, cross-marketing, catalog design, sale campaign analysis
Web log (clickstream) analysis, DNA sequence analysis, etc.
What is Association Mining? (2)
Examples:
Rule form: A ⇒ B [support, confidence].
buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
major(x, "CS") ^ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]
A support of 0.5% for an association rule means that 0.5% of all the transactions show that diapers and beers are purchased together.
A confidence of 60% means that 60% of the customers who purchased diapers also bought beers.
Rules that satisfy both the minimum support and minimum confidence thresholds are called strong.
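To make these two measures concrete, here is a minimal Python sketch (the transaction data and function names are illustrative, not from the lecture) that computes support and confidence over a list of market baskets:

from itertools import chain  # not required; stdlib only

# Minimal sketch: support and confidence of a rule A => B.
transactions = [
    {"diapers", "beers", "milk"},
    {"diapers", "bread"},
    {"diapers", "beers"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Of the transactions containing the antecedent, the fraction
    # that also contain the consequent.
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"diapers", "beers"}, transactions))       # 0.5
print(confidence({"diapers"}, {"beers"}, transactions))  # 0.666...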
What is Association Mining? (3)
A set of items is referred to as an itemset.
An itemset that contains k items is a k-itemset.
{beer, diaper} is a 2-itemset.
If an itemset satisfies minimum support, then it is a frequent itemset.
The set of frequent k-itemsets is commonly denoted by Lk.
ARM is a two-step process:
Find all frequent itemsets.
Generate strong AR from the frequent itemsets.
The second step is the easier of the two. The overall performance of mining AR is determined by the first step.
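Since the second step is the simpler one, it can be sketched directly. The snippet below (a sketch; the itemset counts are assumed inputs from step one) enumerates every nonempty proper subset of each frequent itemset as a rule antecedent and keeps the rules that meet the confidence threshold:

from itertools import combinations

def generate_rules(freq, n_trans, min_conf):
    # `freq` maps frozenset itemsets to absolute counts; by the Apriori
    # principle every subset of a frequent itemset is also in `freq`.
    rules = []
    for itemset, count in freq.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = count / freq[lhs]
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs),
                                  count / n_trans, conf))
    return rules

# Assumed counts over 4 transactions (the A/C example used later):
freq = {frozenset("A"): 3, frozenset("C"): 2, frozenset("AC"): 2}
for lhs, rhs, sup, conf in generate_rules(freq, 4, 0.5):
    print(lhs, "=>", rhs, f"support={sup:.0%}, confidence={conf:.1%}")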
Association Mining
Roadmap of ARM research:
Mining association rules (Agrawal et al., SIGMOD'93)
Problem extensions:
Generalized A.R. (Srikant et al.; Han et al., VLDB'95)
Quantitative A.R. (Srikant et al., SIGMOD'96)
N-dimensional A.R. (Lu et al., DMKD'98)
Meta-rule-guided mining
Better algorithms:
Fast algorithm (Agrawal et al., VLDB'94)
Partitioning (Navathe et al., VLDB'95)
Hash-based (Park et al., SIGMOD'95)
Direct Itemset Counting (Brin et al., SIGMOD'97)
Parallel mining (Agrawal et al., TKDE'96)
Distributed mining (Cheung et al., PDIS'96)
Incremental mining (Cheung et al., ICDE'96)
Many Kinds of Association Rules
Boolean association rule:
A rule concerning associations between the presence or absence of items.
Example: buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%] (R1)
Quantitative association rule:
Describes associations between quantitative items. Quantitative values for items are partitioned into intervals.
Example: age(X, "30-39") ^ income(X, "42K-48K") ⇒ buys(X, "LCD TV") (R2)
Age and income have been discretized.
R1 is a single-dimensional association rule.
R2 is a multidimensional association rule.
Many Kinds of Association Rules (2)
Single-level association rule
Example: age(X, "30-39") ⇒ buys(X, "laptop computer")
Multilevel association rule
Example: age(X, "30-39") ⇒ buys(X, "computer")
Computer is a higher-level abstraction of laptop.
Various extensions:
Mining maximal frequent patterns.
If p is a maximal frequent pattern, then any superpattern of p is not frequent.
Used to substantially reduce the number of frequent itemsets generated in mining.
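As a small illustration of the maximal-pattern idea, this sketch (with an assumed, made-up collection of frequent itemsets) keeps only the itemsets that have no frequent proper superset:

def maximal_patterns(frequent):
    # An itemset is maximal if no frequent proper superset exists.
    fs = [frozenset(p) for p in frequent]
    return [set(p) for p in fs if not any(p < q for q in fs)]

frequent = [{1}, {3}, {4}, {5}, {1, 3}, {1, 4}, {3, 4}, {4, 5}, {1, 3, 4}]
print(maximal_patterns(frequent))  # [{4, 5}, {1, 3, 4}]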
Mining Single-Dimensional Boolean Association Rules
Given:
A database of customer transactions
Each transaction is a list of items (purchased by a customer in a visit)
Find all rules that correlate the presence of one set of items with that of another set of items.
Example: 98% of people who purchase tires and auto accessories also get automotive services done.
Any number of items may appear in the consequent/antecedent of a rule.
It is possible to specify constraints on rules (e.g., find only rules involving Home Laundry Appliances).
Application Examples
Market-basket analysis
* ⇒ Fanta: what should the store do to boost Fanta sales?
Bodrex ⇒ *: what other products should the store stock up on if it has a sale on Bodrex?
Attached mailing in direct marketing
Rule Measures: Support and Confidence
Find all the rules X ^ Y ⇒ Z with minimum confidence and support:
support, s: probability that a transaction contains {X, Y, Z}
confidence, c: conditional probability that a transaction having {X, Y} also contains Z.
(Venn diagram: customers who buy diapers, customers who buy beer, and customers who buy both.)

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

With minimum support 50% and minimum confidence 50%, we have:
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
Mining Association Rules: Example

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

Min. support 50%, min. confidence 50%.

Frequent Itemset | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent.
Mining Frequent Itemsets: the Key Step
Find the frequent itemsets: the sets of items that have minimum support.
A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
Use the frequent itemsets to generate association rules.
The Apriori Algorithm
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
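The pseudocode translates into a compact runnable sketch (naive join-based candidate generation; real implementations use hash trees or prefix trees to speed up counting). Run on the database of Example 1 below, it reproduces L1, L2, and L3:

from itertools import combinations

def apriori(transactions, min_count):
    # Returns {itemset: count} for all itemsets with count >= min_count.
    Lk = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    while Lk:
        # One database scan counts all current candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in Lk}
        Lk = {c for c, n in counts.items() if n >= min_count}
        frequent.update({c: counts[c] for c in Lk})
        # Join Lk with itself to build C(k+1), then prune candidates
        # that have an infrequent k-subset (the Apriori principle).
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == len(a) + 1}
        Lk = {c for c in Ck1
              if all(frozenset(s) in frequent
                     for s in combinations(c, len(c) - 1))}
    return frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]  # Example 1, min support 50%
for itemset, count in sorted(apriori(D, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)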
The Apriori Algorithm – Example 1

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D for the counts of C1:
{1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1: {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (generated from L2): {2 3 5}
Scan D: {2 3 5}: 2
L3: {2 3 5}: 2
The Apriori Algorithm – Example 2

Sample database:
Tid | Items
1 | 3 4 5 6 7 9
2 | 1 3 4 5 13
3 | 1 2 4 5 7 11
4 | 1 3 4 8
5 | 1 3 4 10

Frequent patterns with MinSupport = 40%:
Itemset | Tids | Support
1 | 2,3,4,5 | 80% (4)
3 | 1,2,4,5 | 80% (4)
4 | 1,2,3,4,5 | 100% (5)
5 | 1,2,3 | 60% (3)
7 | 1,3 | 40% (2)
1 3 | 2,4,5 | 60% (3)
1 4 | 2,3,4,5 | 80% (4)
1 5 | 2,3 | 40% (2)
3 4 | 1,2,4,5 | 80% (4)
3 5 | 1,2 | 40% (2)
4 5 | 1,2,3 | 60% (3)
4 7 | 1,3 | 40% (2)
5 7 | 1,3 | 40% (2)
1 3 4 | 2,4,5 | 60% (3)
1 4 5 | 2,3 | 40% (2)
3 4 5 | 1,2 | 40% (2)
4 5 7 | 1,3 | 40% (2)

Association rules with MinSupport = 40% and MinConf = 70%:
Rule | Support | Confidence
7 ⇒ 5 | 40% (2) | 100%
7 ⇒ 4 | 40% (2) | 100%
5 ⇒ 4 | 60% (3) | 100%
3 ⇒ 1 | 60% (3) | 75%
1 ⇒ 3 | 60% (3) | 75%
3 ⇒ 4 | 80% (4) | 100%
4 ⇒ 3 | 80% (4) | 80%
1 ⇒ 4 | 80% (4) | 100%
4 ⇒ 1 | 80% (4) | 80%
5 7 ⇒ 4 | 40% (2) | 100%
4 7 ⇒ 5 | 40% (2) | 100%
3 5 ⇒ 4 | 40% (2) | 100%
1 5 ⇒ 4 | 40% (2) | 100%
1 3 ⇒ 4 | 60% (3) | 100%
3 4 ⇒ 1 | 60% (3) | 75%
1 4 ⇒ 3 | 60% (3) | 75%

The Apriori Algorithm – Example 2 (trace)

Scan D for the count of each candidate in C1:
Pattern | Support
1 | 4
2 | 1
3 | 4
4 | 5
5 | 3
6 | 1
7 | 2
8 | 1
9 | 1
10 | 1
11 | 1
13 | 1
Prune infrequent patterns to get L1:
1 | 4
3 | 4
4 | 5
5 | 3
7 | 2

Generate C2 candidates from L1 and scan D for their counts:
1 3 | 3
1 4 | 4
1 5 | 2
1 7 | 1
3 4 | 4
3 5 | 2
3 7 | 1
4 5 | 3
4 7 | 2
5 7 | 2
Prune infrequent patterns to get L2:
1 3 | 3
1 4 | 4
1 5 | 2
3 4 | 4
3 5 | 2
4 5 | 3
4 7 | 2
5 7 | 2

Generate C3 candidates from L2 and scan D for their counts:
1 3 4 | 3
1 3 5 | 1
1 4 5 | 2
1 4 7 | 1
1 5 7 | 1
3 4 5 | 2
3 4 7 | 1
3 5 7 | 1
4 5 7 | 2
Prune infrequent patterns to get L3:
1 3 4 | 3
1 4 5 | 2
3 4 5 | 2
4 5 7 | 2
Major Drawbacks of Apriori
Apriori has to read the database many times to test the support of candidate patterns. To find a frequent pattern X of length 50, it has to traverse the database 50 times.
On dense datasets with long patterns, as the pattern length increases, the performance of Apriori drops rapidly due to the explosion of candidate patterns.
TreeProjection Algorithm (Agarwal, Aggarwal & Prasad 2000)
(Figure: a lexicographic tree over the items 7, 5, 1, 3, 4. Level 0 is the null root carrying the transactions {7534, 5134, 7514, 134, 134}; levels 1-3 hold progressively longer itemsets, e.g. 75, 74, 51, 53, 54, 13, 14, 34 and 754, 514, 534, 134, each with the transactions projected onto it. A triangular matrix is used for counting frequent patterns of length two.)
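One reusable piece of this approach is the triangular matrix for length-two counting: map the n items to indices 0..n-1 and keep one counter per unordered pair. A simplified sketch (illustrative, not the full TreeProjection algorithm), using the transactions from the figure:

from itertools import combinations

def count_pairs(transactions, items):
    # tri[j][i] counts the pair (items[i], items[j]) for i < j.
    idx = {item: k for k, item in enumerate(items)}
    tri = [[0] * j for j in range(len(items))]
    for t in transactions:
        for a, b in combinations(sorted(idx[x] for x in t), 2):
            tri[b][a] += 1
    return {(items[i], items[j]): tri[j][i]
            for j in range(len(items)) for i in range(j)}

D = [{7, 5, 3, 4}, {5, 1, 3, 4}, {7, 5, 1, 4}, {1, 3, 4}, {1, 3, 4}]
counts = count_pairs(D, [7, 5, 1, 3, 4])
print(counts[(1, 4)], counts[(3, 4)])  # 4 4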
Eclat Algorithm (Zaki et al. 1997)
(Figure: a subset lattice from {} up to 13457 over the items 1, 3, 4, 5, 7. Each itemset carries a tid-list, and the tid-list of a longer itemset is obtained by intersecting the tid-lists of its subsets.)
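The core of Eclat is the vertical layout: each item, and later each itemset, carries the set of transaction ids containing it, so support counting becomes tid-list intersection. A minimal depth-first sketch (illustrative; Zaki's algorithm additionally organizes the search into equivalence classes), using the vertical form of the Example 2 database:

def eclat(prefix, tidlists, min_count, out):
    # `tidlists` maps each candidate extension item to the tids of the
    # transactions that contain `prefix` plus that item.
    items = sorted(tidlists)
    for i, item in enumerate(items):
        tids = tidlists[item]
        if len(tids) < min_count:
            continue
        out[frozenset(prefix | {item})] = len(tids)
        # Tid-list of a longer itemset = intersection of tid-lists.
        suffix = {j: tids & tidlists[j] for j in items[i + 1:]}
        eclat(prefix | {item}, suffix, min_count, out)

vertical = {1: {2, 3, 4, 5}, 3: {1, 2, 4, 5}, 4: {1, 2, 3, 4, 5},
            5: {1, 2, 3}, 7: {1, 3}}                 # item -> tids
result = {}
eclat(set(), vertical, 2, result)
print(result[frozenset({4, 5, 7})])  # 2, matching the pattern "4 5 7" above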
FP-Growth (Han, Pei & Yin 2000)
(Figure: the FP-Tree for the sample database. A header table over the items 4, 3, 1, 5, 7 keeps node-links to every tree node of each item; the root's child is 4:5, with branches such as 3:4 and 1:1 below it and node counts 1:3, 5:1, 5:1, 7:1, 5:1, 7:1 further down.)
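A hedged sketch of the first half of FP-Growth, building the FP-Tree only (the recursive pattern-growth mining step is omitted; class and field names are illustrative, and frequency ties are broken arbitrarily, so the exact tree shape may differ from the figure). Transactions are inserted with items ordered by descending global frequency so that shared prefixes share nodes:

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_count):
    freq = Counter(i for t in transactions for i in t)
    freq = {i: n for i, n in freq.items() if n >= min_count}
    root = Node(None, None)
    header = {i: [] for i in freq}   # header table: item -> its nodes
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq),
                           key=lambda i: (-freq[i], i)):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            else:
                node.children[item].count += 1
            node = node.children[item]
    return root, header

D = [{3, 4, 5, 7}, {1, 3, 4, 5}, {1, 4, 5, 7}, {1, 3, 4}, {1, 3, 4}]
root, header = build_fp_tree(D, 2)
print([(n.item, n.count) for n in header[4]])  # [(4, 5)]: 4:5 under the root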
CT-PRO (Sucahyo & Gopalan 2004)
(Figure: (a) the frequent items of the sample database, with an ItemTable mapping items 4, 3, 1, 5, 7 to indices 1-5 with counts 5, 4, 4, 3, 2; (b) the transactions rewritten in terms of the mapped, frequency-ordered indices; (c) the resulting global CFP-Tree, a compressed FP-Tree whose subtrees are indexed through the ItemTable.)
Mining Very Large Databases
Partition Algorithm (Savasere, Omiecinski & Navathe 1995)

Tid | Items
1 | 3 4 5 6 7 9
2 | 1 3 4 5 13
3 | 1 2 4 5 7 11
4 | 1 3 4 8
5 | 1 3 4 10

Scan partitions P1 and P2 to find their local frequent patterns:
FP1: {{1}, {1,4}, {1,5}, {1,4,5}, {3}, {3,4}, {3,4,5}, {3,5}, {4}, {4,5}, {4,7}, {4,5,7}, {5}, {5,7}, {7}}
FP2: {{1}, {1,3}, {1,4}, {1,3,4}, {3}, {3,4}, {4}, ....... {8}, {10}}
Merge them into the candidate set
C: {{1}, {1,3}, {1,4}, {1,5}, {1,3,4}, {1,4,5}, {3}, {3,4}, {3,4,5}, {3,5}, {4}, {4,5}, {4,7}, {4,5,7}, {5}, {5,7}, {7}, ....... {8}, {10}}
Scan the whole database to count the support of the patterns in C, giving
FP: {{1}, {1,3}, {1,4}, {1,5}, {1,3,4}, {1,4,5}, {3}, {3,4}, {3,4,5}, {3,5}, {4}, {4,5}, {4,7}, {4,5,7}, {5}, {5,7}, {7}}
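A sketch of the two-phase Partition idea (illustrative; the local mining below is a naive subset enumeration, where the original paper uses tid-list intersections within each in-memory partition):

from itertools import combinations

def local_frequent(part, min_sup):
    # Naive local mining: enumerate subsets, keep the locally frequent ones.
    cands = {frozenset(c) for t in part
             for r in range(1, len(t) + 1) for c in combinations(t, r)}
    return {c for c in cands
            if sum(1 for t in part if c <= t) >= min_sup * len(part)}

def partition_mine(db, n_parts, min_sup):
    size = (len(db) + n_parts - 1) // n_parts
    parts = [db[k:k + size] for k in range(0, len(db), size)]
    # Phase 1: every globally frequent itemset is frequent in some partition,
    # so the union of local results is a complete candidate set C.
    C = set().union(*(local_frequent(p, min_sup) for p in parts))
    # Phase 2: one full database scan counts the candidates globally.
    return {c: n for c in C
            if (n := sum(1 for t in db if c <= t)) >= min_sup * len(db)}

D = [{3, 4, 5, 6, 7, 9}, {1, 3, 4, 5, 13}, {1, 2, 4, 5, 7, 11},
     {1, 3, 4, 8}, {1, 3, 4, 10}]
print(len(partition_mine(D, 2, 0.4)))  # 17 patterns, as in Example 2 above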
Mining Very Large Databases (2)
Projection (Pei 2002)
(Figure: the sample transactions 7 5 3 4; 5 3 1 4; 7 5 1 4; 3 1 4; 3 1 4 are split into projected databases for the items 7, 5, 3, 1, 4. In (a) parallel projection, every transaction is written to the projection of each of its items; in (b) partition projection, each transaction is written to only one projection at a time and is passed on to later projections as mining proceeds.)
Presentation of Association Rules (Table Form)
Visualization of Association Rules Using Rule Graph
Visualization of Association Rules Using Plane Graph
Conclusion
Association rule mining is probably the most significant contribution from the database community to KDD.
A large number of papers have been published, and many interesting issues have been explored.
An interesting research direction: association analysis in other types of data, such as spatial data, multimedia data, and time series data.
References
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
David Olson and Yong Shi, Introduction to Business Data Mining, McGraw-Hill, 2007.
Agarwal, R. C., Aggarwal, C. C. & Prasad, V. V. V. 2001, 'A Tree Projection Algorithm for Generation of Frequent Item Sets', Journal of Parallel and Distributed Computing (Special Issue on High-Performance Data Mining), vol. 61, no. 3, pp. 350-371.
Han, J., Pei, J. & Yin, Y. 2000, 'Mining Frequent Patterns without Candidate Generation', in Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, pp. 1-12.
Savasere, A., Omiecinski, E. & Navathe, S. 1995, 'An Efficient Algorithm for Mining Association Rules in Large Databases', in Proceedings of the 21st International Conference on Very Large Data Bases (VLDB), Zurich, Switzerland, pp. 432-444.
References (2)
Pei, J. 2002, Pattern-growth Methods for Frequent Pattern Mining, PhD Thesis, Simon Fraser University, Canada.
Zaki, M. J. 1997, 'Parallel Algorithms for Fast Discovery of Association Rules', Data Mining and Knowledge Discovery: An International Journal, vol. 1, no. 4, pp. 343-373.
Sucahyo, Y. G. & Gopalan, R. P. 2004, 'CT-PRO: A Bottom-Up Non Recursive Frequent Itemset Mining Algorithm Using Compressed FP-Tree Data Structure', in Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI), Brighton, UK.