Concepts and Techniques
(3rd ed.)
— Chapter 6 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2013 Han, Kamber & Pei. All rights reserved.
Adapted for CSE 347-447, Lecture 5b, Spring 2015
1
Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods n Basic Concepts n Frequent Itemset Mining Methods n Which Patterns Are Interesting?—Pattern
Evaluation Methods n Summary
2
What Is Frequent Pattern Analysis? u Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
u
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
u
u
Motivation: Finding inherent regularities in data u What products were often purchased together?— Beer and diapers?!
u
What are the subsequent purchases after buying a PC?
u
What kinds of DNA are sensitive to this new drug?
u
Can we automatically classify web documents?
Applications u Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
3
Why Is Frequent Pattern Mining Important? n n
Frequent patterns: An intrinsic and important property of datasets Foundation for many essential data mining tasks n n n n n n n n
Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: discriminative, frequent pattern analysis
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Broad applications
4
Basic Concepts: Frequent Patterns
Tid
Items bought
10
Beer, Nuts, Diaper
20
Beer, Coffee, Diaper
30
Beer, Diaper, Eggs
40
Nuts, Eggs, Milk
50
Nuts, Coffee, Diaper, Eggs, Milk
Customer
buys both
Customer buys diaper
n
n n n
n
Customer buys beer
itemset: A set of one or more items k-itemset X = {x1, …, xk}
(absolute) support, or, support count of X: Frequency or occurrence of an itemset X
(relative) support, s, is the fraction of transactions that contains X (i.e., the probability that a transaction contains X)
An itemset X is frequent if X’s support is no less than a minsup threshold 5
Basic Concepts: Association Rules
Tid
Items bought
10
Beer, Nuts, Diaper
20
Beer, Coffee, Diaper
30
Beer, Diaper, Eggs
40
50
Nuts, Eggs, Milk
n
Nuts, Coffee, Diaper, Eggs, Milk
Customer
buys both
Customer buys beer
Customer buys diaper
Find all the rules X à Y with minimum support and confidence n support, s, probability that a transaction contains X and Y n confidence, c, conditional probability that a transaction having X also contains Y
Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3,
{Beer, Diaper}:3 n Association rules: (many more!) n Beer à Diaper (60%, 100%) n Diaper à Beer (60%, 75%)
6
Closed Patterns and Max-Patterns n n n n
A long pattern contains a combinatorial number of sub100
100
patterns, e.g., {a1, …, a100} contains ( 1 ) + ( 2 ) + … +
100 – 1 = 1.27*1030 sub-patterns!
(100
100) = 2
Solution: Mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y כX, with the same support as X
(proposed by Pasquier, et al. @ ICDT’99)
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y כX (proposed by
Bayardo @ SIGMOD’98)
7
Closed Patterns and Max-Patterns n n
n
Exercise: Suppose a DB contains only two transactions n <a1, …, a100>, <a1, …, a50>
n
Let min_sup = 1
What is the set of closed itemsets? n {a1, …, a100}: 1
n
{a1, …, a50}: 2
What is the set of max-patterns? n n
{a1, …, a100}: 1
What is