

  1. CS145: INTRODUCTION TO DATA MINING Set Data: Frequent Pattern Mining Instructor: Yizhou Sun yzsun@cs.ucla.edu November 17, 2018

  2. Midterm Statistics • Highest: 105, congratulations! • Mean: 86.5 • Median: 90 • Standard deviation: 10.8 • The distribution is negatively skewed (the mean is below the median)

  3. Methods to be Learnt
     • Classification: Decision Tree; KNN; SVM; NN; Logistic Regression (vector data); Naïve Bayes for Text (text data)
     • Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (vector data); PLSA (text data)
     • Prediction: Linear Regression; GLM* (vector data)
     • Frequent Pattern Mining: Apriori; FP growth (set data); GSP; PrefixSpan (sequence data)
     • Similarity Search: DTW (sequence data)

  4. Mining Frequent Patterns, Association and Correlations • Basic Concepts • Frequent Itemset Mining Methods • Pattern Evaluation Methods • Summary

  5. Set Data • A data point corresponds to a set of items • Each data point is also called a transaction • Example transaction database:
     Tid 10: Beer, Nuts, Diaper
     Tid 20: Beer, Coffee, Diaper
     Tid 30: Beer, Diaper, Eggs
     Tid 40: Nuts, Eggs, Milk
     Tid 50: Nuts, Coffee, Diaper, Eggs, Milk
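
To make the set-data representation concrete, here is a minimal sketch (my own illustration, not course code) of this transaction table in Python, with each transaction stored as a set of items:

    # Transaction database keyed by Tid; one frozenset of items per transaction.
    transactions = {
        10: frozenset({"Beer", "Nuts", "Diaper"}),
        20: frozenset({"Beer", "Coffee", "Diaper"}),
        30: frozenset({"Beer", "Diaper", "Eggs"}),
        40: frozenset({"Nuts", "Eggs", "Milk"}),
        50: frozenset({"Nuts", "Coffee", "Diaper", "Eggs", "Milk"}),
    }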

  6. What Is Frequent Pattern Analysis? • Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set • First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining • Motivation: finding inherent regularities in data • What products were often purchased together? Beer and diapers?! • What are the subsequent purchases after buying a PC? • What kinds of DNA are sensitive to this new drug?

  7. Why Is Freq. Pattern Mining Important? • Freq. pattern: an intrinsic and important property of datasets • Foundation for many essential data mining tasks • Association, correlation, and causality analysis • Sequential, structural (e.g., sub-graph) patterns • Pattern analysis in spatiotemporal, multimedia, time-series, and stream data • Classification: discriminative, frequent pattern analysis • Cluster analysis: frequent pattern-based clustering • Broad applications

  8. Basic Concepts: Frequent Patterns
     • Itemset: a set of one or more items
     • k-itemset X = {x1, …, xk}: a set of k items
     • (Absolute) support, or support count, of X: the frequency (number of occurrences) of itemset X
     • (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
     • An itemset X is frequent if X's support is no less than a minsup threshold
     • Example transactions: Tid 10: Beer, Nuts, Diaper; Tid 20: Beer, Coffee, Diaper; Tid 30: Beer, Diaper, Eggs; Tid 40: Nuts, Eggs, Milk; Tid 50: Nuts, Coffee, Diaper, Eggs, Milk
     (Venn-diagram annotation on the slide: customers who buy beer, customers who buy diapers, and customers who buy both)
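
As a small illustration of the support definitions above, the following sketch (variable and function names are my own assumptions, not from the slides) counts the absolute and relative support of an itemset over the example transactions:

    # The example transaction database from the slide, one Python set per transaction.
    transactions = [
        {"Beer", "Nuts", "Diaper"},
        {"Beer", "Coffee", "Diaper"},
        {"Beer", "Diaper", "Eggs"},
        {"Nuts", "Eggs", "Milk"},
        {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
    ]

    def support(itemset, transactions):
        """Return (absolute support count, relative support) of an itemset."""
        count = sum(1 for t in transactions if itemset <= t)  # itemset is a subset of t
        return count, count / len(transactions)

    print(support({"Beer", "Diaper"}, transactions))  # (3, 0.6)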

  9. Basic Concepts: Association Rules
     • Goal: find all rules X → Y with minimum support and confidence
     • Support, s: the probability that a transaction contains X ∪ Y
     • Confidence, c: the conditional probability that a transaction containing X also contains Y
     • Example (same transaction data), with minsup = 50% and minconf = 50%:
     • Frequent patterns: {Beer}:3, {Nuts}:3, {Diaper}:4, {Eggs}:3, {Beer, Diaper}:3
     • Strong association rules: {Beer} → {Diaper} (support 60%, confidence 100%); {Diaper} → {Beer} (support 60%, confidence 75%)
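
The support and confidence of a rule X → Y follow directly from the support counts. A minimal sketch (helper names are mine, not from the slides) that reproduces the two rules above:

    transactions = [
        {"Beer", "Nuts", "Diaper"},
        {"Beer", "Coffee", "Diaper"},
        {"Beer", "Diaper", "Eggs"},
        {"Nuts", "Eggs", "Milk"},
        {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
    ]

    def count(itemset):
        """Number of transactions containing the itemset."""
        return sum(1 for t in transactions if itemset <= t)

    def rule_stats(X, Y):
        """Support and confidence of the association rule X -> Y."""
        support = count(X | Y) / len(transactions)
        confidence = count(X | Y) / count(X)
        return support, confidence

    print(rule_stats({"Beer"}, {"Diaper"}))   # (0.6, 1.0)
    print(rule_stats({"Diaper"}, {"Beer"}))   # (0.6, 0.75)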

  10. Closed Patterns and Max-Patterns • A long pattern contains a combinatorial number of sub-patterns • E.g., {a1, …, a100} contains 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns! • In general, {a1, …, an} contains 2^n − 1 sub-patterns: (n choose 1) + (n choose 2) + ⋯ + (n choose n) = 2^n − 1

  11. Closed Patterns and Max-Patterns • Solution: mine closed patterns and max-patterns instead • An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ICDT'99) • An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @SIGMOD'98) • Closed patterns are a lossless compression of frequent patterns • Reduces the number of patterns and rules

  12. Closed Patterns and Max-Patterns • Example: DB = {{a1, …, a100}, {a1, …, a50}}, min_sup = 1 • What is the set of closed pattern(s)? • {a1, …, a100}: support 1 • {a1, …, a50}: support 2 (it does have super-patterns, but none with the same support) • What is the set of max-pattern(s)? • {a1, …, a100}: support 1 • What is the set of all frequent patterns? • !! (all 2^100 − 1 nonempty subsets of {a1, …, a100})
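
The following brute-force sketch (my own illustration on a deliberately tiny dataset, since enumerating every pattern is only feasible for a handful of items) shows how closed and max patterns can be filtered out of the full set of frequent itemsets:

    from itertools import combinations

    # Tiny example so that enumerating the whole powerset stays cheap.
    transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]
    min_sup = 1
    items = sorted(set().union(*transactions))

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # All frequent itemsets (brute force over every candidate itemset).
    frequent = {
        frozenset(c): sup(set(c))
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if sup(set(c)) >= min_sup
    }

    # Closed: frequent, and no super-pattern has the same support.
    closed = {X for X in frequent
              if not any(X < Y and frequent[Y] == frequent[X] for Y in frequent)}
    # Max: frequent, and no super-pattern is frequent at all.
    maximal = {X for X in frequent if not any(X < Y for Y in frequent)}

    print(closed)    # {a}, {a,b}, {a,c}, {a,b,c} are closed here
    print(maximal)   # {a,b,c} is the only max-pattern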

  13. Computational Complexity of Frequent Itemset Mining • How many itemsets are potentially generated in the worst case? • The number of frequent itemsets to be generated is sensitive to the minsup threshold • When minsup is low, there exist potentially an exponential number of frequent itemsets • The worst case: (M choose N) = M × (M − 1) × ⋯ × (M − N + 1) / N!, where M is the number of distinct items and N is the maximum transaction length
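
For a sense of scale, the worst-case count can be evaluated directly with Python's math.comb; the values of M and N below are illustrative choices of mine, not from the slides:

    import math

    M = 100   # number of distinct items (illustrative value)
    N = 10    # maximum transaction length (illustrative value)

    # Number of possible N-itemsets: (M choose N) = M*(M-1)*...*(M-N+1) / N!
    print(math.comb(M, N))                                   # 17310309456440
    # All possible itemsets up to size N grow even faster:
    print(sum(math.comb(M, k) for k in range(1, N + 1)))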

  14. Mining Frequent Patterns, Association and Correlations • Basic Concepts • Frequent Itemset Mining Methods • Pattern Evaluation Methods • Summary

  15. Scalable Frequent Itemset Mining Methods • Apriori: A Candidate Generation-and-Test Approach • Improving the Efficiency of Apriori • FPGrowth: A Frequent Pattern-Growth Approach • *ECLAT: Frequent Pattern Mining with Vertical Data Format • Generating Association Rules

  16. The Apriori Property and Scalable Mining Methods • The Apriori property of frequent patterns • Any nonempty subset of a frequent itemset must be frequent • E.g., if {beer, diaper, nuts} is frequent, so is {beer, diaper} • i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper} • Scalable mining methods: three major approaches • Apriori (Agrawal & Srikant @VLDB'94) • Freq. pattern growth (FPgrowth, Han, Pei & Yin @SIGMOD'00) • *Vertical data format approach (Eclat)

  17. Apriori: A Candidate Generation & Test Approach • Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @VLDB'94, Mannila, et al. @KDD'94) • Method: • Initially, scan DB once to get the frequent 1-itemsets • Generate length-k candidate itemsets from the length-(k-1) frequent itemsets • Test the candidates against the DB • Terminate when no frequent or candidate set can be generated

  18. From Frequent (k-1)-Itemsets To Frequent k-Itemsets • Ck: candidate itemsets of size k • Lk: frequent itemsets of size k • From Lk-1 to Ck (candidate generation): the join step and the prune step • From Ck to Lk: test the candidates by scanning the database

  19. Candidates Generation • Assume a pre-specified order for items, e.g., alphabetical order • How to generate the candidates Ck? • Step 1: self-joining Lk-1 • Two length-(k-1) itemsets l1 and l2 can be joined only if their first k-2 items are identical and, for the last item, l1[k-1] < l2[k-1] (why?) • Step 2: pruning • Why do we need pruning of candidates? • How? Again, use the Apriori property • A candidate itemset can be safely pruned if it contains an infrequent subset
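
A sketch of the join and prune steps in Python (function and variable names are mine, not the course's); the ordering condition l1[k-1] < l2[k-1] ensures each candidate is generated exactly once:

    from itertools import combinations

    def generate_candidates(L_prev, k):
        """Generate the size-k candidate set Ck from the frequent (k-1)-itemsets."""
        L_prev = [tuple(sorted(x)) for x in L_prev]
        prev_set = set(L_prev)
        candidates = set()
        # Join step: merge two (k-1)-itemsets that share their first k-2 items.
        for l1 in L_prev:
            for l2 in L_prev:
                if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                    c = l1 + (l2[k - 2],)
                    # Prune step: keep c only if every (k-1)-subset is frequent.
                    if all(sub in prev_set for sub in combinations(c, k - 1)):
                        candidates.add(c)
        return candidates

    # The L3 -> C4 example from the next slide:
    L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
          ("a", "c", "e"), ("b", "c", "d")]
    print(generate_candidates(L3, 4))  # {('a', 'b', 'c', 'd')}; acde is pruned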

  20. Candidate-Generation Example • Example of candidate generation from L3 to C4 • L3 = {abc, abd, acd, ace, bcd} • Self-joining L3*L3: abcd from abc and abd; acde from acd and ace • Pruning: acde is removed because its subset ade is not in L3 • C4 = {abcd}

  21. The Apriori Algorithm: An Example (min_sup = 2)
     Database TDB: Tid 10: A, C, D; Tid 20: B, C, E; Tid 30: A, B, C, E; Tid 40: B, E
     1st scan, C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
     L1: {A}:2, {B}:3, {C}:3, {E}:3
     C2 (from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
     2nd scan, C2 counts: {A, B}:1, {A, C}:2, {A, E}:1, {B, C}:2, {B, E}:3, {C, E}:2
     L2: {A, C}:2, {B, C}:2, {B, E}:3, {C, E}:2
     C3 (from L2): {B, C, E}
     3rd scan, L3: {B, C, E}:2

  22. The Apriori Algorithm (Pseudo-Code)
     Ck: candidate itemsets of size k
     Lk: frequent itemsets of size k
     L1 = {frequent items};
     for (k = 2; Lk-1 != ∅; k++) do begin
         Ck = candidates generated from Lk-1;
         for each transaction t in database do
             increment the count of all candidates in Ck that are contained in t
         Lk = candidates in Ck with min_support
     end
     return ∪k Lk;
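
Below is a compact, runnable Python translation of the pseudo-code (my own sketch, not the official course implementation), reusing the join-and-prune candidate generation from the earlier sketch and doing one database scan per level:

    from collections import defaultdict
    from itertools import combinations

    def apriori(transactions, min_sup):
        """Return {frequent itemset: support count} for all levels k = 1, 2, ..."""
        transactions = [frozenset(t) for t in transactions]
        # L1 = {frequent items}
        counts = defaultdict(int)
        for t in transactions:
            for item in t:
                counts[frozenset([item])] += 1
        L = {x: c for x, c in counts.items() if c >= min_sup}
        result = dict(L)
        k = 2
        while L:                        # for (k = 2; Lk-1 != empty; k++)
            prev = {tuple(sorted(x)) for x in L}
            # Ck = candidates generated from Lk-1 (join + prune)
            Ck = set()
            for l1 in prev:
                for l2 in prev:
                    if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                        c = l1 + (l2[k - 2],)
                        if all(s in prev for s in combinations(c, k - 1)):
                            Ck.add(frozenset(c))
            # One scan: count the candidates contained in each transaction.
            counts = defaultdict(int)
            for t in transactions:
                for c in Ck:
                    if c <= t:
                        counts[c] += 1
            # Lk = candidates in Ck with min_support
            L = {x: n for x, n in counts.items() if n >= min_sup}
            result.update(L)
            k += 1
        return result

    # Database TDB from the previous slide, min_sup = 2:
    tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    for itemset, n in sorted(apriori(tdb, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), n)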

  23. Questions • How many scans of the DB are needed for the Apriori algorithm? • When (k = ?) does the Apriori algorithm generate the biggest number of candidate itemsets? • Is support counting for candidates expensive?

  24. Further Improvement of the Apriori Method • Major computational challenges • Multiple scans of the transaction database • Huge number of candidates • Tedious workload of support counting for candidates • Improving Apriori: general ideas • Reduce passes of transaction database scans • Shrink the number of candidates • Facilitate support counting of candidates

  25. *Partition: Scan Database Only Twice • Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB • Scan 1: partition the database and find local frequent patterns • Scan 2: consolidate global frequent patterns • A. Savasere, E. Omiecinski and S. Navathe, VLDB'95
     (Figure on the slide: DB1 + DB2 + … + DBk = DB; if sup1(i) < σ|DB1|, sup2(i) < σ|DB2|, …, supk(i) < σ|DBk|, then sup(i) < σ|DB|)
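
A rough sketch of the two-scan partition idea (my own illustration; the partition sizes, the stand-in brute-force local miner, and all names are assumptions): mine each partition with a proportional local threshold in the first scan, then count the union of the locally frequent itemsets in a second scan:

    from collections import defaultdict
    from itertools import combinations

    def brute_force_frequent(transactions, min_sup):
        """Stand-in local miner: enumerate all itemsets (fine only for small partitions)."""
        items = sorted(set().union(*transactions))
        freq = {}
        for k in range(1, len(items) + 1):
            for c in combinations(items, k):
                cnt = sum(1 for t in transactions if set(c) <= t)
                if cnt >= min_sup:
                    freq[frozenset(c)] = cnt
        return freq

    def partition_mining(transactions, rel_min_sup, num_parts=2):
        # Scan 1: mine each partition with a proportional local threshold.
        size = (len(transactions) + num_parts - 1) // num_parts
        candidates = set()
        for i in range(0, len(transactions), size):
            part = transactions[i:i + size]
            local_min_sup = max(1, int(rel_min_sup * len(part)))
            candidates |= set(brute_force_frequent(part, local_min_sup))
        # Scan 2: count the global support of every locally frequent itemset.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= rel_min_sup * len(transactions)}

    tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    print(partition_mining(tdb, rel_min_sup=0.5))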
