33459-01 Principles of Knowledge Discovery in Data
Lecture 2
Week 2 (March 17) and Week 3 (March 24)
Association Rule Mining
Lecture by: Dr. Osmar R. Zaïane

33459-01: Principles of Knowledge Discovery in Data – March-June, 2006 (Dr. O. Zaiane)

Course Content
• Introduction to Data Mining
• Association Analysis
• Sequential Pattern Analysis
• Classification and Prediction
• Contrast Sets
• Data Clustering
• Outlier Detection
• Web Mining

What Is Association Rule Mining?
• Association rule mining searches for relationships between items in a dataset:
  – aims at discovering associations between items in a transactional database.
[Figure: Store transactions {a, b, c, d, …} → find combinations of items that typically occur together → {x, y, z}]
• Rule form: "Body → Head [support, confidence]"
  buys(x, "bread") → buys(x, "milk") [0.6%, 65%]
  major(x, "CS") ^ takes(x, "DB") → grade(x, "A") [1%, 75%]

Transactional Databases

Transaction                      Frequent itemset     Rule
{bread, milk, beer, …}           (Bread, milk)        Bread → milk
{term 1, term 2, …, term n}      (term 2, term 25)    term 2 → term 25
{f1, f2, …, Ca}                  (f3, f5, fα)         f3 ^ f5 → fα

The {f1, f2, …, Ca} row is labelled "Automatic diagnostic" on the slide. The term row is shown beside this sample text:
"Background, Motivation and General Outline of the Proposed Project. We have been collecting tremendous amounts of information counting on the power of computers to help efficiently sort through this amalgam of information. Unfortunately, these massive collections of data stored on disparate dispersed media very rapidly become overwhelming. Regrettably, most of the collected large datasets remain unanalyzed due to lack of appropriate, effective and scalable techniques."

Lecture Outline
Part I: Concepts (30 minutes)
• Basic concepts
• Support and Confidence
• Naïve approach
Part II: The Apriori Algorithm (30 minutes)
• Principles
• Algorithm
• Running Example
Part III: The FP-Growth Algorithm (30 minutes)
• FP-tree structure
• Running Example
Part IV: More Advanced Concepts (30 minutes)
• Database layout and space search approach
• Other types of patterns and constraints

Association Rule Mining
• Mining association rules (Agrawal et al. SIGMOD93)
• Fast algorithm (Agrawal et al. VLDB94)
• Partitioning (Navathe et al. VLDB95)
• Hash-based (Park et al. SIGMOD95)
• Multilevel A.R. (Han et al. VLDB95)
• Generalized A.R. (Srikant et al. VLDB95)
• Incremental mining (Cheung et al. ICDE96)
• Parallel mining (Agrawal et al. TKDE96)
• Quantitative A.R. (Srikant et al. SIGMOD96)
• Distributed mining (Cheung et al. PDIS96)
• Meta-rule guided mining (Kamber et al. KDD97)
• Dynamic Itemset Counting (Brin et al. SIGMOD97)
• N-dimensional A.R. (Lu et al. DMKD'98)
• Constraint A.R. (Ng et al. SIGMOD'98)
• A.R. with recurrent items (Zaïane et al. ICDE'00)
• FP without candidate generation (Han et al. SIGMOD'00)
• DualMiner (Bucilă et al. KDD'02)
• COFI algorithm (El-Hajj et al. DaWaK'03)
And many, many others:
• Spatial AR; sequence associations; AR for multimedia; AR in time series; AR with progressive refinement; etc.
Finding Rules in Transaction Data Set
• 6 transactions
• 5 items: {Beer, Bread, Jelly, Milk, PeanutButter}

Transaction   Items
T1            Bread, Jelly, PeanutButter
T2            Bread, PeanutButter
T3            Bread, Milk, PeanutButter
T4            Beer, Bread
T5            Beer, Milk
T6            Bread, Milk

• Searching for rules of the form X → Y, where X and Y are sets of items
  – e.g. Bread → Jelly; Bread, Jelly → PeanutButter
• Design an efficient algorithm for mining association rules in large data sets
• Develop an effective approach for distinguishing interesting rules from irrelevant ones

Basic Concepts
• A transaction is a set of items: T = {i_a, i_b, …, i_t}
• T ⊂ I, where I is the set of all possible items {i_1, i_2, …, i_d}
• D, the task-relevant data, is a set of transactions D = {T_1, T_2, …, T_n}.
• An association rule is of the form P → Q, where P ⊂ I, Q ⊂ I, and P ∩ Q = ∅

Support of an Itemset
• The support of an item (or itemset) X is the percentage of transactions in which that item (or those items) occurs:

$$\mathrm{support}(X) = \frac{\#X}{n}$$

  where #X is the number of transactions containing X and n is the number of transactions in D.
• Support for all subsets of items
  – Note the exponential growth in the set of items
  – 5 items: 31 sets

Supports below are computed over transactions T1 to T6 listed above. Percentages are truncated to whole numbers (1/6 appears as 16%).

Itemset                                   Support
Beer                                      33%
Bread                                     83%
Jelly                                     16%
Milk                                      50%
PeanutButter                              50%
Beer, Bread                               16%
Beer, Jelly                               0%
Beer, Milk                                16%
Beer, PeanutButter                        0%
Bread, Jelly                              16%
Bread, Milk                               33%
Bread, PeanutButter                       50%
Jelly, Milk                               0%
Jelly, PeanutButter                       16%
Milk, PeanutButter                        16%
Beer, Bread, Jelly                        0%
Beer, Bread, Milk                         0%
Beer, Bread, PeanutButter                 0%
Beer, Jelly, Milk                         0%
Beer, Jelly, PeanutButter                 0%
Beer, Milk, PeanutButter                  0%
Bread, Jelly, Milk                        0%
Bread, Jelly, PeanutButter                16%
Bread, Milk, PeanutButter                 16%
Jelly, Milk, PeanutButter                 0%
Beer, Bread, Jelly, Milk                  0%
Beer, Bread, Jelly, PeanutButter          0%
Beer, Bread, Milk, PeanutButter           0%
Beer, Jelly, Milk, PeanutButter           0%
Bread, Jelly, Milk, PeanutButter          0%
Beer, Bread, Jelly, Milk, PeanutButter    0%

Basic Concepts (con't)
• A set of items is referred to as an itemset.
• An itemset containing k items is called a k-itemset. For example, {Jelly, Milk, Bread} is a 3-itemset.
• An itemset can also be seen as a conjunction of items (or a predicate).
• Support of P = P_1 ∧ P_2 ∧ ... ∧ P_k in D, written σ(P/D), is the probability that P occurs in D: it is the percentage of transactions T in D satisfying P, i.e. the number of such T divided by the cardinality of D.
• P → Q holds in D with support s, and P → Q has a confidence c in the transaction set D:
  Support(P → Q) = Probability(P ∪ Q)
  Confidence(P → Q) = Probability(Q | P)

Support and Confidence of an Association Rule
• The support of an association rule X → Y is the percentage of transactions that contain X ∪ Y:

$$\mathrm{support}(X \rightarrow Y) = \frac{\#(X \cup Y)}{n}$$

• The confidence of an association rule X → Y is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X:

$$\mathrm{confidence}(X \rightarrow Y) = \frac{\#(X \cup Y)}{\#X}$$

• Confidence of a rule P → Q in database D, written ϕ(P → Q/D), is the ratio of σ((P ∧ Q)/D) to σ(P/D):

$$\mathrm{confidence}(X \rightarrow Y) = \frac{\mathrm{support}(X \rightarrow Y)}{\mathrm{support}(X)}$$

Support and Confidence – cont.
• What are the support and confidence of the following rules (over transactions T1 to T6 above)?
  – Beer → Bread
  – {Bread, PeanutButter} → Jelly
• Support and confidence for some association rules

Rule                              Support   Confidence
Bread → PeanutButter              50%       60%
PeanutButter → Bread              50%       100%
Beer → Bread                      16%       50%
PeanutButter → Jelly              16%       33%
Jelly → PeanutButter              16%       100%
Jelly → Milk                      0%        0%
{Bread, PeanutButter} → Jelly     16%       33%

• Why the difference between the first two rules? Both cover the same 3 transactions, but Bread occurs in 5 transactions while PeanutButter occurs in only 3, so the confidence denominators differ.
• Support measures how often the rule occurs in the database.
• Confidence measures the strength of the rule.
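The definitions of support and confidence can be checked with a minimal Python sketch over the six example transactions. The function names (`count`, `support`, `rule_support`, `confidence`, `all_itemsets`) are illustrative and not from the lecture, and returning 0 for the confidence of a rule whose body never occurs is an assumption, since the ratio is undefined in that case. Enumerating every itemset this way is the naïve approach and is only practical for a handful of items.

```python
from itertools import combinations

# The six transactions T1..T6 of the running example.
TRANSACTIONS = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
    {"Bread", "Milk"},
]


def count(itemset, transactions):
    """#X: number of transactions containing every item of the itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset, transactions):
    """support(X) = #X / n."""
    return count(itemset, transactions) / len(transactions)


def rule_support(x, y, transactions):
    """support(X -> Y) = #(X u Y) / n."""
    return support(set(x) | set(y), transactions)


def confidence(x, y, transactions):
    """confidence(X -> Y) = #(X u Y) / #X."""
    x_count = count(x, transactions)
    if x_count == 0:
        # Assumption: report 0 when X never occurs (the ratio is undefined).
        return 0.0
    return count(set(x) | set(y), transactions) / x_count


def all_itemsets(items):
    """Yield every non-empty subset of items (exponential in len(items))."""
    ordered = sorted(items)
    for k in range(1, len(ordered) + 1):
        for combo in combinations(ordered, k):
            yield frozenset(combo)


if __name__ == "__main__":
    items = set().union(*TRANSACTIONS)
    itemsets = list(all_itemsets(items))
    # 5 items give 2**5 - 1 = 31 non-empty itemsets.
    print(len(itemsets), "itemsets")
    for itemset in itemsets:
        print(sorted(itemset), f"{support(itemset, TRANSACTIONS):.1%}")
    for body, head in [({"Beer"}, {"Bread"}),
                       ({"Bread", "PeanutButter"}, {"Jelly"})]:
        print(sorted(body), "->", sorted(head),
              f"support={rule_support(body, head, TRANSACTIONS):.1%}",
              f"confidence={confidence(body, head, TRANSACTIONS):.1%}")
```

Items are kept in Python sets so that "transaction contains X" becomes a subset test; swapping the body and head of a rule leaves `rule_support` unchanged but can change `confidence`, because only the denominator #X differs.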