
2.5 Association Rule Mining

2.5 Association Rule Mining based on: Chloé-Agathe Azencott and Karsten Borgwardt. Course Data Mining in Bioinformatics. Chapter Graph Mining. 2012 Department Biosysteme Karsten Borgwardt Data Mining 2 Course, Basel Spring Semester


  1. 2.5 Association Rule Mining based on: Chloé-Agathe Azencott and Karsten Borgwardt. Course 'Data Mining in Bioinformatics'. Chapter Graph Mining. 2012 Department Biosysteme Karsten Borgwardt Data Mining 2 Course, Basel Spring Semester 2016 179 / 230

  2. Keyword co-occurrence: Goals
     - To understand the link between keyword co-occurrence, association rule mining, and frequent itemset mining
     - To understand how the computation of frequent itemsets can be sped up

  3. Keyword co-occurrence: Problem
     - Find sets of keywords that often co-occur
     - Common problem in biomedical literature: find associations between genes, proteins or other entities using co-occurrence search
     - Keyword co-occurrence search is an instance of a more general problem in data mining, called association rule mining.

  4. Association Rules: Definitions
     - Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of items (keywords)
     - Let $D$ be the database of transactions $T$ (collection of documents)
     - A transaction $T \in D$ is a set of items: $T \subseteq I$ (a document is a set of keywords)
     - Let $A$ be a set of items: $A \subseteq T$.
     - An association rule is an implication of the form
       $$A \subseteq T \Rightarrow B \subseteq T, \tag{1}$$
       where $A, B \subseteq I$ and $A \cap B = \emptyset$

  5. Association Rules: Support and Confidence
     - The rule $A \Rightarrow B$ holds in the transaction set $D$ with support $s$, where $s$ is the percentage of transactions in $D$ that contain $A \cup B$:
       $$\mathrm{support}(A \Rightarrow B) = \frac{|\{T \in D \mid A \subseteq T \wedge B \subseteq T\}|}{|\{T \in D\}|} \tag{2}$$
     - The rule $A \Rightarrow B$ has confidence $c$ in the transaction set $D$, where $c$ is the percentage of transactions in $D$ containing $A$ that also contain $B$:
       $$\mathrm{confidence}(A \Rightarrow B) = \frac{|\{T \in D \mid A \subseteq T \wedge B \subseteq T\}|}{|\{T \in D \mid A \subseteq T\}|} \tag{3}$$
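
The two ratios in (2) and (3) can be sketched in a few lines of Python. The function names and the toy keyword database below are illustrative assumptions, not part of the slides; transactions are modelled as plain Python sets.

```python
def support(database, a, b):
    """Fraction of transactions that contain every item of A and of B (Eq. 2)."""
    both = set(a) | set(b)
    return sum(1 for t in database if both <= t) / len(database)


def confidence(database, a, b):
    """Among the transactions containing A, the fraction that also contain B (Eq. 3)."""
    a = set(a)
    both = a | set(b)
    n_a = sum(1 for t in database if a <= t)
    n_ab = sum(1 for t in database if both <= t)
    # Convention chosen for this sketch: confidence is 0 when A never occurs.
    return n_ab / n_a if n_a else 0.0


# Hypothetical toy database: each document is reduced to its set of keywords.
docs = [
    {"BRCA1", "cancer"},
    {"BRCA1", "cancer", "TP53"},
    {"TP53", "cancer"},
    {"BRCA1"},
]

print(support(docs, {"BRCA1"}, {"cancer"}))     # 2 of the 4 documents
print(confidence(docs, {"BRCA1"}, {"cancer"}))  # 2 of the 3 documents with BRCA1
```

Both functions return fractions in [0, 1]; multiply by 100 to get the percentages used on the slide.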

  6. Association Rules: Strong rules
     - Rules that satisfy both a minimum support threshold (minsup) and a minimum confidence threshold (minconf) are called strong association rules, and these are the ones we are after!
     - Finding strong rules
       1. Search for all frequent itemsets (sets of items that occur in at least minsup % of all transactions)
       2. Generate strong association rules from the frequent itemsets
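
Step 2 can be sketched as follows, assuming step 1 has already produced a dictionary that maps every frequent itemset to its occurrence count. The function name `strong_rules` and the counts in the example are hypothetical. The confidence of A ⇒ B is obtained as count(A ∪ B) / count(A), which is equation (3) with the common denominator cancelled.

```python
from itertools import combinations


def strong_rules(freq_counts, n_transactions, minsup, minconf):
    """Return (A, B, support, confidence) for every strong rule A => B.

    freq_counts maps frozenset -> occurrence count and is assumed to contain
    ALL frequent itemsets, so every subset of a frequent itemset has a count.
    """
    rules = []
    for itemset, count in freq_counts.items():
        sup = count / n_transactions
        if len(itemset) < 2 or sup < minsup:
            continue
        # Split the itemset into a non-empty antecedent A and consequent B.
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                conf = count / freq_counts[a]
                if conf >= minconf:
                    rules.append((a, itemset - a, sup, conf))
    return rules


# Hypothetical counts over 16 transactions.
counts = {
    frozenset({"b"}): 11,
    frozenset({"c"}): 9,
    frozenset({"b", "c"}): 7,
}
rules = strong_rules(counts, n_transactions=16, minsup=0.25, minconf=0.7)
for a, b, sup, conf in rules:
    print(sorted(a), "=>", sorted(b), round(sup, 3), round(conf, 3))
```

With these numbers only c ⇒ b passes the confidence threshold (7/9), while b ⇒ c does not (7/11), which shows that confidence is not symmetric.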

  7. Association Rules: Frequent pattern mining. Frequent item set mining
     - Market basket analysis: find items that are frequently purchased together
     - Given
       - a set $B = \{i_1, i_2, \ldots, i_n\}$ of items
       - a list $T = \{t_1, t_2, \ldots, t_m\}$ of transactions $t_j \subseteq B$
       - a minimum number of occurrences $s_{\min} \in \mathbb{N}$
     - Find the set of frequent item sets, i.e.
       $$F(s_{\min}) = \{ I \subseteq B : |\{k : I \subseteq t_k\}| \geq s_{\min} \}$$

  8. Association Rules: Apriori [Agrawal et al., 1994]
     - Brute force approach
       - Enumerate all $2^n$ subsets of $B$
       - For each subset, count how many of the transactions $t_1, \ldots, t_m$ contain it
       - Generally infeasible
     - The Apriori property
       - If an itemset $A$ is frequent, then any subset $B$ of $A$ ($B \subseteq A$) is frequent as well.
       - If $B$ is infrequent, then any superset $A$ of $B$ ($A \supseteq B$) is infrequent as well.
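
A minimal sketch of the brute-force approach, useful only for tiny item sets since it visits every non-empty subset. The function names and the four-transaction toy list are assumptions made for illustration. The second function checks the Apriori property on the result: every non-empty proper subset of a frequent itemset must itself be frequent.

```python
from itertools import combinations


def brute_force_frequent(items, transactions, s_min):
    """Count every non-empty subset of `items`; keep those with count >= s_min."""
    items = sorted(items)
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            count = sum(1 for t in transactions if candidate <= t)
            if count >= s_min:
                frequent[candidate] = count
    return frequent


def is_downward_closed(frequent):
    """True if every non-empty proper subset of a frequent itemset is frequent."""
    return all(
        frozenset(sub) in frequent
        for itemset in frequent
        for k in range(1, len(itemset))
        for sub in combinations(itemset, k)
    )


# Hypothetical toy data.
toy = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"b"}]
found = brute_force_frequent({"a", "b", "c"}, toy, s_min=2)
print(sorted("".join(sorted(s)) for s in found))
print(is_downward_closed(found))
```

The outer loops run over 2^n − 1 candidates regardless of the data, which is why this approach does not scale.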

  9. Association Rules: Apriori pseudocode
     1. Determine frequent items = $k$-itemsets with $k = 1$
     2. Join all pairs of frequent $k$-itemsets that differ in at most 1 item = candidates $C_{k+1}$ for being frequent $(k+1)$-itemsets
     3. Check the frequency of these candidates $C_{k+1}$: the frequent ones form the frequent $(k+1)$-itemsets (trick: immediately discard any candidate that contains an infrequent $k$-itemset)
     4. Repeat from Step 2 until no more candidate is frequent.
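
The four steps can be turned into a short level-wise loop. This is a sketch under simplifying assumptions, not the course's reference implementation: candidates are formed by joining every pair of frequent k-itemsets whose union has k + 1 items, and support is counted by a full pass over the transactions. The function name and toy data are hypothetical.

```python
from itertools import combinations


def apriori(transactions, s_min):
    """Return {frozenset: count} for all itemsets occurring in >= s_min transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Step 1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= s_min}
    frequent = dict(level)

    k = 1
    while level:
        # Step 2: join pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Trick from step 3: drop a candidate if any k-subset is infrequent.
            if all(frozenset(sub) in level for sub in combinations(union, k)):
                candidates.add(union)

        # Step 3: count the surviving candidates.
        level = {}
        for cand in candidates:
            n = sum(1 for t in transactions if cand <= t)
            if n >= s_min:
                level[cand] = n
        frequent.update(level)
        k += 1  # Step 4: repeat with the next size.
    return frequent


toy = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"b"}]
result = apriori(toy, s_min=2)
print(sorted("".join(sorted(s)) for s in result))
```

In the toy run the candidate {a, b, c} is never counted, because its subset {a, c} is already known to be infrequent.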

  10. Association Rules: A Priori. Generating unique candidates
      - There are $k!$ ways of generating a single set of $k$ items
      - Ensure we do it only once ⇒ Idea: assign a unique parent set to each set
      - Canonical form
        - The set of possible parents of an item set $I$ is the set of its maximal proper subsets: $\{J \subset I \mid \nexists K : J \subset K \subset I\}$
        - Put an ordering on $B$: $i_1 < i_2 < \cdots < i_n$
        - Define the canonical parent of $I$ as $p_c(I) = I \setminus \{\max_{a \in I} a\}$
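
As a small illustration, the possible parents and the canonical parent can be computed directly from these definitions. The sketch assumes that Python's built-in ordering of the items (alphabetical for strings) plays the role of the ordering on B; the function names are hypothetical.

```python
def possible_parents(itemset):
    """Maximal proper subsets of I: remove exactly one item."""
    return [itemset - {x} for x in sorted(itemset)]


def canonical_parent(itemset):
    """p_c(I): remove the largest item with respect to the ordering on B."""
    return itemset - {max(itemset)}


I = frozenset({"a", "c", "b", "e"})
print([sorted(p) for p in possible_parents(I)])  # one parent per removed item
print(sorted(canonical_parent(I)))               # the parent without 'e'
```

An item set of size k has k possible parents but exactly one canonical parent, so generating each set only from its canonical parent produces it once.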

  11. Association Rules: A Priori. Canonical code words
      - Code word for $I \subseteq B$: any word $w$ on the alphabet $B$ that spells out the items of $I$, each exactly once
      - Canonical code word of $I$, $w_c(I)$: smallest of these words, in lexicographic order. E.g. $\{a, c, b, e\} \to abce$
      - The canonical parent $p_c(I)$ of $I$ is described by the longest proper prefix of $w_c(I)$.
      - Prefix property: The longest proper prefix of a canonical code word is a canonical code word itself. Equivalently, any prefix of a canonical code word is a canonical code word itself.
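
The following sketch illustrates code words under the assumption that every item is a single character and that alphabetical order is the ordering on B. The function names are made up for this example. It enumerates all code words of a set, picks the smallest, and checks the prefix property on it.

```python
from itertools import permutations


def code_words(itemset):
    """All words that list the items of the set, each exactly once."""
    return ["".join(p) for p in permutations(itemset)]


def canonical_code_word(itemset):
    """The lexicographically smallest code word: the items in sorted order."""
    return "".join(sorted(itemset))


I = {"a", "c", "b", "e"}
w = canonical_code_word(I)
print(w, len(code_words(I)))

# Prefix property: each proper prefix is the canonical code word of its own items.
prefix_ok = all(canonical_code_word(set(w[:i])) == w[:i] for i in range(1, len(w)))
print(prefix_ok)
```

The k! code words correspond to the k! ways of building the same set item by item; only the sorted one is kept.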

  12. Association Rules: A Priori. Candidate set generation
      - From frequent item sets of size $k - 1$, construct item sets of size $k$ by appending (frequent) items to their canonical code words
      - Only do so for items greater than the last letter of the canonical code word
      - Example: abe → abef, abeg, but not abec
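
A minimal sketch of this extension rule, assuming single-character items ordered alphabetically. The function name `extend` and the set of frequent items are hypothetical.

```python
def extend(code_word, frequent_items):
    """Append each frequent item that is greater than the last letter."""
    last = code_word[-1]
    return [code_word + item for item in sorted(frequent_items) if item > last]


# Hypothetical set of frequent single items.
children = extend("abe", {"a", "b", "c", "e", "f", "g"})
print(children)
```

Because only larger items are appended, every child is again a canonical code word, and abec is never produced: the same set is reached as abce from its canonical parent abc.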

  13. Association Rules: A Priori. Prefix tree
      Full prefix tree for $B = \{a, b, c, d\}$, listed by level:
      - a, b, c, d
      - ab, ac, ad, bc, bd, cd
      - abc, abd, acd, bcd
      - abcd
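
The levels of the full prefix tree can be generated by repeatedly appending larger items, starting from the empty word at the root. This sketch assumes single-character items in alphabetical order; `prefix_tree_levels` is an illustrative name.

```python
def prefix_tree_levels(items):
    """Canonical code words of all non-empty subsets, grouped by length."""
    items = sorted(items)
    level = [""]  # the root of the tree is the empty word
    levels = []
    while True:
        # A child appends one item that is greater than the parent's last letter.
        children = [w + x for w in level for x in items if not w or x > w[-1]]
        if not children:
            break
        levels.append(children)
        level = children
    return levels


levels = prefix_tree_levels("abcd")
for words in levels:
    print(" ".join(words))
```

Each non-empty subset appears exactly once, so the tree for four items has 2^4 − 1 = 15 nodes below the root.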

  14. Association Rules: A Priori. Pruning the prefix tree
      - Only generate unique item sets
      - A-priori property ⇒ Prune branches at infrequent items
      - Size-based pruning
      - Example transaction list:
        T = {{a, b}, {a, b, c}, {b, c}, {b}, {b, d}, {d}, {a, c}, {b, c}, {d}, {a, c}, {b, c}, {b, c, d}, {d}, {b}, {b, c, d}, {b, c, d}}
      - [Figure: prefix tree for B = {a, b, c, d} with every node annotated by its number of occurrences in T; single items: a 4, b 11, c 9, d 7]
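
The pruning can be tried out on the transaction list of this slide. The sketch below walks the prefix tree and stops descending as soon as a node is infrequent. The threshold `s_min = 4` is an assumption chosen for illustration (the slide does not state one), and the counts are recomputed from the transaction list as transcribed here.

```python
# Transaction list from the slide, one string of items per transaction.
T = [set(t) for t in [
    "ab", "abc", "bc", "b", "bd", "d", "ac", "bc",
    "d", "ac", "bc", "bcd", "d", "b", "bcd", "bcd",
]]


def count(word):
    """Number of transactions that contain every item of the code word."""
    return sum(1 for t in T if set(word) <= t)


def mine(prefix, items, s_min, out):
    """Visit the prefix tree below `prefix`, pruning infrequent branches."""
    for i, item in enumerate(items):
        word = prefix + item
        c = count(word)
        if c >= s_min:
            out[word] = c
            # Only items after `item` may be appended (unique generation).
            mine(word, items[i + 1:], s_min, out)
        # Otherwise the whole branch below `word` is skipped.


frequent = {}
mine("", "abcd", s_min=4, out=frequent)  # s_min = 4 is an assumed threshold
print(frequent)
```

With this threshold nothing below a is explored beyond its three children, and abcd is never counted at all.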

  15. Association Rules: Frequent Pattern Mining. Exploring the search tree
      - Breadth-First Search: find all frequent sets of size $k$ before moving on to size $k + 1$ → A-priori
      - Depth-First Search: find all frequent sets containing element a before moving on to those that contain b but do not contain a
        - Advantage: divide-and-conquer strategy, requires less memory → Eclat, FP-growth ...
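
To contrast with the level-wise search, here is an Eclat-style depth-first sketch. It assumes a vertical data layout (each item mapped to the set of ids of the transactions containing it) and computes the support of an extended itemset by intersecting these id sets. It is a simplified illustration, not a full Eclat implementation; names and toy data are hypothetical.

```python
def eclat(transactions, s_min):
    """Depth-first frequent itemset mining with transaction-id sets."""
    # Vertical layout: item -> ids of the transactions that contain it.
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def recurse(prefix, candidates):
        for i, (item, tids) in enumerate(candidates):
            if len(tids) < s_min:
                continue  # infrequent: prune this branch
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Conditional database: later items restricted to these transactions.
            suffix = [(other, tids & other_tids)
                      for other, other_tids in candidates[i + 1:]]
            recurse(itemset, suffix)

    recurse((), sorted(tidsets.items(), key=lambda kv: kv[0]))
    return frequent


toy = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"b"}]
found = eclat(toy, s_min=2)
print(found)
```

All itemsets containing a are finished before b is visited, and each recursive call only keeps the id sets of its own branch in memory.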

  16. Association Rules: Summary
      - Keyword co-occurrence is a way to mine relationships between concepts from text databases.
      - It is an instance of association rule mining, which tries to find associations between the occurrences of sets of words.
      - The classic algorithm for finding association rules is the Apriori algorithm, which enumerates all frequent itemsets in a branch-and-bound fashion.

  17. 3. Graph Mining based on: Chloé-Agathe Azencott and Karsten Borgwardt. Course 'Data Mining in Bioinformatics'. Chapter Graph Mining. 2012

  18. Graphs are everywhere
      - Coexpression network
      - Social network
      - Program flow
      - Protein structure
      - Chemical compound

  19. Mining graph data
      - Graph comparison
        - Example: Compare PPIN between species
      - Graph classification / regression
        - Predict properties of objects represented as graphs
        - Example: Predict toxicity of molecular compound, functionality of protein
      - Graph nodes classification / regression
        - Predict properties of objects connected on a graph
        - Example: Predict functionality of protein, classify pixels in remote sensing images

  20. Mining graph data
      - Graph compression
        - Representing graphs compactly
        - Example: Store and mine web data
      - Graph clustering
        - Finding dense subnetworks of graphs
        - Example: Find groups in social networks
      - Link prediction
        - Predicting relationships between nodes of the graph
        - Example: Predict who should be added to your social network, predict interactions

  21. Graph pattern mining
      - Graph pattern mining
        - Find frequent / informative graph patterns
        - Summarize patterns
        - Approximate patterns
      - Applications
        - Finding biological conserved subnetworks
        - Finding functional modules
        - Program control flow analysis
        - Intrusion detection
        - Building blocks for graph classification, clustering, compression, comparison

  22. 3.1 Frequent Subgraph Mining
