2.5 Association Rule Mining
Based on: Chloé-Agathe Azencott and Karsten Borgwardt, course 'Data Mining in Bioinformatics', chapter Graph Mining, 2012.
Department Biosysteme, Karsten Borgwardt, Data Mining 2 Course, Basel, Spring Semester 2016

Keyword co-occurrence: Goals
To understand the link between keyword co-occurrence, association rule mining, and frequent itemset mining.
To understand how the computation of frequent itemsets can be sped up.

Keyword co-occurrence: Problem
Find sets of keywords that often co-occur.
This is a common problem in biomedical literature: find associations between genes, proteins, or other entities using co-occurrence search.
Keyword co-occurrence search is an instance of a more general problem in data mining, called association rule mining.

Association Rules: Definitions
Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of items (keywords).
Let $D$ be the database of transactions $T$ (a collection of documents).
A transaction $T \in D$ is a set of items: $T \subseteq I$ (a document is a set of keywords).
Let $A$ be a set of items: $A \subseteq T$.
An association rule is an implication of the form
$$A \subseteq T \Rightarrow B \subseteq T, \tag{1}$$
where $A, B \subseteq I$ and $A \cap B = \emptyset$.

Association Rules: Support and Confidence
The rule $A \Rightarrow B$ holds in the transaction set $D$ with support $s$, where $s$ is the percentage of transactions in $D$ that contain $A \cup B$:
$$\mathrm{support}(A \Rightarrow B) = \frac{|\{T \in D \mid A \subseteq T \wedge B \subseteq T\}|}{|\{T \in D\}|} \tag{2}$$
The rule $A \Rightarrow B$ has confidence $c$ in the transaction set $D$, where $c$ is the percentage of transactions in $D$ containing $A$ that also contain $B$:
$$\mathrm{confidence}(A \Rightarrow B) = \frac{|\{T \in D \mid A \subseteq T \wedge B \subseteq T\}|}{|\{T \in D \mid A \subseteq T\}|} \tag{3}$$
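Equations (2) and (3) can be checked on a small example. The following is a minimal sketch, assuming transactions are stored as Python frozensets; the function names and the toy keyword corpus are hypothetical and not part of the course material.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(A: Iterable[str], B: Iterable[str], D: List[Transaction]) -> float:
    """Equation (2): fraction of transactions that contain all of A and all of B."""
    both = set(A) | set(B)
    return sum(1 for T in D if both <= T) / len(D)


def confidence(A: Iterable[str], B: Iterable[str], D: List[Transaction]) -> float:
    """Equation (3): among transactions containing A, the fraction that also contain B."""
    A = set(A)
    both = A | set(B)
    containing_A = [T for T in D if A <= T]
    if not containing_A:
        return 0.0  # convention chosen here for rules whose antecedent never occurs
    return sum(1 for T in containing_A if both <= T) / len(containing_A)


# Hypothetical toy corpus: each document is reduced to its set of keywords.
D = [
    frozenset({"BRCA1", "cancer", "repair"}),
    frozenset({"BRCA1", "cancer"}),
    frozenset({"TP53", "cancer"}),
    frozenset({"BRCA1", "repair"}),
]

s = support({"BRCA1"}, {"cancer"}, D)     # 2 of 4 documents contain both keywords
c = confidence({"BRCA1"}, {"cancer"}, D)  # 2 of the 3 documents with BRCA1 mention cancer
```

Both functions return fractions in [0, 1]; multiply by 100 to obtain the percentages used on the slide.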

Association Rules: Strong rules
Rules that satisfy both a minimum support threshold (minsup) and a minimum confidence threshold (minconf) are called strong association rules. These are the rules we are after.
Finding strong rules
1. Search for all frequent itemsets (sets of items that occur together in at least minsup % of all transactions).
2. Generate strong association rules from the frequent itemsets.
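The two steps can be sketched as follows. This is an illustration under stated assumptions, not the course's implementation: step 1 is done by brute force (acceptable only for toy data), thresholds are given as fractions rather than percentages, and all names and the toy transactions are hypothetical.

```python
from itertools import combinations


def support_counts(D, minsup):
    """Step 1 (brute force): map each frequent itemset to its support count."""
    items = sorted(set().union(*D))
    counts = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            n = sum(1 for T in D if itemset <= T)
            if n / len(D) >= minsup:
                counts[itemset] = n
    return counts


def strong_rules(D, minsup, minconf):
    """Step 2: split every frequent itemset into A => B and keep confident rules."""
    counts = support_counts(D, minsup)
    rules = []
    for itemset, n in counts.items():
        for k in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), k):
                A = frozenset(combo)
                # A is a subset of a frequent itemset, so it is frequent as well
                # and counts[A] is guaranteed to exist.
                conf = n / counts[A]
                if conf >= minconf:
                    rules.append((A, itemset - A, n / len(D), conf))
    return rules


# Hypothetical toy transactions.
D = [frozenset(t) for t in ("ab", "abc", "ac", "bc", "ab")]
rules = strong_rules(D, minsup=0.5, minconf=0.7)
```

On this toy data the only frequent itemset with more than one item is {a, b}, so the sketch yields the two rules a ⇒ b and b ⇒ a, each with support 3/5 and confidence 3/4.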

Association Rules: Frequent pattern mining
Frequent item set mining
Market basket analysis: find items that are frequently purchased together.
Given
a set $B = \{i_1, i_2, \ldots, i_n\}$ of items,
a list $T = \{t_1, t_2, \ldots, t_m\}$ of transactions $t_j \subseteq B$,
a minimum number of occurrences $s_{\min} \in \mathbb{N}$,
find the set of frequent item sets, i.e.
$$F(s_{\min}) = \{\, I \subseteq B : |\{k : I \subseteq t_k\}| \geq s_{\min} \,\}$$

Association Rules: Apriori [Agrawal et al., 1994]
Brute force approach
Enumerate all $2^n$ subsets of $B$.
Count how many of the transactions $t_1, \ldots, t_m$ contain each of them.
Generally infeasible.
The Apriori property
If an itemset $A$ is frequent, then any subset $B$ of $A$ ($B \subseteq A$) is frequent as well.
If $B$ is infrequent, then any superset $A$ of $B$ ($A \supseteq B$) is infrequent as well.
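To make the brute force approach and the Apriori property concrete, here is a small sketch that counts every non-empty subset and then verifies that the frequent sets are closed under taking subsets. The function names and the toy transactions are assumptions made for illustration.

```python
from itertools import combinations


def brute_force_counts(items, transactions):
    """Support count of every non-empty subset of `items` (2^n - 1 subsets)."""
    counts = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(sorted(items), k):
            itemset = frozenset(combo)
            counts[itemset] = sum(1 for t in transactions if itemset <= t)
    return counts


def downward_closed(counts, s_min):
    """Check the Apriori property: each non-empty proper subset of a
    frequent itemset is itself frequent."""
    frequent = {s for s, n in counts.items() if n >= s_min}
    return all(
        frozenset(sub) in frequent
        for s in frequent
        for k in range(1, len(s))
        for sub in combinations(s, k)
    )


# Hypothetical toy data over the item base {a, b, c, d}.
transactions = [frozenset(t) for t in ("ab", "abc", "bc", "bd", "acd")]
counts = brute_force_counts("abcd", transactions)
```

The dictionary already has 15 entries for four items; the count doubles with every additional item, which is why the enumeration is generally infeasible.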

Association Rules: Apriori pseudocode
1. Determine the frequent items, i.e. the frequent k-itemsets with k = 1.
2. Join all pairs of frequent k-itemsets that differ in at most one item; the results are the candidates $C_{k+1}$ for being frequent (k+1)-itemsets.
3. Check the frequency of the candidates in $C_{k+1}$: the frequent ones form the frequent (k+1)-itemsets. (Trick: immediately discard any candidate that contains an infrequent k-itemset.)
4. Repeat from Step 2 until no candidate is frequent.
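The four steps above can be sketched in a few lines of Python. This is a plain illustration under simplifying assumptions (support is recounted by scanning all transactions, and the threshold is an absolute count); the function name and the toy data are hypothetical.

```python
from itertools import combinations


def apriori(transactions, s_min):
    """Return {itemset: support count} for every itemset with count >= s_min."""
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def keep_frequent(candidates):
        counted = {c: count(c) for c in candidates}
        return {c: n for c, n in counted.items() if n >= s_min}

    # Step 1: frequent 1-itemsets.
    items = set().union(*transactions)
    level = keep_frequent({frozenset([i]) for i in items})
    frequent = dict(level)
    k = 1
    while level:
        prev = set(level)
        # Step 2: join pairs of frequent k-itemsets whose union has k + 1 items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Trick: discard candidates that contain an infrequent k-itemset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k))
        }
        # Step 3: count the surviving candidates and keep the frequent ones.
        level = keep_frequent(candidates)
        frequent.update(level)
        # Step 4: repeat with the next size until nothing is frequent.
        k += 1
    return frequent


# Hypothetical toy data; each string is one transaction of single-letter items.
result = apriori(["abc", "abd", "ab", "bc", "ac", "abc"], s_min=3)
```

In this example the candidate {a, b, c} passes the subset check, since all of its 2-item subsets are frequent, but it occurs in only two transactions and is therefore rejected in Step 3.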

Association Rules: Apriori
Generating unique candidates
There are $k!$ ways of generating a single set of $k$ items.
Ensure we do it only once ⇒ idea: assign a unique parent set to each set.
Canonical form
The set of possible parents of an item set $I$ is the set of its maximal proper subsets: $\{J \subset I \mid \nexists K : J \subset K \subset I\}$.
Put an ordering on $B$: $i_1 < i_2 < \cdots < i_n$.
Define the canonical parent of $I$ as $p_c(I) = I \setminus \{\max_{a \in I} a\}$.

Association Rules: Apriori
Canonical code words
Code word for $I \subseteq B$: any word $w$ over the alphabet $B$ that lists the items of $I$, each exactly once.
Canonical code word $w_c(I)$ of $I$: the smallest of these words in lexicographic order, e.g. $\{a, c, b, e\} \to abce$.
The canonical parent $p_c(I)$ of $I$ is described by the longest proper prefix of $w_c(I)$.
Prefix property: the longest proper prefix of a canonical code word is a canonical code word itself. Equivalently, any prefix of a canonical code word is a canonical code word itself.
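A short sketch of canonical code words and canonical parents, assuming single-character items and the usual alphabetical order as the ordering on B. The helper names are hypothetical.

```python
from itertools import permutations


def code_words(itemset):
    """All code words of an item set: one per ordering of its items (k! words)."""
    return {"".join(p) for p in permutations(itemset)}


def canonical_code_word(itemset):
    """Items in ascending order, i.e. the lexicographically smallest code word."""
    return "".join(sorted(itemset))


def canonical_parent(itemset):
    """The item set without its largest item."""
    return frozenset(itemset) - {max(itemset)}


I = frozenset({"a", "c", "b", "e"})
word = canonical_code_word(I)                           # "abce"
parent_word = canonical_code_word(canonical_parent(I))  # "abc"
```

Here `parent_word` equals `word[:-1]`, the longest proper prefix of the canonical code word, which is the relationship stated on the slide.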

Association Rules: Apriori
Candidate set generation
From frequent item sets of size $k - 1$, construct item sets of size $k$ by appending (frequent) items to their canonical code words.
Only do so for items greater than the last letter of the canonical code word.
Example: abe → abef, abeg, but not abec (c comes before e).
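The extension rule can be sketched as below, assuming single-character items in alphabetical order. The function names are hypothetical, and `all_code_words` deliberately ignores frequency so that it enumerates the full search space.

```python
def extend(code_word, items):
    """Append every item that is greater than the last letter of the code word."""
    last = code_word[-1]
    return [code_word + i for i in sorted(items) if i > last]


def all_code_words(items):
    """Enumerate all canonical code words level by level, without any pruning."""
    level = sorted(items)
    words = []
    while level:
        words.extend(level)
        level = [w for word in level for w in extend(word, items)]
    return words


children = extend("abe", "abcefg")  # only f and g are greater than e
words = all_code_words("abcd")
```

Because each word is only extended with larger items, every item set is produced exactly once: `words` contains 15 distinct entries for four items, one per non-empty subset.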

Association Rules: Apriori
Prefix tree
Figure: full prefix tree for $B = \{a, b, c, d\}$.
Level 1: a, b, c, d
Level 2: ab, ac, ad, bc, bd, cd
Level 3: abc, abd, acd, bcd
Level 4: abcd

Association Rules: Apriori
Pruning the prefix tree
Only generate unique item sets.
Apriori property ⇒ prune branches at infrequent item sets.
Size-based pruning.
Example transaction list:
T = {{a, b}, {a, b, c}, {b, c}, {b}, {b, d}, {d}, {a, c}, {b, c}, {d}, {a, c}, {b, c}, {b, c, d}, {d}, {b}, {b, c, d}, {b, c, d}}
Figure: prefix tree annotated with the support count of each item set in T.
Size 1: a 4, b 11, c 9, d 7
Size 2: ab 2, ac 3, ad 0, bc 7, bd 4, cd 3
Size 3: abc 1, abd 0, acd 0, bcd 3
Size 4: abcd 0
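The pruning can be replayed on the transaction list from the slide. The sketch below walks the prefix tree breadth-first and expands a node only if it is frequent; the threshold `s_min = 3` is an assumption (the slide does not state one), and the function names are hypothetical.

```python
# The 16 transactions from the slide, written as strings of single-letter items.
T = [frozenset(t) for t in (
    "ab", "abc", "bc", "b", "bd", "d", "ac", "bc",
    "d", "ac", "bc", "bcd", "d", "b", "bcd", "bcd",
)]


def count(word):
    """Number of transactions that contain every item of the code word."""
    items = frozenset(word)
    return sum(1 for t in T if items <= t)


def pruned_prefix_tree(items, s_min):
    """Breadth-first walk of the prefix tree; infrequent nodes are not expanded."""
    visited = {}
    level = sorted(items)
    while level:
        next_level = []
        for word in level:
            visited[word] = count(word)
            if visited[word] >= s_min:
                # Children: append only items greater than the last letter.
                next_level += [word + i for i in sorted(items) if i > word[-1]]
        level = next_level
    return visited


visited = pruned_prefix_tree("abcd", s_min=3)  # s_min = 3 is an assumed threshold
```

With this threshold the branch below ab is cut, so abc, abd and abcd are never counted. Note that this sketch prunes by prefix only: acd is still counted because its parent ac is frequent, whereas the full Apriori subset check would also discard it on account of ad.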

Association Rules: Frequent Pattern Mining
Exploring the search tree
Breadth-first search: find all frequent sets of size $k$ before moving on to size $k + 1$ → Apriori.
Depth-first search: find all frequent sets containing element a before moving on to those that contain b but do not contain a. Advantage: divide-and-conquer strategy, requires less memory → Eclat, FP-growth, ...
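For contrast with the breadth-first strategy, here is a depth-first sketch in the style of Eclat, which represents each item set by the set of transaction ids that contain it and intersects these sets while descending. It is a simplified illustration, not a faithful reproduction of the published algorithm; the function name and toy data are hypothetical.

```python
def eclat(transactions, s_min):
    """Depth-first frequent item set mining on transaction-id sets."""
    # Vertical layout: item -> ids of the transactions that contain it.
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def recurse(prefix, candidates):
        # candidates: (item, tidset of prefix + item) pairs, items ascending.
        for idx, (item, tids) in enumerate(candidates):
            if len(tids) < s_min:
                continue  # infrequent: the whole branch below is skipped
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Extend only with later items, intersecting the tidsets.
            extensions = [
                (other, tids & other_tids)
                for other, other_tids in candidates[idx + 1:]
            ]
            recurse(itemset, extensions)

    recurse((), sorted(tidsets.items(), key=lambda pair: pair[0]))
    return frequent


# Hypothetical toy data; each string is one transaction of single-letter items.
found = eclat(["abc", "abd", "ab", "bc", "ac", "abc"], s_min=3)
order = list(found)
```

The insertion order of `found` shows the depth-first traversal: all frequent sets starting with a are reported before any set that starts with b. Only the tidsets along the current path are held in memory at any time.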

Association Rules: Summary
Keyword co-occurrence is a way to mine relationships between concepts from text databases.
It is an instance of association rule mining, which tries to find associations between the occurrences of sets of words.
The classic algorithm for finding association rules is the Apriori algorithm, which enumerates all frequent itemsets in a branch-and-bound fashion.

3. Graph Mining
Based on: Chloé-Agathe Azencott and Karsten Borgwardt, course 'Data Mining in Bioinformatics', chapter Graph Mining, 2012.

Graphs are everywhere
Examples: coexpression network, social network, program flow, protein structure, chemical compound.

Mining graph data
Graph comparison. Example: compare protein-protein interaction networks (PPIN) between species.
Graph classification / regression: predict properties of objects represented as graphs. Example: predict the toxicity of a molecular compound or the functionality of a protein.
Graph node classification / regression: predict properties of objects connected on a graph. Example: predict the functionality of a protein, classify pixels in remote sensing images.

Mining graph data
Graph compression: representing graphs compactly. Example: store and mine web data.
Graph clustering: finding dense subnetworks of graphs. Example: find groups in social networks.
Link prediction: predicting relationships between nodes of the graph. Example: predict who should be added to your social network, predict interactions.

Graph pattern mining
Find frequent / informative graph patterns.
Summarize patterns.
Approximate patterns.
Applications
Finding biologically conserved subnetworks.
Finding functional modules.
Program control flow analysis.
Intrusion detection.
Building blocks for graph classification, clustering, compression, and comparison.

3.1 Frequent Subgraph Mining
