Administrative notes
• March 14: Midterm 2. It will cover all lectures, labs and readings between Tue Jan 31 and Thu Mar 9 inclusive
• Practice Midterm 2 is on the Exercises webpage: http://www.ugrad.cs.ubc.ca/~cs100/2016W2/exercises.html#exams
• March 17: In the News call #3
• March 30: Project deliverables and individual report due
Computational Thinking ct.cs.ubc.ca
Administrative notes
• Check “Project Rubric” on the Connect grade centre to learn which rubric we will use to grade your project. Find your rubric at http://www.ugrad.cs.ubc.ca/~cs100/2016W2/project-grading.html#projectMarkingScheme. If you have questions, please email your project TA (also listed on Connect).
• We will email you which projects you should review. Please ensure that email forwarding for your CS email (CS_ID@ugrad.cs.ubc.ca) works (you should have set this up in Lab 0).
Data Mining 4
Mining by Association: Apriori algorithm wrap-up
Recall: How to predict the future? Association rules
• An association rule X → Y suggests that people who buy the items in set X are also likely to want the items in set Y
• Valid association rules are “mined” from training data, e.g. store purchases
• Association rules are useful to stores, and also in areas such as medical diagnosis, protein sequence composition, health insurance claim analysis and census data
When is an association rule valid?
We are given two thresholds:
• Support threshold
• Confidence threshold
A rule X → Y is valid with respect to these thresholds if
• The support of X ∪ Y is at least the support threshold
• The confidence of X → Y is at least the confidence threshold
Support: The degree to which items appear together
The support of a set of items is the fraction of transactions that contain all items in the set.
T1  Sushi, Chicken, Milk
T2  Sushi, Bread
T3  Bread, Vegetables
T4  Sushi, Chicken, Bread
T5  Sushi, Chicken, Ramen, Bread, Milk
T6  Chicken, Ramen, Milk
T7  Chicken, Milk, Ramen
Here, the set {Chicken, Ramen, Milk} has support 3/7 (it is contained in T5, T6 and T7).
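The definition of support can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the names `TRANSACTIONS` and `support` are chosen here for illustration, and the data is copied from the table above.

```python
from fractions import Fraction

# Transactions T1..T7 from the slide, each stored as a set of items.
TRANSACTIONS = [
    {"Sushi", "Chicken", "Milk"},                    # T1
    {"Sushi", "Bread"},                              # T2
    {"Bread", "Vegetables"},                         # T3
    {"Sushi", "Chicken", "Bread"},                   # T4
    {"Sushi", "Chicken", "Ramen", "Bread", "Milk"},  # T5
    {"Chicken", "Ramen", "Milk"},                    # T6
    {"Chicken", "Milk", "Ramen"},                    # T7
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    # "itemset <= t" is Python's subset test.
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


print(support({"Chicken", "Ramen", "Milk"}, TRANSACTIONS))  # 3/7
```

Using `Fraction` keeps the result exact, so it can be compared directly against a threshold such as 3/7.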
Confidence: Cause → Effect
The confidence of rule X → Y is the fraction of transactions containing all items in X that also contain all items in Y.
T1  Sushi, Chicken, Milk
T2  Sushi, Bread
T3  Bread, Vegetables
T4  Sushi, Chicken, Bread
T5  Sushi, Chicken, Ramen, Bread, Milk
T6  Chicken, Ramen, Milk
T7  Chicken, Milk, Ramen
The following rules both have confidence 3/3 = 1:
• Ramen → {Milk, Chicken}
• {Ramen, Chicken} → Milk
In both cases the left-hand side appears in exactly T5, T6 and T7, and each of those transactions also contains the right-hand side.
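Confidence can be computed the same way as support: first keep only the transactions that contain X, then count how many of those also contain Y. This is a small illustrative sketch; the function name `confidence` and the choice to raise an error when no transaction contains X are assumptions made here, not something the slides specify.

```python
from fractions import Fraction

# Transactions T1..T7 from the slide.
TRANSACTIONS = [
    {"Sushi", "Chicken", "Milk"},
    {"Sushi", "Bread"},
    {"Bread", "Vegetables"},
    {"Sushi", "Chicken", "Bread"},
    {"Sushi", "Chicken", "Ramen", "Bread", "Milk"},
    {"Chicken", "Ramen", "Milk"},
    {"Chicken", "Milk", "Ramen"},
]


def confidence(x, y, transactions):
    """Of the transactions containing all of x, the fraction that also contain all of y."""
    x, y = set(x), set(y)
    with_x = [t for t in transactions if x <= t]
    if not with_x:
        # Assumption: treat confidence as undefined when X never occurs.
        raise ValueError("no transaction contains X, so confidence is undefined")
    with_both = [t for t in with_x if y <= t]
    return Fraction(len(with_both), len(with_x))


print(confidence({"Ramen"}, {"Milk", "Chicken"}, TRANSACTIONS))  # 1
print(confidence({"Ramen", "Chicken"}, {"Milk"}, TRANSACTIONS))  # 1
```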
Exercise: Which rules X → Y are valid?
Thresholds: support is 3/7, confidence is 1
• Is the support of X ∪ Y at least 3/7? (support: fraction of transactions that contain X ∪ Y)
• Is the confidence of X → Y at least 1? (confidence: fraction of transactions containing X that also contain Y)
T1  Sushi, Chicken, Milk
T2  Sushi, Bread
T3  Bread, Vegetables
T4  Sushi, Chicken, Bread
T5  Sushi, Chicken, Ramen, Bread, Milk
T6  Chicken, Ramen, Milk
T7  Chicken, Milk, Ramen
A. Chicken → Milk
B. Ramen → Milk
C. Both
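After working the exercise by hand, the two questions above can be combined into a single check. The sketch below is one possible way to do that; `is_valid` is a hypothetical helper name, and treating a rule whose left-hand side never occurs as invalid is an assumption. Running it prints whether each candidate rule passes both thresholds.

```python
from fractions import Fraction

# Transactions T1..T7 from the slide.
TRANSACTIONS = [
    {"Sushi", "Chicken", "Milk"},
    {"Sushi", "Bread"},
    {"Bread", "Vegetables"},
    {"Sushi", "Chicken", "Bread"},
    {"Sushi", "Chicken", "Ramen", "Bread", "Milk"},
    {"Chicken", "Ramen", "Milk"},
    {"Chicken", "Milk", "Ramen"},
]


def is_valid(x, y, transactions, min_support, min_confidence):
    """True if rule x -> y meets both the support and the confidence threshold."""
    x, y = set(x), set(y)
    n_x = sum(1 for t in transactions if x <= t)         # transactions containing X
    n_xy = sum(1 for t in transactions if (x | y) <= t)  # transactions containing X ∪ Y
    if n_x == 0:
        return False  # assumption: a rule whose left side never occurs is not valid
    support_xy = Fraction(n_xy, len(transactions))
    confidence_xy = Fraction(n_xy, n_x)
    return support_xy >= min_support and confidence_xy >= min_confidence


for x, y in [({"Chicken"}, {"Milk"}), ({"Ramen"}, {"Milk"})]:
    verdict = is_valid(x, y, TRANSACTIONS, Fraction(3, 7), Fraction(1))
    print(sorted(x), "->", sorted(y), "valid:", verdict)
```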
The association rule data mining problem
• Input: a table of transactions, a support threshold and a confidence threshold
• Output: all of the valid association rules
The Apriori algorithm for finding valid association rules
The Apriori algorithm has two main tasks:
• Find all frequent itemsets, i.e., those with support at least the given support threshold
• Find all rules X → Y with confidence at least the given confidence threshold
Calculating association rules on terabytes of data can be sloooowww. The slowest part is finding the frequent itemsets. Let’s get back to these.
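To make the two tasks concrete, here is a deliberately naive Python sketch of both. It is not the Apriori algorithm: task 1 simply tries every possible itemset, which is exactly the slow step that Apriori improves on. The function names are chosen here for illustration, the data is the seven-transaction example used earlier, and the code assumes the support threshold is greater than zero.

```python
from fractions import Fraction
from itertools import combinations

# Transactions T1..T7 from the earlier slides.
TRANSACTIONS = [
    {"Sushi", "Chicken", "Milk"},
    {"Sushi", "Bread"},
    {"Bread", "Vegetables"},
    {"Sushi", "Chicken", "Bread"},
    {"Sushi", "Chicken", "Ramen", "Bread", "Milk"},
    {"Chicken", "Ramen", "Milk"},
    {"Chicken", "Milk", "Ramen"},
]


def support(itemset, transactions):
    itemset = set(itemset)
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))


def frequent_itemsets_brute_force(transactions, min_support):
    """Task 1, done naively: test every non-empty set of items."""
    items = sorted(set().union(*transactions))
    frequent = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            if support(combo, transactions) >= min_support:
                frequent.append(frozenset(combo))
    return frequent


def valid_rules(transactions, min_support, min_confidence):
    """Task 2: split each frequent itemset into X -> Y and keep the confident rules."""
    rules = []
    for itemset in frequent_itemsets_brute_force(transactions, min_support):
        for size in range(1, len(itemset)):
            for x in combinations(sorted(itemset), size):
                x = frozenset(x)
                y = itemset - x
                # confidence(X -> Y) = support(X ∪ Y) / support(X); support(X) > 0
                # because X is a subset of a frequent itemset and min_support > 0.
                conf = support(itemset, transactions) / support(x, transactions)
                if conf >= min_confidence:
                    rules.append((x, y))
    return rules


for x, y in valid_rules(TRANSACTIONS, Fraction(3, 7), Fraction(1)):
    print(sorted(x), "->", sorted(y))
```

With n distinct items the brute-force step examines 2^n - 1 itemsets, which is why the remaining slides focus on finding frequent itemsets more cleverly.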
A frequent itemset: a set whose support is at least some specified threshold
Example: Let the support threshold be 3/7
T1  Sushi, Chicken, Milk
T2  Sushi, Bread
T3  Bread, Vegetables
T4  Sushi, Chicken, Bread
T5  Sushi, Chicken, Ramen, Bread, Milk
T6  Chicken, Ramen, Milk
T7  Chicken, Milk, Ramen
{Chicken, Milk, Ramen} is a frequent itemset
The Apriori algorithm key idea
• The Apriori algorithm speeds up the task of finding frequent itemsets, based on the observation that each subset of a frequent itemset must also be a frequent itemset (every transaction that contains an itemset also contains each of its subsets, so a subset's support is at least as large)
• Let’s see how this is done
A frequent itemset: a set whose support is at least some specified threshold
Support threshold: 3/7
Claim: Each subset of a frequent itemset is also a frequent itemset
T1  Sushi, Chicken, Milk
T2  Sushi, Bread
T3  Bread, Vegetables
T4  Sushi, Chicken, Bread
T5  Sushi, Chicken, Ramen, Bread, Milk
T6  Chicken, Ramen, Milk
T7  Chicken, Milk, Ramen
{Chicken, Milk, Ramen} is a frequent itemset, and so {Chicken, Milk}, {Chicken, Ramen} and {Milk, Ramen} must also be frequent itemsets
A frequent itemset: a set whose support is at least some specified threshold
Support threshold: 3/7
Claim: Each subset of a frequent itemset is also a frequent itemset
T1  Sushi, Chicken, Milk
T2  Sushi, Bread
T3  Bread, Vegetables
T4  Sushi, Chicken, Bread
T5  Sushi, Chicken, Ramen, Bread, Milk
T6  Chicken, Ramen, Milk
T7  Chicken, Milk, Ramen
Equivalently, if a set is not frequent, then no set containing it can be frequent. {Vegetables} is not a frequent itemset (its support is 1/7), so any set containing Vegetables cannot be a frequent itemset. For example, {Sushi, Vegetables} is not frequent.
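Both directions of the claim can be spot-checked on the example data. The sketch below is illustrative only; it checks one frequent itemset and one infrequent item rather than proving the claim in general, and the helper names are chosen here.

```python
from fractions import Fraction
from itertools import combinations

# Transactions T1..T7 from the slide.
TRANSACTIONS = [
    {"Sushi", "Chicken", "Milk"},
    {"Sushi", "Bread"},
    {"Bread", "Vegetables"},
    {"Sushi", "Chicken", "Bread"},
    {"Sushi", "Chicken", "Ramen", "Bread", "Milk"},
    {"Chicken", "Ramen", "Milk"},
    {"Chicken", "Milk", "Ramen"},
]
MIN_SUPPORT = Fraction(3, 7)


def support(itemset, transactions):
    itemset = set(itemset)
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))


# Every non-empty subset of a frequent itemset meets the threshold.
frequent = {"Chicken", "Milk", "Ramen"}
subsets = [
    set(combo)
    for size in range(1, len(frequent) + 1)
    for combo in combinations(sorted(frequent), size)
]
all_subsets_frequent = all(support(s, TRANSACTIONS) >= MIN_SUPPORT for s in subsets)
print(all_subsets_frequent)  # True

# An infrequent item makes every set that contains it infrequent:
# adding items can only shrink the set of matching transactions.
print(support({"Vegetables"}, TRANSACTIONS))           # 1/7
print(support({"Sushi", "Vegetables"}, TRANSACTIONS))  # 0
```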
The Apriori algorithm: Finding frequent itemsets
Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna
Support threshold: 50%
We’ll work through the algorithm to determine the frequent itemsets for this input
Apriori round 1: Find all frequent itemsets of size 1
Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna
Support threshold: 50%
List candidate itemsets of size 1:
{apple}
{corn}
{dates}
{rice}
{tuna}
Apriori round 1: Find all frequent itemsets of size 1
Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna
Support threshold: 50%
Calculate the support of each candidate itemset:
{apple} = 2/4
{corn}
{dates}
{rice}
{tuna}
What is the support for corn?
a. 1/4
b. 2/4
c. 3/4
d. 4/4
Apriori round 1: Find all frequent itemsets of size 1
Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna
Support threshold: 50%
Calculate the support of each candidate itemset:
{apple} = 2/4
{corn} = 4/4
{dates} = 3/4
{rice} = 1/4
{tuna} = 3/4
Apriori round 1: Find all frequent itemsets of size 1
Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna
Support threshold: 50%
Calculate the support of each candidate itemset:
{apple} = 2/4
{corn} = 4/4
{dates} = 3/4
{rice} = 1/4
{tuna} = 3/4
Can any itemset containing rice ever be a frequent itemset, when the support threshold is 50%?
A. Yes
B. No
Apriori round 1: Find all frequent itemsets of size 1
Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna
Support threshold: 50%
Set F1 to be the list of frequent itemsets of size 1:
{apple} = 2/4
{corn} = 4/4
{dates} = 3/4
{tuna} = 3/4
{rice} = 1/4 is below the 50% threshold, so {rice} is not in F1.
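Round 1 amounts to counting each single item and keeping those that meet the threshold. A minimal Python sketch of this round follows; the variable names (`supports`, `F1`) mirror the slide but the code itself is an illustration, not the course's implementation. Note that `Fraction` reduces 2/4 to 1/2 when printing.

```python
from fractions import Fraction

# Transactions T1..T4 from the slide.
TRANSACTIONS = [
    {"apple", "dates", "rice", "corn"},
    {"corn", "dates", "tuna"},
    {"apple", "corn", "dates", "tuna"},
    {"corn", "tuna"},
]
MIN_SUPPORT = Fraction(1, 2)  # support threshold 50%


def support(itemset, transactions):
    itemset = set(itemset)
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))


# Candidates of size 1: every item that appears in some transaction.
items = sorted(set().union(*TRANSACTIONS))
supports = {item: support({item}, TRANSACTIONS) for item in items}

# F1 keeps only the items whose support meets the threshold.
F1 = [item for item in items if supports[item] >= MIN_SUPPORT]

for item in items:
    print(item, supports[item])
print("F1 =", F1)  # F1 = ['apple', 'corn', 'dates', 'tuna']
```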
Apriori round 2: Find all frequent itemsets of size 2
Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna
Support threshold: 50%
List candidate itemsets of size 2:
{apple, corn}
{apple, dates}
{apple, tuna}
{corn, dates}
{corn, tuna}
{dates, tuna}
Because {rice} is not frequent, any set that includes rice is not frequent, so we ignore itemsets that include rice.
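The candidate list above is every pair of items taken from F1. A short sketch of that step, assuming F1 is the list of frequent single items from round 1 (this covers only size 2; how candidates are built in later rounds is not shown here):

```python
from itertools import combinations

# Frequent single items from round 1 (rice was dropped: support 1/4 < 50%).
F1 = ["apple", "corn", "dates", "tuna"]

# Size-2 candidates: all pairs of frequent items. Pairs containing rice are
# never generated, because rice is not in F1.
candidates = [set(pair) for pair in combinations(F1, 2)]

for c in candidates:
    print(sorted(c))
```

This produces 6 candidates instead of the 10 pairs that all 5 items would give, so four support counts are avoided.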
Apriori round 2: Find all frequent itemsets of size 2
Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna
Support threshold: 50%
Calculate the support of each candidate itemset:
{apple, corn}
{apple, dates}
{apple, tuna}
{corn, dates}
{corn, tuna}
{dates, tuna}
Group exercise: count support for these itemsets.
Apriori round 2: Find all frequent itemsets of size 2
Transaction  Items
T1           apple, dates, rice, corn
T2           corn, dates, tuna
T3           apple, corn, dates, tuna
T4           corn, tuna
Support threshold: 50%
Calculate the support of each candidate itemset:
{apple, corn} = 2/4
{apple, dates} = 2/4
{apple, tuna} = 1/4
{corn, dates} = 3/4
{corn, tuna} = 3/4
{dates, tuna} = 2/4
Group exercise: count support for these itemsets.
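The supports above can be reproduced with a few lines of Python, and applying the 50% threshold then gives the frequent itemsets of size 2. The sketch below is illustrative; the name `F2` is used here by analogy with F1 and is an assumption, as is the filtering step, which follows from the definition of a frequent itemset.

```python
from fractions import Fraction

# Transactions T1..T4 from the slide.
TRANSACTIONS = [
    {"apple", "dates", "rice", "corn"},
    {"corn", "dates", "tuna"},
    {"apple", "corn", "dates", "tuna"},
    {"corn", "tuna"},
]
MIN_SUPPORT = Fraction(1, 2)  # support threshold 50%

# Size-2 candidates listed on the slide.
CANDIDATES = [
    ("apple", "corn"), ("apple", "dates"), ("apple", "tuna"),
    ("corn", "dates"), ("corn", "tuna"), ("dates", "tuna"),
]


def support(itemset, transactions):
    itemset = set(itemset)
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))


supports = {pair: support(pair, TRANSACTIONS) for pair in CANDIDATES}

# Keep the pairs that meet the threshold (F2 by analogy with F1).
F2 = [pair for pair in CANDIDATES if supports[pair] >= MIN_SUPPORT]

for pair in CANDIDATES:
    print(pair, supports[pair])
print("F2 =", F2)
```

Only {apple, tuna}, with support 1/4, falls below the threshold.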