EE226 Big Data Mining Lecture 3 Basic Data Mining Algorithms Liyao Xiang http://xiangliyao.cn/ Shanghai Jiao Tong University http://jhc.sjtu.edu.cn/public/courses/EE226/
Notice • There will be a quiz in next week’s class. Please bring a piece of paper and a pen.
Reference and Acknowledgement • Most of the slides are credited to Prof. Jiawei Han’s book “Data Mining: Concepts and Techniques.”
Outline • Basic Concepts in Frequent Pattern Mining • Frequent Itemset Mining Methods • Pattern Evaluation Methods
Outline • Basic Concepts in Frequent Pattern Mining • Frequent Itemset Mining Methods • Pattern Evaluation Methods
Basic Concepts • Frequent pattern: a pattern (a set of items, subsequences, substructures, …) that appears frequently in a database • Finding frequent patterns is key to mining associations, correlations, and other relationships among data, and supports tasks such as clustering and classification • Applications: basket data analysis, cross-marketing, catalog design, …
Basic Concepts
• itemset: a set of one or more items
• k-itemset: X = {x_1, …, x_k}
• (absolute) support, or support count, of X: the frequency (number of occurrences) of the itemset X
• (relative) support: the fraction of all transactions that contain X
• An itemset X is frequent if X’s support is no less than a given threshold min_sup

TID   Items Purchased
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

[Figure: Venn diagram of customers who got beer, customers who got diaper, and customers who got both]
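A minimal sketch of how support can be computed (Python; the transactions dictionary mirrors the table above, and the helper name support is our own, not from the lecture):

# Transactions from the table above: TID -> set of purchased items
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def support(itemset, transactions):
    # Absolute support: number of transactions containing every item of the itemset
    return sum(1 for items in transactions.values() if itemset <= items)

count = support({"Beer", "Diaper"}, transactions)   # absolute support = 3
rel = count / len(transactions)                     # relative support = 0.6
min_sup = 0.5
print("frequent:", rel >= min_sup)                  # True, since 0.6 >= min_sup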
Basic Concepts
• support: the probability that a transaction contains X ∪ Y
  support(X ⇒ Y) = P(X ∪ Y)
• confidence: the conditional probability that a transaction having X also contains Y
  confidence(X ⇒ Y) = P(Y|X) = support(X ∪ Y) / support(X)
(Same transaction table and Venn diagram as on the previous slide.)
Basic Concepts
• min_sup: minimum support threshold
• min_conf: minimum confidence threshold
• e.g., find all rules X ⇒ Y satisfying min_sup and min_conf:
  let min_sup = 50%, min_conf = 50%
  frequent patterns: Beer: 3, Nuts: 3, Diaper: 4, Eggs: 3, {Beer, Diaper}: 3
• Association rules:
  Beer ⇒ Diaper (support 60%, confidence 100%)
  Diaper ⇒ Beer (support 60%, confidence 75%)
(Same transaction table and Venn diagram as on the previous slides.)
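The two rules above can be checked directly from the definitions (a Python sketch; it reuses the transactions dictionary and the support helper from the earlier sketch, and the function name rule_stats is ours):

def rule_stats(X, Y, transactions):
    # Returns (relative support, confidence) of the rule X => Y
    n = len(transactions)
    sup_xy = support(X | Y, transactions) / n                        # P(X U Y)
    conf = support(X | Y, transactions) / support(X, transactions)   # P(Y|X)
    return sup_xy, conf

min_sup, min_conf = 0.5, 0.5
for X, Y in [({"Beer"}, {"Diaper"}), ({"Diaper"}, {"Beer"})]:
    sup, conf = rule_stats(X, Y, transactions)
    strong = sup >= min_sup and conf >= min_conf
    print(X, "=>", Y, f"support={sup:.0%} confidence={conf:.0%} strong={strong}")
# Beer => Diaper: support 60%, confidence 100%; Diaper => Beer: support 60%, confidence 75%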
Basic Concepts
• Association rule mining consists of two steps:
  1. Find all frequent itemsets: itemsets whose frequency ≥ min_sup
  2. Generate strong association rules from the frequent itemsets
• Step 1 is the major step, but it is challenging: there may be a huge number of itemsets satisfying min_sup
• An itemset is frequent ⇒ each of its subsets is frequent
• Solution: mine closed frequent itemsets and maximal frequent itemsets (see the sketch after this slide)
• closed frequent itemset X: X is frequent and there is no super-itemset Y ⊃ X with the same support count as X
• the closed frequent itemsets are a lossless compression of the frequent itemsets
• maximal frequent itemset X: X is frequent and there is no super-itemset Y ⊃ X that is frequent
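A sketch of these two definitions in Python (it assumes the frequent itemsets have already been found and are given as a dict mapping frozensets to support counts; the function name is ours):

def closed_and_maximal(freq):
    # freq: {frozenset(itemset): absolute support count}, all entries frequent
    closed, maximal = set(), set()
    for X, sup_x in freq.items():
        supersets = [Y for Y in freq if X < Y]          # proper frequent super-itemsets of X
        if all(freq[Y] < sup_x for Y in supersets):     # no superset with the same support
            closed.add(X)
        if not supersets:                               # no frequent superset at all
            maximal.add(X)
    return closed, maximal

# toy check: {a}:3, {a,b}:3, {a,b,c}:2  ->  closed = {{a,b}, {a,b,c}}, maximal = {{a,b,c}}
print(closed_and_maximal({frozenset("a"): 3, frozenset("ab"): 3, frozenset("abc"): 2}))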
Basic Concepts
• e.g., database {<a_1, …, a_100>, <a_1, …, a_50>}, min_sup = 1
• What is the set of closed frequent itemsets?
  <a_1, …, a_100>: 1, <a_1, …, a_50>: 2
• What is the set of maximal frequent itemsets?
  <a_1, …, a_100>: 1
• We can assert that {a_2, a_45} is frequent, since a_2, a_45 ∈ <a_1, …, a_50>, but we cannot tell its actual support count from the maximal frequent itemset alone
• How many itemsets could potentially be generated in the worst case?
  When min_sup is low, there can be an exponential number of frequent itemsets
  Worst case: M^N, where M = # distinct items and N = max length of transactions
Summary • frequent pattern • k-itemset • (absolute) support, support count, relative support • min_sup, confidence • closed frequent itemset, maximal frequent itemset
Outline • Basic Concepts in Frequent Pattern Mining • Frequent Itemset Mining Methods • Pattern Evaluation Methods
Frequent Itemset Mining Methods • Apriori: A Candidate Generation-and-Test Approach • Improving the Efficiency of Apriori • FP-Growth: A Frequent Pattern-Growth Approach • ECLAT: Frequent Pattern Mining with Vertical Data Format
Apriori
• Downward Closure Property: any subset of a frequent itemset must be frequent
• e.g., if {beer, diaper, nuts} is frequent, so is {beer, diaper}, since every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
• Apriori employs a level-wise search in which frequent k-itemsets are used to explore (k+1)-itemsets. Steps (a sketch follows this slide):
  1. Scan the database once to get the frequent 1-itemsets L_1
  2. Join the frequent k-itemsets L_k with themselves to generate the length-(k+1) candidate itemsets C'_{k+1}
  3. Prune C'_{k+1} using the downward closure property (remove any candidate with an infrequent k-subset) to get C_{k+1}
  4. Scan (test) the database to count each candidate in C_{k+1} and obtain L_{k+1}
  5. Terminate when no frequent or candidate itemset can be generated
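A compact sketch of this level-wise loop (Python; it reuses the transactions dictionary from the first sketch, represents itemsets as frozensets, and its names and structure are illustrative rather than a reference implementation):

from itertools import combinations

def apriori(transactions, min_sup=0.5):
    # Returns {frozenset(itemset): relative support} for all frequent itemsets
    n = len(transactions)
    db = [frozenset(items) for items in transactions.values()]

    def frequent_among(candidates):
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_sup}

    # Step 1: scan once for the frequent 1-itemsets L1
    L = frequent_among({frozenset([i]) for t in db for i in t})
    frequent, k = dict(L), 1
    while L:
        # Step 2: join Lk with itself to form the length-(k+1) candidates C'(k+1)
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Step 3: prune candidates that have an infrequent k-subset (downward closure)
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        # Step 4: scan the database to count the survivors and obtain L(k+1)
        L = frequent_among(candidates)
        frequent.update(L)
        k += 1
    # Step 5: the loop ends once no candidate or frequent itemset remains
    return frequent

# With min_sup = 50% on the earlier table this yields
# {Beer}, {Nuts}, {Diaper}, {Eggs} and {Beer, Diaper}, matching the example above.
print(apriori(transactions, min_sup=0.5))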