
DIPARTIMENTO DI INGEGNERIA INFORMATICA AUTOMATICA E GESTIONALE ANTONIO RUBERTI
Master of Science in Engineering in Computer Science (MSE-CS)
Seminars in Software and Services for the Information Society
Umberto Nanni
Introduction to Data Mining


  1. Introduction to Data Mining (title slide)

  2. Data Mining
• born before the Data Warehouse
• a collection of techniques from Artificial Intelligence, Pattern Recognition, and Statistics (e.g., genetic algorithms, fuzzy logic, expert systems, neural networks)
• targets:
  – descriptive goals: identify patterns of behavior, cause-effect relationships, classify individuals, etc.
  – predictive goals: predict trends, classify individuals according to risk, etc.

  3. Some applications of Data Mining
• Data Analysis and Decision Support Systems
• Market Analysis and Marketing
  – Target Marketing, Customer Relationship Management (CRM), Market Basket Analysis (MBA), market segmentation
• Analysis and risk management
  – reliability forecasts, user loyalty, quality control, ...
  – detection of frauds and unusual patterns (outliers)
• Text Mining
• Web Mining, ClickStream Analysis
• Genetic engineering, DNA interpretation, ...

  4. Data Mining: association rules
IF X (“the customer purchases beer”) THEN Y (“the customer purchases diapers”), written X → Y
Support (what fraction of all individuals follows the rule):
  s(X → Y) = |X ∩ Y| / |all| = F(X ∧ Y)
Confidence (what fraction of the individuals to whom the rule applies follows the rule):
  c(X → Y) = |X ∩ Y| / |X| = F(Y | X)
Application range: economics (e.g., market basket analysis), telecommunications, health care, …
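The two measures above can be computed directly from a transaction set. A minimal sketch, using an invented toy basket of transactions (the item names and counts are illustrative, not from the slides):

```python
# Support and confidence for an association rule X -> Y,
# computed over a toy set of market-basket transactions.

transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "bread"},
    {"diapers", "milk"},
]

def support(X, Y):
    """s(X -> Y) = |X and Y| / |all|"""
    both = sum(1 for t in transactions if X <= t and Y <= t)
    return both / len(transactions)

def confidence(X, Y):
    """c(X -> Y) = |X and Y| / |X|"""
    both = sum(1 for t in transactions if X <= t and Y <= t)
    has_x = sum(1 for t in transactions if X <= t)
    return both / has_x

print(support({"beer"}, {"diapers"}))     # 2/4 = 0.5
print(confidence({"beer"}, {"diapers"}))  # 2/3
```

Note that confidence conditions on the antecedent: the rule beer → diapers applies to 3 baskets and holds in 2 of them.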

  5. Data Mining: clustering
• identify similarities and spot heterogeneity in the distribution in order to define homogeneous groups (unsupervised learning)
• search for clusters based on
  – the distribution of the population
  – a notion of “distance”
Example: DFI – Disease-Free Interval (5 years) (collaboration with Ist. Regina Elena, Roma)
[Figure: clustering of patients by s-phase vs. 5-year disease-free interval]
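A distance-based clustering of the kind described above can be sketched with a bare-bones k-means loop (the points below are invented two-dimensional data, not the DFI dataset; initialization is deliberately naive):

```python
import math

def kmeans(points, k, iterations=20):
    """Minimal k-means: assign each point to the nearest centroid,
    then recompute centroids as cluster averages."""
    centroids = points[:k]  # naive initialization; real code would use k-means++
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(pts) for coord in zip(*pts)) if pts else centroids[i]
            for i, pts in enumerate(clusters)
        ]
    return clusters

# Two obvious groups in the plane (invented data):
points = [(0.1, 0.2), (0.0, 0.0), (0.2, 0.1),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
a, b = kmeans(points, 2)
print(sorted(map(len, (a, b))))  # [3, 3]
```

The "notion of distance" from the slide is the only problem-specific ingredient here; swapping `math.dist` for another metric changes which groups count as homogeneous.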

  6. Data Mining: decision tree
Determine the causes of an interesting phenomenon (with a set of output values), sorted by relevance
  – internal node: attribute to be appraised
  – branching: value (or value interval) of the attribute
  – leaf: one of the possible output values
Example: will the customer buy a computer?
  age?
  – <=30: student? (no → no; yes → yes)
  – 30..40: yes
  – >40: credit? (low → no; high → yes)
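The example tree above translates directly into nested conditionals. A sketch, hand-coding the slide's tree rather than learning it from data (attribute names are assumptions for illustration):

```python
def will_buy_computer(age, student, credit):
    """Hand-coded version of the slide's decision tree (not learned from data)."""
    if age <= 30:
        return student            # yes iff the customer is a student
    elif age <= 40:               # the 30..40 branch always predicts yes
        return True
    else:
        return credit == "high"   # >40: decided by the credit attribute

print(will_buy_computer(25, student=True, credit="low"))   # True
print(will_buy_computer(45, student=False, credit="low"))  # False
```

An induction algorithm such as ID3 or C4.5 would pick the root attribute (here: age) as the most relevant split, which is what "sorted by relevance" refers to.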

  7. Data Mining: time sequences
• spot recurrent / unusual patterns in time sequences
• feature prediction
Example (Least Cost Routing): routing a telephone call over the cheapest available connection (cooperation with Between – consulting firm)
KEY QUESTION: given an outbound call from an internal line X toward an external number Y, how long will the call last?
[Figure: rate structures as cost vs. call duration: connection fee, flat rate]
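Once the predicted duration is available, the routing decision itself is a simple cost minimization over the rate structures shown in the figure. A sketch with invented carriers and tariffs (all numbers are assumptions, not the Between case data):

```python
# Hypothetical rate plans: either cost = connection_fee + per_minute * duration,
# or a flat rate independent of duration.
carriers = {
    "A": {"connection_fee": 0.10, "per_minute": 0.05},
    "B": {"connection_fee": 0.00, "per_minute": 0.08},
    "C": {"flat": 0.60},
}

def call_cost(plan, minutes):
    if "flat" in plan:
        return plan["flat"]
    return plan["connection_fee"] + plan["per_minute"] * minutes

def cheapest_carrier(predicted_minutes):
    """Least Cost Routing: pick the carrier minimizing the expected cost,
    given the predicted call duration."""
    return min(carriers, key=lambda c: call_cost(carriers[c], predicted_minutes))

print(cheapest_carrier(2))   # short call: "B" (0.16 vs A 0.20 vs C 0.60)
print(cheapest_carrier(30))  # long call: "C" flat rate (0.60 vs A 1.60 vs B 2.40)
```

This is why the duration prediction is the key question: a wrong estimate flips which side of the crossover point the call falls on.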

  8. Neural Networks
Problem: can you write a program which recognizes human handwriting of capital letters... ?

  9. Data Mining: “interesting” results
[Figure: confusion matrix of predicted value vs. effective value; diagonal cells right, off-diagonal cells wrong]
• Simplicity. For example:
  – length of rules (association rules)
  – size (decision tree)
• Certainty. For example:
  – confidence (association rules): c(X → Y) = #(X and Y) / #(X)
  – reliability of classification
• Usefulness. For example:
  – support (association rules): s(X → Y) = #(X and Y) / #(ALL)
• Novelty. For example:
  – not known previously
  – surprising
  – subsumption of other rules (included as special cases)

  10. Confusion matrix [figure]

  11. Confusion matrix & terminology
Positive (P), Negative (N); True Positive (TP), True Negative (TN); False Positive (FP), False Negative (FN)
True Positive Rate [sensitivity, recall]: TPR = TP / P = TP / (TP + FN)
False Positive Rate: FPR = FP / N = FP / (FP + TN)
Accuracy: ACC = (TP + TN) / (P + N)
Specificity (True Negative Rate): SPC = TN / N = TN / (FP + TN) = 1 − FPR
Positive Predictive Value [precision]: PPV = TP / (TP + FP)
Negative Predictive Value: NPV = TN / (TN + FN)
False Discovery Rate: FDR = FP / (FP + TP)
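All of these rates are simple ratios of the four confusion-matrix counts, so they fit in one small function (the counts in the example are invented for illustration):

```python
def metrics(tp, fn, fp, tn):
    """All the slide's rates from the four confusion-matrix counts."""
    p, n = tp + fn, fp + tn
    return {
        "TPR": tp / p,               # sensitivity / recall
        "FPR": fp / n,
        "ACC": (tp + tn) / (p + n),
        "SPC": tn / n,               # specificity = 1 - FPR
        "PPV": tp / (tp + fp),       # precision
        "NPV": tn / (tn + fn),
        "FDR": fp / (fp + tp),
    }

m = metrics(tp=40, fn=10, fp=20, tn=30)
print(m["TPR"])  # 40/50 = 0.8
print(m["ACC"])  # 70/100 = 0.7
```

Note the two families of denominators: TPR/FPR/SPC condition on the true class (rows), while PPV/NPV/FDR condition on the predicted class (columns).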

  12. ROC curve
Receiver Operating Characteristic (from signal detection theory): a fundamental tool for the evaluation of a learning algorithm.
Y axis: True Positive Rate (Sensitivity)
X axis: False Positive Rate (1 − Specificity)
Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.
The Area Under the ROC Curve (AUC) is a measure of how well a parameter can distinguish between two groups (YES/NO decision).

  13. ROC curve: examples [figures]

  14. Mining Rules from Databases – Algorithm: APRIORI
Rakesh Agrawal, Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. 20th International Conference on Very Large Data Bases (VLDB), pp. 487–499, Santiago, Chile, September 1994.
APRIORI Algorithm:
 1. L1 = { large 1-itemsets }
 2. for ( k = 2; Lk−1 ≠ ∅; k++ ) do begin
 3.    Ck = apriori-generate(Lk−1)   // generation: candidates extending the previous itemsets
 4.    forall transactions t ∈ D do begin
 5.       Ct = subset(Ck, t)         // candidates contained in t
 6.       forall candidates c ∈ Ct do
 7.          c.count++
 8.    end
 9.    Lk = { c ∈ Ck | c.count ≥ minsupport }   // pruning
10. end
11. ANSWER = ∪k Lk
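The pseudocode above can be sketched in a few lines of Python. This is a minimal, unoptimized rendering (the real algorithm uses a hash tree for the `subset` step); the toy transactions are invented for illustration:

```python
from itertools import combinations

def apriori(transactions, minsupport):
    """Return all itemsets contained in at least `minsupport` transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # L1 = { large 1-itemsets }
    items = {i for t in transactions for i in t}
    L = {frozenset([i]) for i in items if count(frozenset([i])) >= minsupport}
    answer = set(L)
    k = 2
    while L:
        # apriori-generate: join pairs of large (k-1)-itemsets into k-itemsets...
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # ...and keep only candidates whose every (k-1)-subset is large
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # pruning: Lk = { c in Ck | c.count >= minsupport }
        L = {c for c in candidates if count(c) >= minsupport}
        answer |= L
        k += 1
    return answer  # ANSWER = union over k of Lk

large = apriori(
    [{"beer", "diapers"}, {"beer", "diapers", "milk"}, {"beer"}, {"milk"}],
    minsupport=2,
)
print(sorted(sorted(s) for s in large))
# [['beer'], ['beer', 'diapers'], ['diapers'], ['milk']]
```

The key idea the pseudocode relies on is the anti-monotonicity of support: any subset of a large itemset is large, so candidates can be generated only by extending the previous level.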
