Introduction to Data Mining CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Introduction Yale - Fall 2016 1 / 19
Outline
1. What is Data Mining?
   - From data to information
   - Predictive vs. descriptive information
   - Supervised vs. unsupervised learning
2. Data Mining Tasks
   - Classification & regression
   - Clustering & anomaly detection
   - Association rules & sequential patterns
   - Visualization & dimensionality reduction
3. Data mining process
4. Software for data mining
Recommended textbooks
What is data mining?

Data Mining: Non-trivial extraction of useful, new, hidden, and/or implicit information from data.

Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.

Deep Learning: A set of algorithms that attempt to model high-level data abstractions in data by using multiple processing layers, composed of multiple linear and non-linear transformations.

Big Data: Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.

Related terms: knowledge discovery in databases (KDD), pattern recognition, data warehousing, OLAP, ETL, IT, etc.
What is data mining? From data to information
[Figure: collected data → ⊆ $\mathbb{R}^{O(100+)}$ → …]
Examples of data mining tasks:
- Recommend movies on Netflix or books on Amazon.
- Object recognition in images and automatic image tagging.
- Community detection in social networks (e.g., Facebook).
- Automatic medical diagnosis and treatment recommendation.

Examples of data processing tasks that are not considered data mining:
- Signature-based anti-virus.
- Retrieving details from a contact list.
- Text-based search in a document or on the web.
- Quicksort, balanced trees, heaps, etc.
What is data mining? Predictive vs. descriptive methods

Predictive methods: Predict unknown information from known data.
- How much would my house sell for, based on sales stats?
- Will Bob like Ghostbusters, based on his Netflix history?

Descriptive methods: Infer or extract interpretable patterns to describe data.
- What consumer profiles should my ads target?
- If Jim’s card is trying to charge $300 in a Disney store today, is it reasonable or a fraud?
What is data mining? Supervised vs. unsupervised learning

Machine learning data analysis tasks are roughly divided into:
- Supervised learning: Inferring information from labeled training data.
- Unsupervised learning: Finding hidden patterns in unlabeled data.
- Semi-supervised learning: Combining information from labeled and unlabeled data to model and deduce information.
Data mining tasks: Classification

Classification: Classify “items” into a finite set of classes, or “categories”.

Training phase:
Labeled data $\{(x_1, \ell_1), \ldots, (x_n, \ell_n)\} \subset X \times L$ $\Longrightarrow$ classification model $F : X \to L$, with $F(x_i) = \ell_i$ and $|L| < \infty$.

Testing phase:
New data $y_1, y_2, \ldots \in X$ $\mapsto$ classification model $\Longrightarrow$ classification result $F(y_1), F(y_2), \ldots \in L$.
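The training and testing phases can be illustrated with a minimal sketch. It assumes a 1-nearest-neighbor rule, which is only one of many possible choices for the model F and is not prescribed by the slides. The function name `train_nearest_neighbor`, the 2D points, and the labels "a" and "b" are hypothetical.

```python
import math


def train_nearest_neighbor(labeled_data):
    """Training phase: build a model F: X -> L from pairs (x_i, l_i)."""
    examples = list(labeled_data)

    def F(x):
        # Predict the label of the closest training item (Euclidean distance).
        _, label = min(examples, key=lambda pair: math.dist(pair[0], x))
        return label

    return F


# Labeled data {(x_1, l_1), ..., (x_n, l_n)}; values are made up for illustration.
training = [
    ((0.0, 0.0), "a"),
    ((0.2, 0.1), "a"),
    ((1.0, 1.0), "b"),
    ((0.9, 1.2), "b"),
]
F = train_nearest_neighbor(training)

# Testing phase: new data y_1, y_2, ... mapped to labels in L.
new_items = [(0.1, 0.0), (1.1, 0.9)]
predictions = [F(y) for y in new_items]
print(predictions)
```

With distinct training points, this model reproduces every training label, matching the condition F(x_i) = ℓ_i above.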
Data mining tasks: Classification - examples
Example (MNIST digit classification)
Example (CalTech 101 image classification): Anchor, Joshua-Tree, Beaver, Lotus, Water-Lily
Data mining tasks: Regression

Regression: Compute (or infer) the value of a (piecewise) continuous function from a finite number of sampled “items” & values. This task is similar to classification, but here the model F can have an infinite range (e.g., $\mathbb{R}$ or $[0, 1]$).

Examples:
- Market pricing of a house/apartment/car based on its features.
- Trend line & model fitting from collected experimental data.
- Weather predictions, such as temperature and probability of rain/snow.
- Confidence rating in diagnostics (or binary classifier).
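As a rough illustration of the trend-line example, the sketch below fits a straight line by ordinary least squares. The linear model is an assumption made for brevity, and the function name `fit_line` and the area/price figures are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The returned model F has a continuous (infinite) range.
    return lambda x: slope * x + intercept


# Made-up samples: floor area (square meters) and price (thousands).
areas = [50.0, 70.0, 90.0, 110.0]
prices = [150.0, 210.0, 270.0, 330.0]

F = fit_line(areas, prices)
estimate = F(80.0)  # value inferred for an item that was not sampled
print(estimate)
```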
Data mining tasks: Clustering

Clustering: Group together similar “items” while separating ones that are different from each other. The quality of obtained clusters stems from their interpretability. Variations include a known or unknown number of clusters, as well as multiscale hierarchical clustering structures.

Examples:
- Clustering stocks to diversify stock market investment.
- Community detection in social networks by clustering profiles.
- Clustering genes and cells to uncover activities, reactions, and interactions.
- Network activity profiling by clustering packets/sessions.
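For the case of a known number of clusters, one common approach is k-means. The sketch below is a simplified version under stated assumptions: the first k points serve as initial centers (practical implementations usually initialize more carefully), and the function name `k_means` and the sample points are hypothetical.

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters by alternating assignment and averaging."""
    centers = [points[i] for i in range(k)]  # naive initialization (assumption)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[j]
            for j, cluster in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
    return labels, centers


# Made-up 2D items forming two well-separated groups.
points = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.2), (5.1, 4.9), (0.2, 0.1), (4.9, 5.2)]
labels, centers = k_means(points, k=2)
print(labels)
```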
Data mining tasks: Anomaly detection

Anomaly/outlier detection: Detect significant deviations from normal behavior expressed by inferred data patterns. The notion of “normal behavior” can be defined in several ways, such as clustering or model fitting.

Examples:
- Fraud detection in credit cards.
- Intrusion detection in cybersecurity.
- Detecting bot traffic in online advertising.
- Malfunction detection in process monitoring.
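One simple way to define “normal behavior” by model fitting is to summarize the data by its mean and standard deviation and flag values that lie far from the mean. The sketch below assumes this z-score rule. The function name `find_anomalies`, the threshold of 2 standard deviations, and the charge amounts are hypothetical.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return indices of values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical: nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


# Made-up card charges; the last one is far from the usual amounts.
charges = [10, 11, 9, 10, 12, 10, 11, 50]
anomalies = find_anomalies(charges)
print(anomalies)
```

Because extreme values also inflate the mean and standard deviation, this rule can miss anomalies in small samples; it is meant only to show the idea.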
Data mining tasks: Association rules

Association rule discovery: Produce dependency rules that model input co-occurrences of “items” to predict, given a partial “transaction”, the remaining “items” in it.

Training phase:
Observed transactions $T_1, \ldots, T_n \subseteq X$ $\Longrightarrow$ association rules $F : 2^X \to 2^X$, where $T \subseteq T_i \mapsto F(T) \approx T_i \setminus T$.

Testing phase:
Partial transactions $S_1, S_2, \ldots \subseteq X$ $\mapsto$ association rules $\Longrightarrow$ predicted information: $\forall i,\ S_i \mapsto F(S_i) \subseteq X \setminus S_i$.
Examples:
- Active advertisements & recommendations (e.g., “Users who liked/bought this product also liked/bought that product”).
- Support decision making on shelf organization in stores & supermarkets.
- Name completions in emails, social networks, etc.

Unlike classification, the actual testing phase is often less important than the discovered rules in this case.
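A minimal sketch of association rule discovery follows. It is restricted, as a simplifying assumption, to rules between single items, scored by confidence (the fraction of transactions containing the left-hand item that also contain the right-hand item). The function names `pair_rules` and `predict_remaining`, the 0.6 confidence cutoff, and the shopping baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_confidence=0.6):
    """Training phase: learn rules lhs -> rhs between single items."""
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = {}
    for (a, b), together in pair_counts.items():
        for lhs, rhs in ((a, b), (b, a)):
            confidence = together / item_counts[lhs]
            if confidence >= min_confidence:
                rules[(lhs, rhs)] = confidence
    return rules


def predict_remaining(partial, rules):
    """Testing phase: F(S) is a set of items outside the partial transaction S."""
    return {rhs for (lhs, rhs) in rules if lhs in partial and rhs not in partial}


# Made-up observed transactions T_1, ..., T_n.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"beer", "chips"},
    {"beer", "chips", "bread"},
]
rules = pair_rules(transactions)
predicted = predict_remaining({"butter"}, rules)
print(sorted(rules))
print(predicted)
```

Printing `rules` reflects the remark above: the discovered rules are often the main result, with prediction as a secondary use.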
Data mining tasks: Sequential patterns

Sequential pattern discovery: Given a set of ordered event sequences, produce rules to predict unknown/missing/future events from prior and/or subsequent events. Similar in some sense to association rule discovery, but with an order or timeline aspect to each transaction.

Examples:
- String mining:
  - Natural language processing
  - Gene sequencing in DNA and RNA
- Frequent item purchase sequences.
- Predicting outcomes of medical treatment.
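The ordering aspect can be sketched with a very small model that predicts the next event from the current one only. This first-order assumption is a simplification, since sequential pattern methods generally consider longer contexts. The function name `learn_next_event` and the purchase sequences are hypothetical.

```python
from collections import Counter, defaultdict


def learn_next_event(sequences):
    """Learn, for each event, the event that most often follows it."""
    follows = defaultdict(Counter)
    for seq in sequences:
        # Count each (current, next) pair of consecutive events.
        for current, nxt in zip(seq, seq[1:]):
            follows[current][nxt] += 1
    return {event: counts.most_common(1)[0][0] for event, counts in follows.items()}


# Made-up ordered purchase sequences.
purchases = [
    ["phone", "case", "charger"],
    ["phone", "case", "charger"],
    ["laptop", "mouse"],
    ["phone", "charger"],
]
next_event = learn_next_event(purchases)
print(next_event)
```

Unlike the unordered co-occurrence rules used for association, swapping the order of events within a sequence changes what this model learns.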