Data Mining Concepts - Duen Horng (Polo) Chau, Assistant Professor

http://poloclub.gatech.edu/cse6242
CSE6242 / CX4242: Data & Visual Analytics
Data Mining Concepts
Duen Horng (Polo) Chau
Assistant Professor
Associate Director, MS Analytics
Georgia Tech
Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

http://www.amazon.com/Data-Science-Business-data-analytic-thinking/dp/1449361323

1. Classification (or Probability Estimation)
Predict which of a (small) set of classes an entity belongs to.

(From previous class)
1. Classification (or Probability Estimation)
Predict which of a (small) set of classes an entity belongs to.
• email spam (y, n)
• sentiment analysis (+, -, neutral)
• news (politics, sports, …)
• medical diagnosis (cancer or not)
• shirt size (s, m, l)
• face/cat detection
• face detection (baby, middle-aged, etc.)
• buy / not buy (commerce)
• fraud detection
• census: gender

(From previous class)
1. Classification (or Probability Estimation)
Predict which of a (small) set of classes an entity belongs to.
• cancer testing (yes, no)
• movie genre (action, drama, etc.)
• sports (win, loss)
• email spam filter (spam or not)
• gesture detection (pinch, swipe, …)
• planet zone (habitable or not)
• gene prediction
• news types (sports, entertainment)
• virus scanning (malware or not)
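A minimal sketch of classification on the spam example above, using a k-nearest-neighbors vote. The two features (number of links, number of exclamation marks) and all data points are made up for illustration; k-NN is just one of many possible classifiers, not a method named in the slides.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Return the majority class among the k training points nearest to query."""
    # train is a list of (feature_vector, label) pairs
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical features per email: (number of links, number of exclamation marks)
emails = [
    ((8, 6), "spam"), ((7, 9), "spam"), ((9, 5), "spam"),
    ((0, 1), "not spam"), ((1, 0), "not spam"), ((2, 1), "not spam"),
]

prediction = knn_predict(emails, (6, 7))
print(prediction)
```

The output is one of a small, fixed set of labels, which is what separates classification from regression.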

2. Regression (“value estimation”)
Predict the numerical value of some variable for an entity.

(From previous class)
2. Regression (“value estimation”)
Predict the numerical value of some variable for an entity.
• stock value
• real estate
• wine valuation
• food/commodity
• sports betting
• movie ratings
• forex
• any product sales
• energy
• spread of diseases (regression + geographical info)
• web/commerce traffic
• GPA (given how much time is put in?)

(From previous class)
2. Regression (“value estimation”)
Predict the numerical value of some variable for an entity.
• point value of wine (50-100)
• credit score (start with classification: default or not)
• stock prices (Wall Street)
• relationship between price and sales
• weather
• sports and game scores
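A minimal sketch of regression on the GPA example above, fitting a straight line by ordinary least squares. The study hours and GPA values are made up (and deliberately lie on a perfect line so the result is easy to check); real data would be noisy.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: weekly study hours and resulting GPA
hours = [5, 10, 15, 20]
gpa = [2.5, 3.0, 3.5, 4.0]

slope, intercept = fit_line(hours, gpa)
predicted = slope * 12 + intercept  # estimated GPA for 12 hours per week
print(round(predicted, 2))
```

Unlike classification, the prediction is a number on a continuous scale rather than a class label.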

3. Similarity Matching
Find similar entities (from a large dataset) based on what we know about them.

(From previous class)
3. Similarity Matching
Find similar entities (from a large dataset) based on what we know about them.
• dating
• recommender system (movies, items)
• customers in marketing
• price comparison (consumer, find similarly priced items)
• comparing emails to decide whether they're spam
• facebook/trend/friends suggestions
• finding employees
• similar youtube videos (e.g., more cat videos)
• similar web pages (find near duplicates or representative sites) ~= clustering
• plagiarism detection

(From previous class)
3. Similarity Matching
Find similar entities (from a large dataset) based on what we know about them.
• recommending items you may want to buy
• finding similar gene sequences (that may repeat or do similar things)
• online dating
• building auditing (energy consumption)
• patent search
• carpool matching (find people to carpool with)
• detecting fake identities
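A minimal sketch of similarity matching on the "similar youtube videos" example, using cosine similarity between feature vectors. The video names and the three topic features (cats, music, sports) are hypothetical, and cosine similarity is one common choice of measure among several.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, catalog):
    """Return the name of the catalog entry most similar to query."""
    return max(catalog, key=lambda name: cosine_similarity(query, catalog[name]))

# Hypothetical topic weights per video: (cats, music, sports)
videos = {
    "cat_video_2": (0.9, 0.1, 0.0),
    "concert_clip": (0.1, 0.9, 0.0),
    "match_highlights": (0.0, 0.2, 0.8),
}

just_watched = (1.0, 0.0, 0.1)
print(most_similar(just_watched, videos))
```

On a large dataset the linear scan in `most_similar` would typically be replaced by an index built for nearest-neighbor search.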

4. Clustering (unsupervised learning)
Group entities together by their similarity. (User provides # of clusters)

(From previous class)
4. Clustering (unsupervised learning)
Group entities together by their similarity. (User provides # of clusters)
• groupings of similar bugs in code
• optical character recognition
• unknown vocabulary
• topical analysis (tweets?)
• land cover: tree/road/…
• for advertising: grouping users for marketing purposes
• fireflies clustering
• speaker recognition (multiple people in same room)
• astronomical clustering

(From previous class)
4. Clustering (unsupervised learning)
Group entities together by their similarity. (User provides # of clusters)
• cluster people into demographic groups (young, old, etc.)
• cluster people by accents (y’all, you all)
• hierarchical clustering for metabolomics
• clustering images on the web (cat?)
• ~= dimensionality reduction
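A minimal sketch of clustering on the "young, old" demographics example, using k-means on a single feature. The ages are made up, the number of clusters k is supplied by the user as the slide notes, and the naive initialization (first k values) is an assumption chosen to keep the example deterministic.

```python
def kmeans_1d(values, k, iterations=10):
    """Group numbers into k clusters by repeatedly assigning each value
    to its nearest centroid and recomputing the centroids."""
    centroids = list(values[:k])  # naive init: the first k values
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Recompute each centroid; keep the old one if its cluster is empty
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical ages of six people
ages = [18, 21, 19, 65, 70, 68]
centroids, clusters = kmeans_1d(ages, k=2)
print(clusters)
```

No labels are given to the algorithm; the "young" and "old" groups emerge only from how close the values are to each other.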

5. Co-occurrence grouping
(Many names: frequent itemset mining, association rule discovery, market-basket analysis)
Find associations between entities based on transactions that involve them (e.g., bread and milk often bought together)
http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
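A minimal sketch of co-occurrence grouping on the bread-and-milk example: count how often each pair of items appears in the same transaction and keep pairs whose support (fraction of transactions containing both) meets a threshold. The baskets and the 0.5 threshold are made up, and this brute-force pair count stands in for real frequent-itemset algorithms.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs appearing together in at
    least min_support (a fraction) of the transactions."""
    counts = Counter()
    for basket in transactions:
        # Sort so that each pair has one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical shopping baskets
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk", "butter"],
    ["eggs", "butter"],
]

pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)
```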

6. Profiling / Pattern Mining / Anomaly Detection (unsupervised)
Characterize typical behaviors of an entity (person, computer router, etc.) so you can find trends and outliers.
Examples?
• computer instruction prediction
• removing noise from experiment (data cleaning)
• detect anomalies in network traffic
• moneyball
• weather anomalies (e.g., big storm)
• google sign-in (alert)
• smart security camera
• embezzlement
• trending articles
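A minimal sketch of anomaly detection in the spirit of the sign-in alert example: profile typical behavior by the mean and standard deviation, then flag values that sit far from the mean. The daily sign-in counts and the threshold of 2 standard deviations are assumptions for illustration.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical number of sign-ins per day for one account
signins = [10, 12, 11, 9, 10, 11, 50]
print(zscore_outliers(signins))
```

Because an extreme value also inflates the mean and standard deviation, this simple rule can miss outliers in small samples; median-based measures are a common, more robust alternative.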

7. Link Prediction / Recommendation
Predict if two entities should be connected, and how strongly that link should be.
• linkedin/facebook: people you may know
• amazon/netflix: because you like terminator… suggest other movies you may also like
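A minimal sketch of link prediction for the "people you may know" example, scoring each non-friend by the number of friends shared with the user. The friendship graph is made up, and counting common neighbors is only one simple scoring heuristic; here the count doubles as the predicted link strength.

```python
def suggest_friends(graph, person):
    """Rank people not yet connected to `person` by number of shared friends."""
    scores = {}
    for other in graph:
        if other == person or other in graph[person]:
            continue  # skip the person and existing friends
        shared = graph[person] & graph[other]
        if shared:
            scores[other] = len(shared)
    # Highest score first; ties broken alphabetically
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical undirected friendship graph: name -> set of friends
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"alice", "bob", "dave"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"dave"},
}

print(suggest_friends(friends, "alice"))
```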

8. Data reduction (“dimensionality reduction”)
Shrink a large dataset into a smaller one, with as little loss of information as possible
1. if you want to visualize the data (in 2D/3D)
2. faster computation/less storage
3. reduce noise
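A minimal sketch of dimensionality reduction: project 2D points onto the single direction along which they vary the most (the first principal component), turning each point into one number. This closed-form version handles only the two-feature case and uses made-up points; it is an illustration of the idea, not a general PCA implementation.

```python
import math

def pca_2d_to_1d(points):
    """Project 2D points onto their direction of greatest variance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 covariance matrix
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    # Angle of the principal axis of a symmetric 2x2 matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, projected

# Hypothetical points that lie exactly on the line y = 2x
points = [(1, 2), (2, 4), (3, 6), (4, 8)]
direction, projected = pca_2d_to_1d(points)
print(direction, projected)
```

Since these points lie on a line, one coordinate per point preserves all of their spread; with noisy data some information would be lost, which is the trade-off the slide describes.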

Start Thinking About Project!
• What problems do you want to solve?
• Using what (large) datasets?
• What techniques do you need?
